Introduction
Every AI agent has the same fundamental limitation: it forgets. Each conversation starts from zero. The research it did yesterday, the preferences it learned, the documents it analyzed – all gone when the session ends. Context windows are getting larger (Gemini's 2M tokens, Claude's 200K), but they're still finite, expensive to fill, and ephemeral.
What if your AI agent could build a permanent, searchable knowledge base from everything it encounters? Every web page it researches, every PDF it reads, every conversation insight – indexed, embedded, and retrievable in milliseconds. That's what we're building in this guide.
The stack is three technologies working together:
- MCP (Model Context Protocol) – Anthropic's open standard that lets AI agents connect to external tools and data sources through a universal interface
- RAG (Retrieval-Augmented Generation) – The technique of finding relevant information and injecting it into the AI's prompt before it answers
- Vector Databases – Purpose-built storage that understands semantic similarity, not just keyword matching
By the end, you'll have a working system where your AI agent can store anything it learns and recall it when needed – across sessions, across topics, growing smarter over time.
The Problem: AI Amnesia
Let's be concrete about what we're solving. Consider an AI agent like OpenClaw running Claude:
- Monday: You ask it to research vector databases. It reads 15 articles, compares 8 products, writes a summary. Brilliant work.
- Wednesday: You ask "what did you find about Qdrant vs ChromaDB?" – blank stare. New session. All gone.
- Friday: You ask it to write a blog post about vector databases. It starts from scratch, potentially reaching different conclusions than Monday.
The agent has no persistent memory. It's like having an incredibly smart colleague who gets amnesia every time they leave the room. File-based memory (like markdown notes) helps, but it doesn't scale – you can't semantically search through 10,000 markdown files efficiently.
MCP: The Universal Plug for AI Agents
What Is MCP?
The Model Context Protocol (MCP) is an open standard released by Anthropic in November 2024. Think of it as USB-C for AI – a single, universal interface that connects any AI model to any external tool or data source.[1]
Before MCP, connecting an AI agent to, say, a database required building a custom integration for each combination of AI model and data source. If you had 5 AI models and 10 data sources, that's 50 custom integrations. MCP collapses this to 5 + 10 = 15: each model implements MCP once, each data source implements MCP once, and they all work together.
How MCP Works
MCP uses a client-server architecture built on JSON-RPC 2.0, inspired by the Language Server Protocol (LSP) that powers IDE features like autocomplete:[2]
- MCP Client – Lives inside the AI application (Claude Desktop, OpenClaw, Cursor, etc.). It discovers and calls tools exposed by servers.
- MCP Server – A lightweight process that exposes specific capabilities. It can provide tools (functions the AI can call), resources (data the AI can read), and prompts (templates for common operations).
- Transport – Communication happens over stdio (local processes) or SSE/HTTP (remote servers).
When Claude Desktop connects to an MCP server, it discovers what tools are available and can call them as needed during conversation. The AI decides when to use a tool based on the user's question โ it's not hardcoded.
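To make the discovery step concrete, here is roughly what the JSON-RPC traffic looks like. This is an illustrative sketch of the message shapes only: the method names (tools/list, tools/call) come from the MCP spec, but the search tool shown here is hypothetical.

# Illustrative only: the JSON-RPC 2.0 messages an MCP client and server
# exchange over the stdio transport (one JSON object per line).
import json

list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

list_response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {"tools": [{
        "name": "search",
        "description": "Search the knowledge base",
        "inputSchema": {"type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"]}
    }]}
}

call_request = {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                "params": {"name": "search",
                           "arguments": {"query": "qdrant vs chromadb"}}}

print(json.dumps(call_request))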
Why Anthropic Created MCP
The origin story is pragmatic. Developer David Soria Parra was frustrated with constantly copying context between tools and AI assistants.[3] Every integration was a one-off. MCP emerged as the universal solution – build the connector once, use it everywhere. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF) under the Linux Foundation, co-founded with Block and OpenAI, signaling it's becoming an industry standard, not just an Anthropic project.[2]
MCP vs Traditional API Integrations
| Aspect | Traditional APIs | MCP |
|---|---|---|
| Discovery | Manual – read docs, write code | Automatic – AI discovers available tools |
| Integration effort | Custom per API × per AI model | Implement once per side |
| Security | Varies wildly | Standardized capability negotiation |
| Composability | Each integration is isolated | AI can combine tools from multiple servers |
| Ecosystem | Fragmented | Growing – 10,000+ community servers |
Existing MCP Servers for Knowledge & Memory
The MCP ecosystem already has servers relevant to building a knowledge base:
- mcp-server-qdrant – Official Qdrant MCP server. Store and retrieve semantic memories. Configure collection name, embedding model, and search parameters. Drop-in knowledge base.[4]
- rag-mcp-server (PyPI) – Generic RAG server supporting multiple embedding models including multilingual options. Initialize a knowledge base from a directory, then search it.[5]
- mcp-server-filesystem – Read/write local files. The simplest "memory" – save notes as files. Limited to keyword search.
- postgres-mcp – Full PostgreSQL access via MCP. Combined with pgvector, this becomes a knowledge base with SQL power.[6]
- knowledge-rag (Lobehub) – Document ingestion + vector search with configurable chunking.
The fastest way to get started: install mcp-server-qdrant, run Qdrant locally via Docker, and configure it in Claude Desktop or OpenClaw. Your agent can immediately start storing and retrieving information semantically. We'll build a custom server later for more control.
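If you want to try that route before building anything custom, the Claude Desktop entry looks roughly like this. Treat it as a sketch: the command and environment variable names below are assumptions based on typical mcp-server-qdrant setups, so check the server's README for the exact values.

{
  "mcpServers": {
    "qdrant-memory": {
      "command": "uvx",
      "args": ["mcp-server-qdrant"],
      "env": {
        "QDRANT_URL": "http://localhost:6333",
        "COLLECTION_NAME": "agent_memory"
      }
    }
  }
}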
RAG: How Retrieval-Augmented Generation Works
RAG is the technique that makes this all work. Instead of hoping the AI "knows" something from training, you find relevant information and inject it into the prompt before the AI generates a response. It's like giving the AI an open-book exam instead of a closed-book one.
The RAG pipeline has two phases:
Ingestion Pipeline (Offline)
- Ingest – Load documents (PDFs, web pages, markdown, tweets, code files)
- Chunk – Split documents into smaller pieces (typically 200–1000 tokens each)
- Embed – Convert each chunk into a high-dimensional vector using an embedding model
- Store – Save vectors + original text + metadata in a vector database
Query Pipeline (Online)
- Embed query – Convert the user's question into a vector using the same embedding model
- Search – Find the most similar vectors in the database (cosine similarity / dot product)
- Rerank (optional) – Use a cross-encoder to re-score results for better precision
- Augment – Inject the top-k retrieved chunks into the LLM prompt as context
- Generate – The LLM answers using both its training knowledge and the retrieved context (a minimal end-to-end sketch of this phase follows below)
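Here is a minimal sketch of that query phase in Python, using the same stack we set up later in this guide (ChromaDB + OpenAI embeddings). The collection name, ./knowledge_db path, and model names match the server built below; the optional rerank step is omitted.

# Minimal RAG query loop: embed → search → augment → generate.
# Assumes OPENAI_API_KEY is set and ./knowledge_db was populated at ingestion.
import chromadb
from openai import OpenAI

client = OpenAI()
collection = chromadb.PersistentClient(path="./knowledge_db") \
    .get_or_create_collection("knowledge_base")

def answer(question: str, k: int = 5) -> str:
    # 1. Embed the query with the SAME model used at ingestion time.
    q_vec = client.embeddings.create(model="text-embedding-3-small",
                                     input=[question]).data[0].embedding
    # 2. Retrieve the top-k most similar chunks.
    hits = collection.query(query_embeddings=[q_vec], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    # 3. Augment the prompt with the retrieved context, then 4. generate.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "Say so if the context is insufficient."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}])
    return reply.choices[0].message.content

print(answer("What did we learn about Qdrant vs ChromaDB?"))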
Chunking Strategies
How you split documents determines retrieval quality more than almost any other decision. Get this wrong and your agent retrieves garbage; get it right and it's magic. The table below compares the main strategies; a minimal heading-aware splitter is sketched after it.
| Strategy | How It Works | Best For | Pitfall |
|---|---|---|---|
| Fixed-size | Split every N tokens/characters with overlap | Simple, predictable. Good starting point | Cuts mid-sentence, breaks context |
| Recursive | Split by paragraphs → sentences → words, recursively | Most content types. LangChain default | Still structural, not semantic |
| Semantic | Embed sentences, split where similarity drops | Long documents with topic shifts | Slower, requires embedding calls during ingestion |
| Heading-aware | Split on markdown/HTML headers | Structured docs, READMEs, wikis | Sections may be too large or too small |
| Agentic / Proposition | LLM extracts atomic facts from text | Highest quality, research applications | Expensive – requires LLM call per chunk |
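To make one row concrete, here's a minimal heading-aware splitter for markdown. It's a sketch, not production code: it cuts on headers and falls back to fixed-size windows when a section runs long.

# Heading-aware chunking sketch: split a markdown document at headings,
# then fall back to fixed-size windows for any oversized section.
import re

def split_markdown(text: str, max_chars: int = 2000) -> list[str]:
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)  # cut before each heading
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks += [section[i:i + max_chars]
                       for i in range(0, len(section), max_chars)]
    return chunks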
Embedding Models
The embedding model converts text into vectors. The quality of these vectors directly determines retrieval accuracy. Here's the current landscape:
| Model | Dimensions | Context | Cost | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | $0.02 / 1M tokens | Best value. Great for most use cases |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | $0.13 / 1M tokens | Higher quality, 6.5× more expensive |
| Voyage-3-large | 1024 | 32K tokens | $0.18 / 1M tokens | Top MTEB scores, outperforms OpenAI by 9–20%[7] |
| Cohere embed-v3 | 1024 | 512 tokens | $0.10 / 1M tokens | Built-in search/classification modes |
| nomic-embed-text (local) | 768 | 8192 tokens | Free (local) | Best open-source option. Runs on CPU |
| BGE-large-en-v1.5 (local) | 1024 | 512 tokens | Free (local) | Strong MTEB scores, BAAI model |
| Granite-embedding-278m | 768 | 512 tokens | Free (local) | IBM model, multilingual support |
If you want zero API cost and full local privacy, run nomic-embed-text via Ollama locally.
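As a rough sketch of what the local route looks like (assuming Ollama is running on its default port and you've pulled the model with ollama pull nomic-embed-text):

# Get a 768-dimensional embedding from a local Ollama instance.
import requests

def embed_local(text: str) -> list[float]:
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": "nomic-embed-text", "prompt": text},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["embedding"]

print(len(embed_local("pgvector supports HNSW indexes")))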
Retrieval Strategies & Reranking
Similarity Search
The simplest approach: embed the query, find the K nearest vectors by cosine similarity. Fast, works well for straightforward questions. This is what 90% of RAG implementations use.
Hybrid Search (BM25 + Vector)
Combines traditional keyword matching (BM25/TF-IDF) with semantic vector search. This catches cases where the user asks for a specific term that semantic search might miss. For example, searching for "pgvector": keyword search finds exact matches, while vector search finds conceptually similar content about "PostgreSQL vector extension."
Weaviate and Qdrant have hybrid search built-in. For others, you implement it by running both searches and merging results with reciprocal rank fusion (RRF).
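If your database doesn't do the fusion for you, RRF is about ten lines. A sketch (k=60 is the commonly used constant; the document IDs are whatever your store returns):

def rrf_merge(keyword_ids: list[str], vector_ids: list[str],
              k: int = 60, top_n: int = 10) -> list[str]:
    """Merge a BM25 ranking and a vector ranking with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "doc_pgvector" wins: first in keyword search, third in vector search.
print(rrf_merge(["doc_pgvector", "doc_postgres", "doc_hnsw"],
                ["doc_embeddings", "doc_hnsw", "doc_pgvector"]))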
Reranking
After initial retrieval (say, top 20 results), a cross-encoder reranker scores each result against the original query more carefully. This is slower but dramatically improves precision. Popular rerankers:
- Cohere Rerank – API-based, easy to use, excellent quality
- bge-reranker-v2-m3 – Open-source, run locally
- Pinecone Rerank – Built into Pinecone's pipeline
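Running the open-source option above locally is a few lines with sentence-transformers (a sketch; the model weights download on first use):

# Rerank ~20 initial candidates with a cross-encoder, keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
    return [p for _, p in ranked[:top_n]]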
Vector Database Comparison
This is the most-asked question in the RAG space. Here's an honest comparison based on real-world usage, community sentiment, and benchmarks:
| Database | Type | Language | Hybrid Search | Best For | Free Tier |
|---|---|---|---|---|---|
| ChromaDB | Embedded | Python | Basic | Prototyping, small projects | Fully open-source |
| Qdrant | Client-server | Rust | Yes (sparse + dense) | Production local/cloud, fast filtering | Open-source + 1GB cloud free |
| Weaviate | Client-server | Go | Built-in (best) | Knowledge graphs, hybrid search | Open-source + sandbox |
| Pinecone | Managed cloud | – | Yes | Zero-ops, scaling without thinking | Serverless free tier |
| pgvector | PG extension | C | With tsvector | Already using PostgreSQL | Free (PG extension) |
| Milvus | Distributed | Go/C++ | Yes | Enterprise, billions of vectors | Open-source (Zilliz cloud) |
| FAISS | Library | C++ (Python) | No | Research, in-memory speed | Free (Meta library) |
| LanceDB | Embedded | Rust | Yes | Serverless, multimodal | Fully open-source |
Which Should You Pick?
Just Getting Started – ChromaDB
pip install chromadb and you're running. No Docker, no config. Perfect for prototyping your RAG pipeline. Limitation: not designed for production scale or concurrent access.
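A taste of how little code that is (ChromaDB's default embedding function runs a small local model, so this sketch needs no API key):

import chromadb

client = chromadb.PersistentClient(path="./demo_db")
collection = client.get_or_create_collection("notes")

collection.add(ids=["note-1"],
               documents=["pgvector supports HNSW and IVFFlat indexes."],
               metadatas=[{"source": "manual"}])

results = collection.query(query_texts=["which indexes does pgvector have?"],
                           n_results=1)
print(results["documents"][0][0])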
Local Production – Qdrant
docker run -p 6333:6333 qdrant/qdrant and you have a production-grade vector DB running locally. Written in Rust, so it's fast. Has an official MCP server. Best balance of ease-of-use, performance, and features. The Reddit consensus backs this.[8]
Already Using Postgres – pgvector
Don't add another database. Install the extension: CREATE EXTENSION vector;. Your vectors live alongside your relational data. Supports HNSW and IVFFlat indexes. Performance is good for up to ~50-100M vectors.
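For orientation, the whole flow in Python looks roughly like this. It's a sketch using psycopg and the pgvector helper package; the connection string, table name, and 1536-dimension column (matching text-embedding-3-small) are assumptions for illustration.

# pip install psycopg pgvector
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

query_vector = np.random.rand(1536).astype(np.float32)  # stand-in for a real embedding

with psycopg.connect("postgresql://localhost/knowledge") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)
    conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
                        id bigserial PRIMARY KEY,
                        content text,
                        embedding vector(1536))""")
    # HNSW index for approximate nearest-neighbor search on cosine distance.
    conn.execute("CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
                 "ON chunks USING hnsw (embedding vector_cosine_ops)")
    # <=> is cosine distance: smaller means more similar.
    rows = conn.execute("SELECT content FROM chunks "
                        "ORDER BY embedding <=> %s LIMIT 5",
                        (query_vector,)).fetchall()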
The Complete Architecture
Here's what we're building – an AI agent (OpenClaw/Claude) with persistent, searchable memory that grows over time:
Data Pipeline: Ingest → Chunk → Embed → Store
When the agent encounters useful information (articles, docs, conversations), it calls the MCP server's ingest tool:
- Ingest: Accept raw text, URL, file path, or HTML
- Parse: Extract clean text (strip HTML, extract PDF text)
- Chunk: Split into ~512-token pieces with overlap
- Embed: Send chunks to embedding API, get vectors back
- Store: Upsert vectors + text + metadata into vector DB
Query Pipeline: Question → Retrieve → Augment
When the agent needs to answer a question, it calls the search tool:
- Embed query: Convert the question to a vector
- Search: Find top-K similar chunks (K=5-10)
- Filter: Apply metadata filters (date range, source type)
- Return: Send relevant chunks back to the agent as context
Getting Started: The Simplest Setup
Let's build the simplest possible knowledge base: ChromaDB + OpenAI embeddings + a custom MCP server. Total setup time: ~15 minutes.
Prerequisites
- Python 3.10+
- An OpenAI API key (for embeddings)
- Claude Desktop, OpenClaw, or any MCP-compatible client
Step 1: Install Dependencies
pip install chromadb openai "mcp[cli]" beautifulsoup4 PyPDF2
Step 2: Create the MCP Server
This server exposes three tools: ingest_text, ingest_url, and search.
Complete Code: Knowledge Base MCP Server
# knowledge_mcp_server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import chromadb
from openai import OpenAI
import hashlib
from bs4 import BeautifulSoup
import urllib.request
# --- Configuration ---
EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "knowledge_base"
CHUNK_SIZE = 512 # tokens (roughly 4 chars per token)
CHUNK_OVERLAP = 50 # token overlap between chunks
TOP_K = 8 # number of results to return
# --- Init ---
openai_client = OpenAI() # uses OPENAI_API_KEY env var
chroma_client = chromadb.PersistentClient(path="./knowledge_db")
collection = chroma_client.get_or_create_collection(
name=COLLECTION_NAME,
metadata={"hnsw:space": "cosine"}
)
server = Server("knowledge-base")
def chunk_text(text: str, chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks by approximate token count."""
    # Normalize whitespace so character offsets are stable.
    text_clean = " ".join(text.split())
    # ~4 characters per token is a rough heuristic for English text.
    chars_per_chunk = chunk_size * 4
    chars_overlap = overlap * 4
    chunks = []
    start = 0
    while start < len(text_clean):
        chunk = text_clean[start:start + chars_per_chunk]
        if chunk.strip():
            chunks.append(chunk.strip())
        start += chars_per_chunk - chars_overlap
    return chunks if chunks else [text_clean[:chars_per_chunk]]
def get_embeddings(texts: list[str]) -> list[list[float]]:
"""Get embeddings from OpenAI."""
response = openai_client.embeddings.create(
model=EMBEDDING_MODEL,
input=texts
)
return [e.embedding for e in response.data]
def ingest(text: str, source: str = "manual",
url: str = "", metadata: dict = None):
"""Chunk, embed, and store text."""
chunks = chunk_text(text)
embeddings = get_embeddings(chunks)
ids = [hashlib.md5(
(c + source).encode()).hexdigest() for c in chunks]
metadatas = [
{"source": source, "url": url,
"chunk_index": i, "total_chunks": len(chunks),
**(metadata or {})}
for i in range(len(chunks))
]
collection.upsert(
ids=ids, embeddings=embeddings,
documents=chunks, metadatas=metadatas
)
return len(chunks)
@server.list_tools()
async def list_tools():
return [
Tool(name="ingest_text",
description="Store text in the knowledge base "
"for future retrieval.",
inputSchema={
"type": "object",
"properties": {
"text": {"type": "string",
"description": "Text to store"},
"source": {"type": "string",
"description": "Source label"},
"url": {"type": "string",
"description": "Source URL"}
},
"required": ["text"]
}),
Tool(name="ingest_url",
description="Fetch a URL, extract text, and store "
"in the knowledge base.",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string",
"description": "URL to fetch"}
},
"required": ["url"]
}),
Tool(name="search",
description="Search the knowledge base for "
"information relevant to a query.",
inputSchema={
"type": "object",
"properties": {
"query": {"type": "string",
"description": "Search query"},
"n_results": {"type": "integer",
"description": "Max results",
"default": TOP_K}
},
"required": ["query"]
}),
]
@server.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "ingest_text":
n = ingest(arguments["text"],
arguments.get("source", "manual"),
arguments.get("url", ""))
return [TextContent(type="text",
text=f"Stored {n} chunks in knowledge base.")]
elif name == "ingest_url":
url = arguments["url"]
req = urllib.request.Request(url, headers={
"User-Agent": "KnowledgeBot/1.0"})
html = urllib.request.urlopen(req).read().decode()
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
tag.decompose()
text = soup.get_text(separator="\n", strip=True)
n = ingest(text, source="web", url=url)
return [TextContent(type="text",
text=f"Fetched and stored {n} chunks from {url}")]
elif name == "search":
query = arguments["query"]
n_results = arguments.get("n_results", TOP_K)
q_embedding = get_embeddings([query])[0]
results = collection.query(
query_embeddings=[q_embedding],
n_results=n_results,
include=["documents", "metadatas", "distances"]
)
output = []
for i, doc in enumerate(results["documents"][0]):
meta = results["metadatas"][0][i]
dist = results["distances"][0][i]
relevance = round(1 - dist, 3)
output.append(
f"[{relevance}] {doc[:500]}\n"
f" Source: {meta.get('source', '?')} "
f"| URL: {meta.get('url', 'N/A')}")
return [TextContent(type="text",
text="\n\n".join(output) if output
else "No relevant results found.")]
async def main():
async with stdio_server() as (read, write):
await server.run(read, write,
server.create_initialization_options())
if __name__ == "__main__":
import asyncio
asyncio.run(main())
Step 3: Configure Your AI Client
Add to Claude Desktop's claude_desktop_config.json:
{
"mcpServers": {
"knowledge-base": {
"command": "python",
"args": ["knowledge_mcp_server.py"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
For OpenClaw, add to your MCP configuration:
# In your OpenClaw config
mcp:
servers:
knowledge-base:
command: python
args: ["/path/to/knowledge_mcp_server.py"]
env:
OPENAI_API_KEY: "sk-..."
Step 4: Use It
Now your agent has three new capabilities:
- "Store this article about Kubernetes security" โ Agent calls
ingest_url - "Remember that pgvector supports HNSW indexes" โ Agent calls
ingest_text - "What do we know about vector database performance?" โ Agent calls
search, gets relevant chunks, incorporates them into its response
Ingesting Different Data Types
# HTML/Web pages – handled by ingest_url
# For PDFs:
from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
ingest(text, source="pdf", url="document.pdf")
# For Markdown files:
with open("notes.md") as f:
ingest(f.read(), source="markdown", url="notes.md")
# For tweets/posts (pass as text):
ingest("Thread by @karpathy: The hottest new programming "
"language is English...",
source="twitter", url="https://x.com/karpathy/...")
RAG Pitfalls & How to Avoid Them
1. Chunk Size Mismatch
Too small (50 tokens): Chunks lack context. "It costs $20/month" – what costs $20/month? Too large (2000 tokens): Chunks contain too many topics, reducing precision. Sweet spot: 256–512 tokens for most content.
2. Irrelevant Retrieval
The top-K results look relevant by embedding similarity but don't actually answer the question. Fix: Add a relevance threshold (drop results below 0.7 cosine similarity). Add reranking.
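With the ChromaDB setup from this guide, the threshold is a few lines. A sketch: ChromaDB returns cosine distance, so similarity is roughly 1 - distance, and 0.7 is a starting point to tune, not a law.

def search_with_threshold(collection, q_embedding,
                          min_similarity: float = 0.7, k: int = 20):
    """Retrieve k candidates, keep only those above the similarity cutoff."""
    results = collection.query(query_embeddings=[q_embedding], n_results=k,
                               include=["documents", "distances"])
    return [doc for doc, dist in zip(results["documents"][0],
                                     results["distances"][0])
            if (1 - dist) >= min_similarity]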
3. Hallucination from Bad Context
The AI gets retrieved chunks that are tangentially related and confabulates an answer that sounds right but isn't grounded. Fix: Instruct the model to cite which retrieved chunk supports each claim. Use "answer only from the provided context" system prompts.
4. Stale Data
Your knowledge base has outdated information but no way to know. Fix: Store ingestion timestamps as metadata. Let the agent filter by recency. Implement a refresh/re-ingest pipeline for important sources.
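Continuing from the server code above (its ingest() helper already accepts extra metadata, and ChromaDB where filters support numeric comparisons such as $gte), a sketch of timestamping and recency filtering:

import time

# At ingestion: stamp every chunk with a unix timestamp.
ingest("example text worth remembering", source="web",
       url="https://example.com/article",
       metadata={"ingested_at": time.time()})

# At query time: only consider chunks ingested in the last 90 days.
cutoff = time.time() - 90 * 24 * 3600
q_embedding = get_embeddings(["example query"])[0]
results = collection.query(query_embeddings=[q_embedding], n_results=TOP_K,
                           where={"ingested_at": {"$gte": cutoff}})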
5. Embedding Model Mismatch
You embedded documents with model A and query with model B. Vectors live in incompatible spaces. Fix: Never mix embedding models. If you switch models, re-embed everything.
What the Community Says
Hacker News
The HN community has been actively discussing these technologies:
- "Forget Vector Databases: RAG with Just SQL and LLM" โ RisingWave's argument that you can do RAG with SQL streaming, no vector DB needed. Interesting but niche.[9]
- "I accidentally built SQLite for AI memory (Memvid)" โ A project storing memories as video-encoded vectors. Creative approach to the persistence problem.[10]
- "Clamp โ Git-like version control for RAG vector databases" โ Versioning for your knowledge base, solving the "oops I corrupted my embeddings" problem.
- MCP discussions are exploding โ MindsDB, PostgreSQL, and dozens of other tools adding MCP support as it becomes the de facto standard.
Reddit (r/MachineLearning)
The consensus on vector database choice from the ML community:[8]
- "Qdrant replaced Milvus and I've had no issues" โ common sentiment
- "FAISS is NOT a vector database. It's a lower-level index" โ important distinction
- "Pinecone has vendor lock-in; I tend to avoid these" โ preference for open-source
- ChromaDB getting better but still seen as prototyping-only by many
GitHub Trending
The major RAG frameworks being used in production:
- LlamaIndex – Purpose-built for RAG. Best for connecting LLMs to custom data. Excellent MCP integration.
- LangChain – General-purpose LLM framework. Larger ecosystem but more complexity. Good for chains of operations.
- Haystack (deepset) – Production-focused. Strong on evaluation and pipeline composition.
- RAGFlow – Open-source RAG engine with built-in document processing and chunking.
For this guide, we skip the frameworks and build directly on chromadb + openai + mcp. The frameworks add abstraction layers you don't need for this use case. When your needs grow, LlamaIndex is the natural next step – it was literally designed for "connect your LLM to your data."
References
- Introducing the Model Context Protocol – Anthropic, November 2024
- Model Context Protocol – Wikipedia: overview, history, and AAIF donation
- A Year of MCP: From Internal Experiment to Industry Standard – Pento AI, 2025
- Vibe Coding RAG with MCP Server – Qdrant Blog
- rag-mcp-server – PyPI package for RAG via MCP
- postgres-mcp – PostgreSQL MCP server, GitHub
- RAG Infrastructure: Building Production Systems – Introl, December 2025
- What's the best Vector DB? – r/MachineLearning, February 2025
- Forget Vector Databases: RAG with Just SQL and LLM – RisingWave
- Memvid: SQLite for AI Memory – GitHub
- MCP Specification – modelcontextprotocol.io
- Best Vector Databases in 2025: A Complete Comparison – Firecrawl
- Integrating Agentic RAG with MCP Servers – Omar Santos, Medium
- How to Build a Python MCP Server to Consult a Knowledge Base – Auth0 Blog