Introduction
Every AI agent has the same fundamental limitation: it forgets. Each conversation starts from zero. The research it did yesterday, the preferences it learned, the documents it analyzed – all gone when the session ends. Context windows are getting larger (Gemini's 2M tokens, Claude's 200K), but they're still finite, expensive to fill, and ephemeral.
What if your AI agent could build a permanent, searchable knowledge base from everything it encounters? Every web page it researches, every PDF it reads, every conversation insight – indexed, embedded, and retrievable in milliseconds. That's what we're building in this guide.
The stack is three technologies working together:
- MCP (Model Context Protocol) – Anthropic's open standard that lets AI agents connect to external tools and data sources through a universal interface
- RAG (Retrieval-Augmented Generation) – The technique of finding relevant information and injecting it into the AI's prompt before it answers
- Vector Databases – Purpose-built storage that understands semantic similarity, not just keyword matching
By the end, you'll have a working system where your AI agent can store anything it learns and recall it when needed – across sessions, across topics, growing smarter over time.
The Problem: AI Amnesia
Let's be concrete about what we're solving. Consider an AI agent like OpenClaw running Claude:
- Monday: You ask it to research vector databases. It reads 15 articles, compares 8 products, writes a summary. Brilliant work.
- Wednesday: You ask "what did you find about Qdrant vs ChromaDB?" – blank stare. New session. All gone.
- Friday: You ask it to write a blog post about vector databases. It starts from scratch, potentially reaching different conclusions than Monday.
The agent has no persistent memory. It's like having an incredibly smart colleague who gets amnesia every time they leave the room. File-based memory (like markdown notes) helps, but it doesn't scale – you can't semantically search through 10,000 markdown files efficiently.
MCP: The Universal Plug for AI Agents
What Is MCP?
The Model Context Protocol (MCP) is an open standard released by Anthropic in November 2024. Think of it as USB-C for AI – a single, universal interface that connects any AI model to any external tool or data source.[1]
Before MCP, connecting an AI agent to, say, a database required building a custom integration for each combination of AI model and data source. If you had 5 AI models and 10 data sources, that's 50 custom integrations. MCP collapses this to 5 + 10 = 15: each model implements MCP once, each data source implements MCP once, and they all work together.
How MCP Works
MCP uses a client-server architecture built on JSON-RPC 2.0, inspired by the Language Server Protocol (LSP) that powers IDE features like autocomplete:[2]
- MCP Client – Lives inside the AI application (Claude Desktop, OpenClaw, Cursor, etc.). It discovers and calls tools exposed by servers.
- MCP Server – A lightweight process that exposes specific capabilities. It can provide tools (functions the AI can call), resources (data the AI can read), and prompts (templates for common operations).
- Transport – Communication happens over stdio (local processes) or SSE/HTTP (remote servers).
When Claude Desktop connects to an MCP server, it discovers what tools are available and can call them as needed during conversation. The AI decides when to use a tool based on the user's question โ it's not hardcoded.
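To make the discovery step concrete, here is roughly what the JSON-RPC traffic looks like. This is an illustrative sketch of the message shapes only: the method names (tools/list, tools/call) come from the MCP spec, but the search tool shown here is hypothetical.

# Illustrative only: the JSON-RPC 2.0 messages an MCP client and server
# exchange over the stdio transport (one JSON object per line).
import json

list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

list_response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {"tools": [{
        "name": "search",
        "description": "Search the knowledge base",
        "inputSchema": {"type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"]}
    }]}
}

call_request = {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                "params": {"name": "search",
                           "arguments": {"query": "qdrant vs chromadb"}}}

print(json.dumps(call_request))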
Why Anthropic Created MCP
The origin story is pragmatic. Developer David Soria Parra was frustrated with constantly copying context between tools and AI assistants.[3] Every integration was a one-off. MCP emerged as the universal solution – build the connector once, use it everywhere. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF) under the Linux Foundation, co-founded with Block and OpenAI, signaling it's becoming an industry standard, not just an Anthropic project.[2]
MCP vs Traditional API Integrations
| Aspect | Traditional APIs | MCP |
|---|---|---|
| Discovery | Manual – read docs, write code | Automatic – AI discovers available tools |
| Integration effort | Custom per API × per AI model | Implement once per side |
| Security | Varies wildly | Standardized capability negotiation |
| Composability | Each integration is isolated | AI can combine tools from multiple servers |
| Ecosystem | Fragmented | Growing – 10,000+ community servers |
Existing MCP Servers for Knowledge & Memory
The MCP ecosystem already has servers relevant to building a knowledge base:
- mcp-server-qdrant – Official Qdrant MCP server. Store and retrieve semantic memories. Configure collection name, embedding model, and search parameters. Drop-in knowledge base.[4]
- rag-mcp-server (PyPI) – Generic RAG server supporting multiple embedding models including multilingual options. Initialize a knowledge base from a directory, then search it.[5]
- mcp-server-filesystem – Read/write local files. The simplest "memory" – save notes as files. Limited to keyword search.
- postgres-mcp – Full PostgreSQL access via MCP. Combined with pgvector, this becomes a knowledge base with SQL power.[6]
- knowledge-rag (Lobehub) – Document ingestion + vector search with configurable chunking.
The fastest way to get started: install mcp-server-qdrant, run Qdrant locally via Docker, and configure it in Claude Desktop or OpenClaw. Your agent can immediately start storing and retrieving information semantically. We'll build a custom server later for more control.
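If you want to try that route before building anything custom, the Claude Desktop entry looks roughly like this. Treat it as a sketch: the command and environment variable names below are assumptions based on typical mcp-server-qdrant setups, so check the server's README for the exact values.

{
  "mcpServers": {
    "qdrant-memory": {
      "command": "uvx",
      "args": ["mcp-server-qdrant"],
      "env": {
        "QDRANT_URL": "http://localhost:6333",
        "COLLECTION_NAME": "agent_memory"
      }
    }
  }
}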
RAG: How Retrieval-Augmented Generation Works
RAG is the technique that makes this all work. Instead of hoping the AI "knows" something from training, you find relevant information and inject it into the prompt before the AI generates a response. It's like giving the AI an open-book exam instead of a closed-book one.
The RAG pipeline has two phases:
Ingestion Pipeline (Offline)
- Ingest – Load documents (PDFs, web pages, markdown, tweets, code files)
- Chunk – Split documents into smaller pieces (typically 200–1000 tokens each)
- Embed – Convert each chunk into a high-dimensional vector using an embedding model
- Store – Save vectors + original text + metadata in a vector database
Query Pipeline (Online)
- Embed query – Convert the user's question into a vector using the same embedding model
- Search – Find the most similar vectors in the database (cosine similarity / dot product)
- Rerank (optional) – Use a cross-encoder to re-score results for better precision
- Augment – Inject the top-k retrieved chunks into the LLM prompt as context
- Generate – The LLM answers using both its training knowledge and the retrieved context (a minimal end-to-end sketch of this phase follows below)
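Here is a minimal sketch of that query phase in Python, using the same stack we set up later in this guide (ChromaDB + OpenAI embeddings). The collection name, ./knowledge_db path, and model names match the server built below; the optional rerank step is omitted.

# Minimal RAG query loop: embed → search → augment → generate.
# Assumes OPENAI_API_KEY is set and ./knowledge_db was populated at ingestion.
import chromadb
from openai import OpenAI

client = OpenAI()
collection = chromadb.PersistentClient(path="./knowledge_db") \
    .get_or_create_collection("knowledge_base")

def answer(question: str, k: int = 5) -> str:
    # 1. Embed the query with the SAME model used at ingestion time.
    q_vec = client.embeddings.create(model="text-embedding-3-small",
                                     input=[question]).data[0].embedding
    # 2. Retrieve the top-k most similar chunks.
    hits = collection.query(query_embeddings=[q_vec], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    # 3. Augment the prompt with the retrieved context, then 4. generate.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "Say so if the context is insufficient."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}])
    return reply.choices[0].message.content

print(answer("What did we learn about Qdrant vs ChromaDB?"))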
Chunking Strategies
How you split documents determines retrieval quality more than almost any other decision. Get this wrong and your agent retrieves garbage; get it right and it's magic. The table below compares the main strategies; a minimal heading-aware splitter is sketched after it.
| Strategy | How It Works | Best For | Pitfall |
|---|---|---|---|
| Fixed-size | Split every N tokens/characters with overlap | Simple, predictable. Good starting point | Cuts mid-sentence, breaks context |
| Recursive | Split by paragraphs → sentences → words, recursively | Most content types. LangChain default | Still structural, not semantic |
| Semantic | Embed sentences, split where similarity drops | Long documents with topic shifts | Slower, requires embedding calls during ingestion |
| Heading-aware | Split on markdown/HTML headers | Structured docs, READMEs, wikis | Sections may be too large or too small |
| Agentic / Proposition | LLM extracts atomic facts from text | Highest quality, research applications | Expensive – requires LLM call per chunk |
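To make one row concrete, here's a minimal heading-aware splitter for markdown. It's a sketch, not production code: it cuts on headers and falls back to fixed-size windows when a section runs long.

# Heading-aware chunking sketch: split a markdown document at headings,
# then fall back to fixed-size windows for any oversized section.
import re

def split_markdown(text: str, max_chars: int = 2000) -> list[str]:
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)  # cut before each heading
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks += [section[i:i + max_chars]
                       for i in range(0, len(section), max_chars)]
    return chunks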
Embedding Models
The embedding model converts text into vectors. The quality of these vectors directly determines retrieval accuracy. Here's the current landscape:
| Model | Dimensions | Context | Cost | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | $0.02 / 1M tokens | Best value. Great for most use cases |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | $0.13 / 1M tokens | Higher quality, 6.5× more expensive |
| Voyage-3-large | 1024 | 32K tokens | $0.18 / 1M tokens | Top MTEB scores, outperforms OpenAI by 9–20%[7] |
| Cohere embed-v3 | 1024 | 512 tokens | $0.10 / 1M tokens | Built-in search/classification modes |
| nomic-embed-text (local) | 768 | 8192 tokens | Free (local) | Best open-source option. Runs on CPU |
| BGE-large-en-v1.5 (local) | 1024 | 512 tokens | Free (local) | Strong MTEB scores, BAAI model |
| Granite-embedding-278m | 768 | 512 tokens | Free (local) | IBM model, multilingual support |
If you want zero API cost and full local privacy, run nomic-embed-text via Ollama locally.
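As a rough sketch of what the local route looks like (assuming Ollama is running on its default port and you've pulled the model with ollama pull nomic-embed-text):

# Get a 768-dimensional embedding from a local Ollama instance.
import requests

def embed_local(text: str) -> list[float]:
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": "nomic-embed-text", "prompt": text},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["embedding"]

print(len(embed_local("pgvector supports HNSW indexes")))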
Retrieval Strategies & Reranking
Similarity Search
The simplest approach: embed the query, find the K nearest vectors by cosine similarity. Fast, works well for straightforward questions. This is what 90% of RAG implementations use.
Hybrid Search (BM25 + Vector)
Combines traditional keyword matching (BM25/TF-IDF) with semantic vector search. This catches cases where the user asks for a specific term that semantic search might miss. For example, searching for "pgvector": keyword search finds exact matches, while vector search finds conceptually similar content about "PostgreSQL vector extension."
Weaviate and Qdrant have hybrid search built-in. For others, you implement it by running both searches and merging results with reciprocal rank fusion (RRF).
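If your database doesn't do the fusion for you, RRF is about ten lines. A sketch (k=60 is the commonly used constant; the document IDs are whatever your store returns):

def rrf_merge(keyword_ids: list[str], vector_ids: list[str],
              k: int = 60, top_n: int = 10) -> list[str]:
    """Merge a BM25 ranking and a vector ranking with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "doc_pgvector" wins: first in keyword search, third in vector search.
print(rrf_merge(["doc_pgvector", "doc_postgres", "doc_hnsw"],
                ["doc_embeddings", "doc_hnsw", "doc_pgvector"]))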
Reranking
After initial retrieval (say, top 20 results), a cross-encoder reranker scores each result against the original query more carefully. This is slower but dramatically improves precision. Popular rerankers:
- Cohere Rerank – API-based, easy to use, excellent quality
- bge-reranker-v2-m3 – Open-source, run locally
- Pinecone Rerank – Built into Pinecone's pipeline
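Running the open-source option above locally is a few lines with sentence-transformers (a sketch; the model weights download on first use):

# Rerank ~20 initial candidates with a cross-encoder, keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
    return [p for _, p in ranked[:top_n]]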
Vector Database Comparison
This is the most-asked question in the RAG space. Here's an honest comparison based on real-world usage, community sentiment, and benchmarks:
| Database | Type | Language | Hybrid Search | Best For | Free Tier |
|---|---|---|---|---|---|
| ChromaDB | Embedded | Python | Basic | Prototyping, small projects | Fully open-source |
| Qdrant | Client-server | Rust | Yes (sparse + dense) | Production local/cloud, fast filtering | Open-source + 1GB cloud free |
| Weaviate | Client-server | Go | Built-in (best) | Knowledge graphs, hybrid search | Open-source + sandbox |
| Pinecone | Managed cloud | – | Yes | Zero-ops, scaling without thinking | Serverless free tier |
| pgvector | PG extension | C | With tsvector | Already using PostgreSQL | Free (PG extension) |
| Milvus | Distributed | Go/C++ | Yes | Enterprise, billions of vectors | Open-source (Zilliz cloud) |
| FAISS | Library | C++ (Python) | No | Research, in-memory speed | Free (Meta library) |
| LanceDB | Embedded | Rust | Yes | Serverless, multimodal | Fully open-source |
Which Should You Pick?
Just Getting Started – ChromaDB
pip install chromadb and you're running. No Docker, no config. Perfect for prototyping your RAG pipeline. Limitation: not designed for production scale or concurrent access.
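A taste of how little code that is (ChromaDB's default embedding function runs a small local model, so this sketch needs no API key):

import chromadb

client = chromadb.PersistentClient(path="./demo_db")
collection = client.get_or_create_collection("notes")

collection.add(ids=["note-1"],
               documents=["pgvector supports HNSW and IVFFlat indexes."],
               metadatas=[{"source": "manual"}])

results = collection.query(query_texts=["which indexes does pgvector have?"],
                           n_results=1)
print(results["documents"][0][0])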
Local Production – Qdrant
docker run -p 6333:6333 qdrant/qdrant and you have a production-grade vector DB running locally. Written in Rust, so it's fast. Has an official MCP server. Best balance of ease-of-use, performance, and features. The Reddit consensus backs this.[8]
Already Using Postgres – pgvector
Don't add another database. Install the extension: CREATE EXTENSION vector;. Your vectors live alongside your relational data. Supports HNSW and IVFFlat indexes. Performance is good for up to ~50-100M vectors.
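For orientation, the whole flow in Python looks roughly like this. It's a sketch using psycopg and the pgvector helper package; the connection string, table name, and 1536-dimension column (matching text-embedding-3-small) are assumptions for illustration.

# pip install psycopg pgvector
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

query_vector = np.random.rand(1536).astype(np.float32)  # stand-in for a real embedding

with psycopg.connect("postgresql://localhost/knowledge") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)
    conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
                        id bigserial PRIMARY KEY,
                        content text,
                        embedding vector(1536))""")
    # HNSW index for approximate nearest-neighbor search on cosine distance.
    conn.execute("CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
                 "ON chunks USING hnsw (embedding vector_cosine_ops)")
    # <=> is cosine distance: smaller means more similar.
    rows = conn.execute("SELECT content FROM chunks "
                        "ORDER BY embedding <=> %s LIMIT 5",
                        (query_vector,)).fetchall()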
The Complete Architecture
Here's what we're building – an AI agent (OpenClaw/Claude) with persistent, searchable memory that grows over time:
Data Pipeline: Ingest → Chunk → Embed → Store
When the agent encounters useful information (articles, docs, conversations), it calls the MCP server's ingest tool:
- Ingest: Accept raw text, URL, file path, or HTML
- Parse: Extract clean text (strip HTML, extract PDF text)
- Chunk: Split into ~512-token pieces with overlap
- Embed: Send chunks to embedding API, get vectors back
- Store: Upsert vectors + text + metadata into vector DB
Query Pipeline: Question → Retrieve → Augment
When the agent needs to answer a question, it calls the search tool:
- Embed query: Convert the question to a vector
- Search: Find top-K similar chunks (K=5-10)
- Filter: Apply metadata filters (date range, source type)
- Return: Send relevant chunks back to the agent as context
Getting Started: The Simplest Setup
Let's build the simplest possible knowledge base: ChromaDB + OpenAI embeddings + a custom MCP server. Total setup time: ~15 minutes.
Prerequisites
- Python 3.10+
- An OpenAI API key (for embeddings)
- Claude Desktop, OpenClaw, or any MCP-compatible client
Step 1: Install Dependencies
pip install chromadb openai "mcp[cli]" beautifulsoup4 PyPDF2
Step 2: Create the MCP Server
This server exposes three tools: ingest_text, ingest_url, and search.
Complete Code: Knowledge Base MCP Server
# knowledge_mcp_server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import chromadb
from openai import OpenAI
import hashlib
from bs4 import BeautifulSoup
import urllib.request
# --- Configuration ---
EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "knowledge_base"
CHUNK_SIZE = 512 # tokens (roughly 4 chars per token)
CHUNK_OVERLAP = 50 # token overlap between chunks
TOP_K = 8 # number of results to return
# --- Init ---
openai_client = OpenAI() # uses OPENAI_API_KEY env var
chroma_client = chromadb.PersistentClient(path="./knowledge_db")
collection = chroma_client.get_or_create_collection(
name=COLLECTION_NAME,
metadata={"hnsw:space": "cosine"}
)
server = Server("knowledge-base")
def chunk_text(text: str, chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks by approximate token count."""
    # Normalize whitespace so character offsets are stable.
    text_clean = " ".join(text.split())
    # ~4 characters per token is a rough heuristic for English text.
    chars_per_chunk = chunk_size * 4
    chars_overlap = overlap * 4
    chunks = []
    start = 0
    while start < len(text_clean):
        chunk = text_clean[start:start + chars_per_chunk]
        if chunk.strip():
            chunks.append(chunk.strip())
        start += chars_per_chunk - chars_overlap
    return chunks if chunks else [text_clean[:chars_per_chunk]]
def get_embeddings(texts: list[str]) -> list[list[float]]:
"""Get embeddings from OpenAI."""
response = openai_client.embeddings.create(
model=EMBEDDING_MODEL,
input=texts
)
return [e.embedding for e in response.data]
def ingest(text: str, source: str = "manual",
url: str = "", metadata: dict = None):
"""Chunk, embed, and store text."""
chunks = chunk_text(text)
embeddings = get_embeddings(chunks)
ids = [hashlib.md5(
(c + source).encode()).hexdigest() for c in chunks]
metadatas = [
{"source": source, "url": url,
"chunk_index": i, "total_chunks": len(chunks),
**(metadata or {})}
for i in range(len(chunks))
]
collection.upsert(
ids=ids, embeddings=embeddings,
documents=chunks, metadatas=metadatas
)
return len(chunks)
@server.list_tools()
async def list_tools():
return [
Tool(name="ingest_text",
description="Store text in the knowledge base "
"for future retrieval.",
inputSchema={
"type": "object",
"properties": {
"text": {"type": "string",
"description": "Text to store"},
"source": {"type": "string",
"description": "Source label"},
"url": {"type": "string",
"description": "Source URL"}
},
"required": ["text"]
}),
Tool(name="ingest_url",
description="Fetch a URL, extract text, and store "
"in the knowledge base.",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string",
"description": "URL to fetch"}
},
"required": ["url"]
}),
Tool(name="search",
description="Search the knowledge base for "
"information relevant to a query.",
inputSchema={
"type": "object",
"properties": {
"query": {"type": "string",
"description": "Search query"},
"n_results": {"type": "integer",
"description": "Max results",
"default": TOP_K}
},
"required": ["query"]
}),
]
@server.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "ingest_text":
n = ingest(arguments["text"],
arguments.get("source", "manual"),
arguments.get("url", ""))
return [TextContent(type="text",
text=f"Stored {n} chunks in knowledge base.")]
elif name == "ingest_url":
url = arguments["url"]
req = urllib.request.Request(url, headers={
"User-Agent": "KnowledgeBot/1.0"})
html = urllib.request.urlopen(req).read().decode()
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
tag.decompose()
text = soup.get_text(separator="\n", strip=True)
n = ingest(text, source="web", url=url)
return [TextContent(type="text",
text=f"Fetched and stored {n} chunks from {url}")]
elif name == "search":
query = arguments["query"]
n_results = arguments.get("n_results", TOP_K)
q_embedding = get_embeddings([query])[0]
results = collection.query(
query_embeddings=[q_embedding],
n_results=n_results,
include=["documents", "metadatas", "distances"]
)
output = []
for i, doc in enumerate(results["documents"][0]):
meta = results["metadatas"][0][i]
dist = results["distances"][0][i]
relevance = round(1 - dist, 3)
output.append(
f"[{relevance}] {doc[:500]}\n"
f" Source: {meta.get('source', '?')} "
f"| URL: {meta.get('url', 'N/A')}")
return [TextContent(type="text",
text="\n\n".join(output) if output
else "No relevant results found.")]
async def main():
async with stdio_server() as (read, write):
await server.run(read, write,
server.create_initialization_options())
if __name__ == "__main__":
import asyncio
asyncio.run(main())
Step 3: Configure Your AI Client
Add to Claude Desktop's claude_desktop_config.json:
{
"mcpServers": {
"knowledge-base": {
"command": "python",
"args": ["knowledge_mcp_server.py"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
For OpenClaw, add to your MCP configuration:
# In your OpenClaw config
mcp:
servers:
knowledge-base:
command: python
args: ["/path/to/knowledge_mcp_server.py"]
env:
OPENAI_API_KEY: "sk-..."
Step 4: Use It
Now your agent has three new capabilities:
- "Store this article about Kubernetes security" โ Agent calls
ingest_url - "Remember that pgvector supports HNSW indexes" โ Agent calls
ingest_text - "What do we know about vector database performance?" โ Agent calls
search, gets relevant chunks, incorporates them into its response
Ingesting Different Data Types
# HTML/Web pages – handled by ingest_url
# For PDFs:
from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
ingest(text, source="pdf", url="document.pdf")
# For Markdown files:
with open("notes.md") as f:
ingest(f.read(), source="markdown", url="notes.md")
# For tweets/posts (pass as text):
ingest("Thread by @karpathy: The hottest new programming "
"language is English...",
source="twitter", url="https://x.com/karpathy/...")
RAG Pitfalls & How to Avoid Them
1. Chunk Size Mismatch
Too small (50 tokens): Chunks lack context. "It costs $20/month" – what costs $20/month? Too large (2000 tokens): Chunks contain too many topics, reducing precision. Sweet spot: 256–512 tokens for most content.
2. Irrelevant Retrieval
The top-K results look relevant by embedding similarity but don't actually answer the question. Fix: Add a relevance threshold (drop results below 0.7 cosine similarity). Add reranking.
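With the ChromaDB setup from this guide, the threshold is a few lines. A sketch: ChromaDB returns cosine distance, so similarity is roughly 1 - distance, and 0.7 is a starting point to tune, not a law.

def search_with_threshold(collection, q_embedding,
                          min_similarity: float = 0.7, k: int = 20):
    """Retrieve k candidates, keep only those above the similarity cutoff."""
    results = collection.query(query_embeddings=[q_embedding], n_results=k,
                               include=["documents", "distances"])
    return [doc for doc, dist in zip(results["documents"][0],
                                     results["distances"][0])
            if (1 - dist) >= min_similarity]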
3. Hallucination from Bad Context
The AI gets retrieved chunks that are tangentially related and confabulates an answer that sounds right but isn't grounded. Fix: Instruct the model to cite which retrieved chunk supports each claim. Use "answer only from the provided context" system prompts.
4. Stale Data
Your knowledge base has outdated information but no way to know. Fix: Store ingestion timestamps as metadata. Let the agent filter by recency. Implement a refresh/re-ingest pipeline for important sources.
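Continuing from the server code above (its ingest() helper already accepts extra metadata, and ChromaDB where filters support numeric comparisons such as $gte), a sketch of timestamping and recency filtering:

import time

# At ingestion: stamp every chunk with a unix timestamp.
ingest("example text worth remembering", source="web",
       url="https://example.com/article",
       metadata={"ingested_at": time.time()})

# At query time: only consider chunks ingested in the last 90 days.
cutoff = time.time() - 90 * 24 * 3600
q_embedding = get_embeddings(["example query"])[0]
results = collection.query(query_embeddings=[q_embedding], n_results=TOP_K,
                           where={"ingested_at": {"$gte": cutoff}})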
5. Embedding Model Mismatch
You embedded documents with model A and query with model B. Vectors live in incompatible spaces. Fix: Never mix embedding models. If you switch models, re-embed everything.
What the Community Says
Hacker News
The HN community has been actively discussing these technologies:
- "Forget Vector Databases: RAG with Just SQL and LLM" โ RisingWave's argument that you can do RAG with SQL streaming, no vector DB needed. Interesting but niche.[9]
- "I accidentally built SQLite for AI memory (Memvid)" โ A project storing memories as video-encoded vectors. Creative approach to the persistence problem.[10]
- "Clamp โ Git-like version control for RAG vector databases" โ Versioning for your knowledge base, solving the "oops I corrupted my embeddings" problem.
- MCP discussions are exploding โ MindsDB, PostgreSQL, and dozens of other tools adding MCP support as it becomes the de facto standard.
Reddit (r/MachineLearning)
The consensus on vector database choice from the ML community:[8]
- "Qdrant replaced Milvus and I've had no issues" โ common sentiment
- "FAISS is NOT a vector database. It's a lower-level index" โ important distinction
- "Pinecone has vendor lock-in; I tend to avoid these" โ preference for open-source
- ChromaDB getting better but still seen as prototyping-only by many
GitHub Trending
The major RAG frameworks being used in production:
- LlamaIndex – Purpose-built for RAG. Best for connecting LLMs to custom data. Excellent MCP integration.
- LangChain – General-purpose LLM framework. Larger ecosystem but more complexity. Good for chains of operations.
- Haystack (deepset) – Production-focused. Strong on evaluation and pipeline composition.
- RAGFlow – Open-source RAG engine with built-in document processing and chunking.
For this guide, we skip the frameworks and build directly on chromadb + openai + mcp. The frameworks add abstraction layers you don't need for this use case. When your needs grow, LlamaIndex is the natural next step – it was literally designed for "connect your LLM to your data."
References
- Introducing the Model Context Protocol – Anthropic, November 2024
- Model Context Protocol – Wikipedia: overview, history, and AAIF donation
- A Year of MCP: From Internal Experiment to Industry Standard – Pento AI, 2025
- Vibe Coding RAG with MCP Server – Qdrant Blog
- rag-mcp-server – PyPI package for RAG via MCP
- postgres-mcp – PostgreSQL MCP server, GitHub
- RAG Infrastructure: Building Production Systems – Introl, December 2025
- What's the best Vector DB? – r/MachineLearning, February 2025
- Forget Vector Databases: RAG with Just SQL and LLM – RisingWave
- Memvid: SQLite for AI Memory – GitHub
- MCP Specification – modelcontextprotocol.io
- Best Vector Databases in 2025: A Complete Comparison – Firecrawl
- Integrating Agentic RAG with MCP Servers – Omar Santos, Medium
- How to Build a Python MCP Server to Consult a Knowledge Base – Auth0 Blog