๐ŸŽง Listen ~10 min
๐Ÿ“บ Watch the video version: MCP + RAG + Vector DB Guide

Introduction

Every AI agent has the same fundamental limitation: it forgets. Each conversation starts from zero. The research it did yesterday, the preferences it learned, the documents it analyzed โ€” all gone when the session ends. Context windows are getting larger (Gemini's 2M tokens, Claude's 200K), but they're still finite, expensive to fill, and ephemeral.

What if your AI agent could build a permanent, searchable knowledge base from everything it encounters? Every web page it researches, every PDF it reads, every conversation insight โ€” indexed, embedded, and retrievable in milliseconds. That's what we're building in this guide.

The stack is three technologies working together:

  1. MCP — the protocol that lets your AI agent call external tools
  2. RAG — the pipeline that retrieves relevant knowledge and injects it into the prompt
  3. A vector database — the store that makes semantic search over everything the agent has seen possible

By the end, you'll have a working system where your AI agent can store anything it learns and recall it when needed โ€” across sessions, across topics, growing smarter over time.

The Problem: AI Amnesia

Let's be concrete about what we're solving. Consider an AI agent like OpenClaw running Claude.

The agent has no persistent memory. It's like having an incredibly smart colleague who gets amnesia every time they leave the room. File-based memory (like markdown notes) helps, but it doesn't scale โ€” you can't semantic-search through 10,000 markdown files efficiently.

๐Ÿ’ก What We Want An agent that can say: "Based on the 47 articles I've read about vector databases over the past 3 months, here's what I recommend for your use caseโ€ฆ" โ€” and actually have those 47 articles in its retrievable memory.

MCP: The Universal Plug for AI Agents

What Is MCP?

The Model Context Protocol (MCP) is an open standard released by Anthropic in November 2024. Think of it as USB-C for AI โ€” a single, universal interface that connects any AI model to any external tool or data source.[1]

Before MCP, connecting an AI agent to, say, a database required building a custom integration for each combination of AI model and data source. If you had 5 AI models and 10 data sources, that's 50 custom integrations. MCP collapses this to 5 + 10 = 15: each model implements MCP once, each data source implements MCP once, and they all work together.

How MCP Works

MCP uses a client-server architecture built on JSON-RPC 2.0, inspired by the Language Server Protocol (LSP) that powers IDE features like autocomplete.[2]

When Claude Desktop connects to an MCP server, it discovers what tools are available and can call them as needed during conversation. The AI decides when to use a tool based on the user's question โ€” it's not hardcoded.
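Discovery is just a JSON-RPC exchange. As an illustration (tool names and fields abbreviated, not a real transcript), the client sends `{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}` and gets back something like:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "search",
        "description": "Search the knowledge base",
        "inputSchema": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    ]
  }
}
```

The `inputSchema` is plain JSON Schema, which is how the model knows what arguments each tool takes without any hardcoding.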

Why Anthropic Created MCP

The origin story is pragmatic. Developer David Soria Parra was frustrated with constantly copying context between tools and AI assistants.[3] Every integration was a one-off. MCP emerged as the universal solution โ€” build the connector once, use it everywhere. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF) under the Linux Foundation, co-founded with Block and OpenAI, signaling it's becoming an industry standard, not just an Anthropic project.[2]

MCP vs Traditional API Integrations

| Aspect | Traditional APIs | MCP |
|---|---|---|
| Discovery | Manual — read docs, write code | Automatic — AI discovers available tools |
| Integration effort | Custom per API × per AI model | Implement once per side |
| Security | Varies wildly | Standardized capability negotiation |
| Composability | Each integration is isolated | AI can combine tools from multiple servers |
| Ecosystem | Fragmented | Growing — 10,000+ community servers |

Existing MCP Servers for Knowledge & Memory

The MCP ecosystem already has servers relevant to building a knowledge base:

  - mcp-server-qdrant: the official Qdrant server for semantic memory[4]
  - rag-mcp-server: a PyPI package exposing a RAG pipeline over MCP[5]
  - postgres-mcp: query PostgreSQL (and pgvector) from any MCP client[6]
  - Memvid: an "SQLite for AI memory" approach to agent recall[10]

๐ŸŽฏ The Fastest Path If you just want a working knowledge base today: install mcp-server-qdrant, run Qdrant locally via Docker, and configure it in Claude Desktop or OpenClaw. Your agent can immediately start storing and retrieving information semantically. We'll build a custom one later for more control.

RAG: How Retrieval-Augmented Generation Works

RAG is the technique that makes this all work. Instead of hoping the AI "knows" something from training, you find relevant information and inject it into the prompt before the AI generates a response. It's like giving the AI an open-book exam instead of a closed-book one.

The RAG pipeline has two phases:

Ingestion Pipeline (Offline)

  1. Ingest โ€” Load documents (PDFs, web pages, markdown, tweets, code files)
  2. Chunk โ€” Split documents into smaller pieces (typically 200โ€“1000 tokens each)
  3. Embed โ€” Convert each chunk into a high-dimensional vector using an embedding model
  4. Store โ€” Save vectors + original text + metadata in a vector database
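The four ingestion steps can be sketched in pure Python. Everything here is a stand-in: `stub_embed` hashes text into a deterministic pseudo-vector where a real system would call an embedding model, and the in-memory `store` list plays the role of a vector database.

```python
import hashlib

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Step 2: fixed-size character chunks with overlap."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def stub_embed(piece: str, dims: int = 8) -> list[float]:
    """Step 3 placeholder: a real system calls an embedding model;
    here we hash the text into a deterministic pseudo-vector."""
    digest = hashlib.sha256(piece.encode()).digest()
    return [b / 255 for b in digest[:dims]]

store: list[dict] = []   # step 4 stand-in for a vector database

def ingest(doc: str, source: str) -> int:
    """Steps 1-4: accept raw text, chunk, embed, store."""
    for i, piece in enumerate(chunk(doc)):
        store.append({"vector": stub_embed(piece),
                      "text": piece,
                      "metadata": {"source": source, "chunk_index": i}})
    return len(store)

n = ingest("Qdrant is a vector database written in Rust. " * 20,
           source="note")
```

Note how consecutive chunks share their overlap region; that redundancy is what keeps sentences from being orphaned at chunk boundaries.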

Query Pipeline (Online)

  1. Embed query โ€” Convert the user's question into a vector using the same embedding model
  2. Search โ€” Find the most similar vectors in the database (cosine similarity / dot product)
  3. Rerank (optional) โ€” Use a cross-encoder to re-score results for better precision
  4. Augment โ€” Inject the top-k retrieved chunks into the LLM prompt as context
  5. Generate โ€” The LLM answers using both its training knowledge and the retrieved context

Chunking Strategies

How you split documents determines retrieval quality more than almost any other decision. Get this wrong and your agent retrieves garbage; get it right and it's magic.

| Strategy | How It Works | Best For | Pitfall |
|---|---|---|---|
| Fixed-size | Split every N tokens/characters with overlap | Simple, predictable. Good starting point | Cuts mid-sentence, breaks context |
| Recursive | Split by paragraphs → sentences → words, recursively | Most content types. LangChain default | Still structural, not semantic |
| Semantic | Embed sentences, split where similarity drops | Long documents with topic shifts | Slower, requires embedding calls during ingestion |
| Heading-aware | Split on markdown/HTML headers | Structured docs, READMEs, wikis | Sections may be too large or too small |
| Agentic / Proposition | LLM extracts atomic facts from text | Highest quality, research applications | Expensive — requires LLM call per chunk |
๐Ÿ’ก Practical Advice Start with recursive chunking at 512 tokens with 50-token overlap. This works for 80% of use cases. Semantic chunking improves recall by up to 9% but adds complexity.[7] Only use proposition chunking for high-value, small corpora where quality matters more than cost.

Embedding Models

The embedding model converts text into vectors. The quality of these vectors directly determines retrieval accuracy. Here's the current landscape:

| Model | Dimensions | Context | Cost | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | $0.02 / 1M tokens | Best value. Great for most use cases |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | $0.13 / 1M tokens | Higher quality, 6.5× more expensive |
| Voyage-3-large | 1024 | 32K tokens | $0.18 / 1M tokens | Top MTEB scores, outperforms OpenAI by 9-20%[7] |
| Cohere embed-v3 | 1024 | 512 tokens | $0.10 / 1M tokens | Built-in search/classification modes |
| nomic-embed-text (local) | 768 | 8192 tokens | Free (local) | Best open-source option. Runs on CPU |
| BGE-large-en-v1.5 (local) | 1024 | 512 tokens | Free (local) | Strong MTEB scores, BAAI model |
| Granite-embedding-278m | 768 | 512 tokens | Free (local) | IBM model, multilingual support |
๐Ÿ’ก Our Pick: OpenAI text-embedding-3-small For a personal knowledge base, cost is the main concern. At $0.02 per million tokens, you can embed 50,000 document chunks for about $0.50. The quality is excellent for general-purpose retrieval. If you want zero cost and full privacy, use nomic-embed-text via Ollama locally.

Retrieval Strategies & Reranking

Similarity Search

The simplest approach: embed the query, find the K nearest vectors by cosine similarity. Fast, works well for straightforward questions. This is what 90% of RAG implementations use.
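Under the hood this is nearest-neighbour search over cosine similarity. A brute-force sketch with toy 3-dimensional "embeddings" (real vector DBs use ANN indexes like HNSW instead of scanning every vector):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float],
          index: list[tuple[list[float], str]], k: int = 2):
    """Brute-force k-nearest-neighbour search over the index."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in index]
    return sorted(scored, reverse=True)[:k]

index = [
    ([1.0, 0.0, 0.0], "doc about Qdrant"),
    ([0.9, 0.1, 0.0], "doc about vector search"),
    ([0.0, 0.0, 1.0], "doc about cooking"),
]
results = top_k([1.0, 0.05, 0.0], index)
```

The query vector points almost the same direction as the two database-related docs, so they rank first and the orthogonal "cooking" doc drops out.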

Hybrid Search (BM25 + Vector)

Combines traditional keyword matching (BM25/TF-IDF) with semantic vector search. This catches cases where the user asks for a specific term that semantic search might miss. For example, searching for "pgvector" โ€” keyword search finds exact matches, while vector search finds conceptually similar content about "PostgreSQL vector extension."

Weaviate and Qdrant have hybrid search built-in. For others, you implement it by running both searches and merging results with reciprocal rank fusion (RRF).
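Reciprocal rank fusion itself is a few lines: each document scores 1/(k + rank) in every ranking it appears in, and the summed scores decide the merged order. A minimal sketch with made-up document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each doc scores 1 / (k + rank) per
    ranking it appears in; ranks start at 1, and k=60 is the
    conventional smoothing constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]   # keyword (BM25) ranking
vector_hits = ["doc1", "doc4", "doc3"]   # semantic ranking
fused = rrf([bm25_hits, vector_hits])
```

Documents that appear in both rankings (doc1, doc3) float to the top, which is exactly the behavior you want from hybrid search.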

Reranking

After initial retrieval (say, top 20 results), a cross-encoder reranker scores each result against the original query more carefully. This is slower but dramatically improves precision. Popular rerankers include Cohere Rerank and open-source cross-encoders such as BGE-reranker.

โš ๏ธ Don't Skip Reranking for Production Initial vector search has ~70-80% precision. Adding a reranker bumps this to 90%+. The difference between "pretty good" and "actually useful" for your agent.

Vector Database Comparison

This is the most-asked question in the RAG space. Here's an honest comparison based on real-world usage, community sentiment, and benchmarks:

| Database | Type | Language | Hybrid Search | Best For | Free Tier |
|---|---|---|---|---|---|
| ChromaDB | Embedded | Python | Basic | Prototyping, small projects | Fully open-source |
| Qdrant | Client-server | Rust | Yes (sparse + dense) | Production local/cloud, fast filtering | Open-source + 1GB cloud free |
| Weaviate | Client-server | Go | Built-in (best) | Knowledge graphs, hybrid search | Open-source + sandbox |
| Pinecone | Managed cloud | — | Yes | Zero-ops, scaling without thinking | Serverless free tier |
| pgvector | PG extension | C | With tsvector | Already using PostgreSQL | Free (PG extension) |
| Milvus | Distributed | Go/C++ | Yes | Enterprise, billions of vectors | Open-source (Zilliz cloud) |
| FAISS | Library | C++ (Python) | No | Research, in-memory speed | Free (Meta library) |
| LanceDB | Embedded | Rust | Yes | Serverless, multimodal | Fully open-source |

Which Should You Pick?

๐ŸŸข Just Getting Started โ†’ ChromaDB

pip install chromadb and you're running. No Docker, no config. Perfect for prototyping your RAG pipeline. Limitation: not designed for production scale or concurrent access.

๐Ÿ”ต Local Production โ†’ Qdrant

docker run -p 6333:6333 qdrant/qdrant and you have a production-grade vector DB running locally. Written in Rust, so it's fast. Has an official MCP server. Best balance of ease-of-use, performance, and features. The Reddit consensus backs this.[8]

๐ŸŸฃ Already Using Postgres โ†’ pgvector

Don't add another database. Install the extension: CREATE EXTENSION vector;. Your vectors live alongside your relational data. Supports HNSW and IVFFlat indexes. Performance is good for up to ~50-100M vectors.

The Complete Architecture

Here's what we're building โ€” an AI agent (OpenClaw/Claude) with persistent, searchable memory that grows over time:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ AI AGENT (Claude / OpenClaw) โ”‚ โ”‚ โ”‚ โ”‚ User asks: "What did we learn about Qdrant pricing?" โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ MCP Client โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ JSON-RPC (stdio) โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ MCP Knowledge Server โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Ingest โ”‚ โ”‚ Search โ”‚ โ”‚ Manage โ”‚ โ”‚ โ”‚ โ”‚ Tool โ”‚ โ”‚ Tool โ”‚ โ”‚ Tool โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ RAG Pipeline โ”‚ โ”‚ โ”‚ โ”‚ chunk โ†’ embed โ†’ store query โ†’ retrieve โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Vector Database (Qdrant / ChromaDB) โ”‚ โ”‚ โ”‚ โ”‚ Collection: "knowledge_base" โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Vector โ”‚ Payload โ”‚ Metadata โ”‚ โ”‚ โ”‚ โ”‚ [0.1,.] โ”‚ "Qdrant's โ”‚ {source: "article", โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ free tier โ”‚ url: "...", โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ includes" โ”‚ date: "2026-02-20"} โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Data Pipeline: Ingest โ†’ Chunk โ†’ Embed โ†’ Store

When the agent encounters useful information (articles, docs, conversations), it calls the MCP server's ingest tool:

  1. Ingest: Accept raw text, URL, file path, or HTML
  2. Parse: Extract clean text (strip HTML, extract PDF text)
  3. Chunk: Split into ~512-token pieces with overlap
  4. Embed: Send chunks to embedding API, get vectors back
  5. Store: Upsert vectors + text + metadata into vector DB

Query Pipeline: Question โ†’ Retrieve โ†’ Augment

When the agent needs to answer a question, it calls the search tool:

  1. Embed query: Convert the question to a vector
  2. Search: Find top-K similar chunks (K=5-10)
  3. Filter: Apply metadata filters (date range, source type)
  4. Return: Send relevant chunks back to the agent as context

Getting Started: The Simplest Setup

Let's build the simplest possible knowledge base: ChromaDB + OpenAI embeddings + a custom MCP server. Total setup time: ~15 minutes.

Prerequisites

  - Python 3.10+
  - An OpenAI API key, exported as OPENAI_API_KEY
  - Claude Desktop or OpenClaw as the MCP client

Step 1: Install Dependencies

pip install chromadb openai "mcp[cli]" beautifulsoup4 PyPDF2

Step 2: Create the MCP Server

This server exposes three tools: ingest_text, ingest_url, and search.

Complete Code: Knowledge Base MCP Server

# knowledge_mcp_server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import chromadb
from openai import OpenAI
import hashlib
from bs4 import BeautifulSoup
import urllib.request

# --- Configuration ---
EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "knowledge_base"
CHUNK_SIZE = 512       # tokens (roughly 4 chars per token)
CHUNK_OVERLAP = 50     # token overlap between chunks
TOP_K = 8              # number of results to return

# --- Init ---
openai_client = OpenAI()  # uses OPENAI_API_KEY env var
chroma_client = chromadb.PersistentClient(path="./knowledge_db")
collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

server = Server("knowledge-base")

def chunk_text(text: str, chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks by approximate token count."""
    text_clean = " ".join(text.split())   # normalize whitespace
    chars_per_chunk = chunk_size * 4      # ~4 chars per token
    chars_overlap = overlap * 4
    chunks = []
    start = 0
    while start < len(text_clean):
        chunk = text_clean[start:start + chars_per_chunk]
        if chunk.strip():
            chunks.append(chunk.strip())
        start += chars_per_chunk - chars_overlap
    return chunks if chunks else [text_clean[:chars_per_chunk]]

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Get embeddings from OpenAI."""
    response = openai_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texts
    )
    return [e.embedding for e in response.data]

def ingest(text: str, source: str = "manual",
           url: str = "", metadata: dict = None):
    """Chunk, embed, and store text."""
    chunks = chunk_text(text)
    embeddings = get_embeddings(chunks)
    ids = [hashlib.md5(
        f"{source}:{i}:{c}".encode()).hexdigest()
        for i, c in enumerate(chunks)]
    metadatas = [
        {"source": source, "url": url,
         "chunk_index": i, "total_chunks": len(chunks),
         **(metadata or {})}
        for i in range(len(chunks))
    ]
    collection.upsert(
        ids=ids, embeddings=embeddings,
        documents=chunks, metadatas=metadatas
    )
    return len(chunks)

@server.list_tools()
async def list_tools():
    return [
        Tool(name="ingest_text",
             description="Store text in the knowledge base "
                         "for future retrieval.",
             inputSchema={
                 "type": "object",
                 "properties": {
                     "text": {"type": "string",
                              "description": "Text to store"},
                     "source": {"type": "string",
                                "description": "Source label"},
                     "url": {"type": "string",
                             "description": "Source URL"}
                 },
                 "required": ["text"]
             }),
        Tool(name="ingest_url",
             description="Fetch a URL, extract text, and store "
                         "in the knowledge base.",
             inputSchema={
                 "type": "object",
                 "properties": {
                     "url": {"type": "string",
                             "description": "URL to fetch"}
                 },
                 "required": ["url"]
             }),
        Tool(name="search",
             description="Search the knowledge base for "
                         "information relevant to a query.",
             inputSchema={
                 "type": "object",
                 "properties": {
                     "query": {"type": "string",
                               "description": "Search query"},
                     "n_results": {"type": "integer",
                                   "description": "Max results",
                                   "default": TOP_K}
                 },
                 "required": ["query"]
             }),
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "ingest_text":
        n = ingest(arguments["text"],
                   arguments.get("source", "manual"),
                   arguments.get("url", ""))
        return [TextContent(type="text",
                text=f"Stored {n} chunks in knowledge base.")]

    elif name == "ingest_url":
        url = arguments["url"]
        req = urllib.request.Request(url, headers={
            "User-Agent": "KnowledgeBot/1.0"})
        html = urllib.request.urlopen(req).read().decode()
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "footer"]):
            tag.decompose()
        text = soup.get_text(separator="\n", strip=True)
        n = ingest(text, source="web", url=url)
        return [TextContent(type="text",
                text=f"Fetched and stored {n} chunks from {url}")]

    elif name == "search":
        query = arguments["query"]
        n_results = arguments.get("n_results", TOP_K)
        q_embedding = get_embeddings([query])[0]
        results = collection.query(
            query_embeddings=[q_embedding],
            n_results=n_results,
            include=["documents", "metadatas", "distances"]
        )
        output = []
        for i, doc in enumerate(results["documents"][0]):
            meta = results["metadatas"][0][i]
            dist = results["distances"][0][i]
            relevance = round(1 - dist, 3)
            output.append(
                f"[{relevance}] {doc[:500]}\n"
                f"  Source: {meta.get('source', '?')} "
                f"| URL: {meta.get('url', 'N/A')}")
        return [TextContent(type="text",
                text="\n\n".join(output) if output
                     else "No relevant results found.")]

async def main():
    async with stdio_server() as (read, write):
        await server.run(read, write,
                         server.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Step 3: Configure Your AI Client

Add to Claude Desktop's claude_desktop_config.json:

{
  "mcpServers": {
    "knowledge-base": {
      "command": "python",
      "args": ["knowledge_mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

For OpenClaw, add to your MCP configuration:

# In your OpenClaw config
mcp:
  servers:
    knowledge-base:
      command: python
      args: ["/path/to/knowledge_mcp_server.py"]
      env:
        OPENAI_API_KEY: "sk-..."

Step 4: Use It

Now your agent has three new capabilities:

  1. ingest_text — store any text it encounters for later recall
  2. ingest_url — fetch a web page, strip the boilerplate, and index it
  3. search — retrieve relevant knowledge semantically

Ingesting Different Data Types

# HTML/Web pages — handled by ingest_url
# For PDFs:
from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
ingest(text, source="pdf", url="document.pdf")

# For Markdown files:
with open("notes.md") as f:
    ingest(f.read(), source="markdown", url="notes.md")

# For tweets/posts (pass as text):
ingest("Thread by @karpathy: The hottest new programming "
       "language is English...",
       source="twitter", url="https://x.com/karpathy/...")

RAG Pitfalls & How to Avoid Them

1. Chunk Size Mismatch

Too small (50 tokens): chunks lack context. "It costs $20/month" — what costs $20/month?
Too large (2000 tokens): chunks contain too many topics, reducing precision.
Sweet spot: 256–512 tokens for most content.

2. Irrelevant Retrieval

The top-K results look relevant by embedding similarity but don't actually answer the question. Fix: Add a relevance threshold (drop results below 0.7 cosine similarity). Add reranking.
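A threshold filter is trivial to add on top of whatever your search tool returns. The scores and documents below are made up for illustration:

```python
def filter_by_relevance(results: list[tuple[float, str]],
                        threshold: float = 0.7) -> list[tuple[float, str]]:
    """Drop hits whose similarity is below the threshold; returning
    nothing beats feeding the LLM noise."""
    return [(score, doc) for score, doc in results if score >= threshold]

# Made-up (score, text) pairs for illustration
hits = [(0.91, "Qdrant free tier includes 1GB"),
        (0.74, "Qdrant is written in Rust"),
        (0.52, "Unrelated note about cooking")]
kept = filter_by_relevance(hits)
```

When the filtered list comes back empty, have the agent say "I don't have anything on that" rather than pass weak matches to the model.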

3. Hallucination from Bad Context

The AI gets retrieved chunks that are tangentially related and confabulates an answer that sounds right but isn't grounded. Fix: Instruct the model to cite which retrieved chunk supports each claim. Use "answer only from the provided context" system prompts.

4. Stale Data

Your knowledge base has outdated information but no way to know. Fix: Store ingestion timestamps as metadata. Let the agent filter by recency. Implement a refresh/re-ingest pipeline for important sources.
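A recency filter is equally simple if each chunk carries an ISO-format ingestion timestamp in its metadata. The `ingested_at` key here is an assumption of this sketch, not a standard field:

```python
from datetime import datetime, timedelta

def filter_recent(results: list[dict], max_age_days: int = 90) -> list[dict]:
    """Keep only chunks ingested within the freshness window, using
    the `ingested_at` timestamp stored as metadata (an assumed key)."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [r for r in results
            if datetime.fromisoformat(r["metadata"]["ingested_at"]) >= cutoff]

now = datetime.now()
results = [
    {"text": "fresh note",
     "metadata": {"ingested_at": (now - timedelta(days=5)).isoformat()}},
    {"text": "stale note",
     "metadata": {"ingested_at": (now - timedelta(days=400)).isoformat()}},
]
fresh = filter_recent(results)
```

In practice you would push this filter into the vector DB query itself (most support metadata filters) rather than post-filtering in Python.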

5. Embedding Model Mismatch

You embedded documents with model A and query with model B. Vectors live in incompatible spaces. Fix: Never mix embedding models. If you switch models, re-embed everything.

What the Community Says

Hacker News

The HN community has been actively discussing these technologies.

Reddit (r/MachineLearning)

The consensus on vector database choice from the ML community: Qdrant for self-hosted production, ChromaDB for quick prototyping, and pgvector if you already run PostgreSQL.[8]

GitHub Trending

The major RAG frameworks in production use are LangChain and LlamaIndex, though many builders skip them entirely.

๐ŸŽฏ Our Take For building an agent knowledge base specifically, skip the frameworks and build directly with chromadb + openai + mcp. The frameworks add abstraction layers you don't need for this use case. When your needs grow, LlamaIndex is the natural next step โ€” it was literally designed for "connect your LLM to your data."

References

  1. Introducing the Model Context Protocol โ€” Anthropic, November 2024
  2. Model Context Protocol โ€” Wikipedia โ€” Overview, history, and AAIF donation
  3. A Year of MCP: From Internal Experiment to Industry Standard โ€” Pento AI, 2025
  4. Vibe Coding RAG with MCP Server โ€” Qdrant Blog
  5. rag-mcp-server โ€” PyPI package for RAG via MCP
  6. postgres-mcp โ€” PostgreSQL MCP server, GitHub
  7. RAG Infrastructure: Building Production Systems โ€” Introl, December 2025
  8. What's the best Vector DB? โ€” r/MachineLearning, February 2025
  9. Forget Vector Databases: RAG with Just SQL and LLM โ€” RisingWave
  10. Memvid โ€” SQLite for AI Memory โ€” GitHub
  11. MCP Specification โ€” modelcontextprotocol.io
  12. Best Vector Databases in 2025: A Complete Comparison โ€” Firecrawl
  13. Integrating Agentic RAG with MCP Servers โ€” Omar Santos, Medium
  14. How to Build a Python MCP Server to Consult a Knowledge Base โ€” Auth0 Blog
๐Ÿ›ก๏ธ No Third-Party Tracking