Introduction
If you're building anything with AI — RAG pipelines, semantic search, recommendation systems, document clustering — you need vector embeddings. And if you're like most developers, you're probably using OpenAI's embedding API to generate them.
That works. But it means every piece of text you embed gets sent to OpenAI's servers, you pay per token, you need an internet connection, and you're locked into their ecosystem. What if you could generate embeddings that are 90–95% as good — completely free, running on your own machine, with zero data leaving your network?
That's exactly what this guide covers. We'll compare every major cloud embedding API against the best open-source local models, show you how to run them, and give you a step-by-step migration path from OpenAI to local embeddings.
What Are Vector Embeddings?
Vector embeddings are numerical representations of text that capture semantic meaning. Instead of treating words as arbitrary symbols, embedding models convert text into dense arrays of floating-point numbers (typically 384 to 4096 dimensions) where similar meanings are close together in the vector space.
Think of it like coordinates on a map. "King" and "Queen" would be close together. "King" and "Banana" would be far apart. But unlike a 2D map, these embeddings exist in hundreds or thousands of dimensions, capturing nuances like formality, sentiment, topic, and context simultaneously.
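"Close together" is usually measured with cosine similarity: the cosine of the angle between two vectors. A minimal sketch using made-up 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
king = [0.9, 0.8, 0.1, 0.2]
queen = [0.85, 0.82, 0.15, 0.18]
banana = [0.1, 0.05, 0.9, 0.85]

print(cosine_similarity(king, queen))   # close to 1.0
print(cosine_similarity(king, banana))  # much lower
```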
Why Embeddings Matter
- RAG (Retrieval-Augmented Generation) — Find the most relevant documents to inject into an LLM's context window. This is the #1 use case driving the embedding boom.
- Semantic Search — Search by meaning, not just keywords. "How to fix a slow computer" finds results about "performance optimization" even if those exact words don't appear.
- Recommendations — Find similar products, articles, or users by comparing their embedding vectors.
- Clustering & Classification — Group documents by topic, detect duplicates, categorize support tickets — all without labeled training data.
- Anomaly Detection — Identify outliers in a dataset by finding vectors far from any cluster.
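All five use cases reduce to the same primitive: embed everything, then rank by vector similarity. A minimal top-k retrieval sketch with NumPy, with toy vectors standing in for real model output:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k most similar documents by cosine similarity."""
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Normalize so the dot product equals cosine similarity
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = docs @ q
    return np.argsort(scores)[::-1][:k]

doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], doc_vecs, k=2))  # [0 1]
```

Real systems delegate this ranking to a vector database with an approximate-nearest-neighbor index, but the math is exactly this.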
Cloud Embedding APIs
Cloud APIs are the easiest way to get started. Send text, get vectors back. No GPU needed, no model downloads, no dependency management. But you pay per token, your data leaves your machine, and you need an internet connection.
OpenAI
The most popular choice. OpenAI offers two current embedding models:[1]
- text-embedding-3-small — 1536 dimensions, $0.02/1M tokens. Good for most use cases. MTEB score ~62.3.
- text-embedding-3-large — 3072 dimensions (configurable down to 256), $0.13/1M tokens. Higher quality. MTEB score ~64.6. Supports Matryoshka dimensionality reduction.
OpenAI's embeddings support 8191 token context length and work well across languages. The API is dead simple, and they integrate with every vector database on the planet. The downside: every text you embed goes through their servers.
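Matryoshka reduction, mentioned above, means the model is trained so that a prefix of the vector is itself a usable embedding: you keep the first N dimensions and re-normalize. A model-free sketch of the operation (the 3072-dim random vector stands in for a real embedding):

```python
import numpy as np

def shorten(embedding, dims):
    """Keep the first `dims` values of a Matryoshka embedding and re-normalize to unit length."""
    v = np.asarray(embedding, dtype=float)[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a 3072-dim embedding
short = shorten(full, 256)
print(short.shape)                             # (256,)
print(round(float(np.linalg.norm(short)), 6))  # 1.0
```

With the real API you'd pass the dimensions parameter instead and let OpenAI do the truncation server-side; the storage savings are the same either way.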
Cohere
embed-v4 is Cohere's latest, scoring 65.2 on MTEB — actually beating OpenAI's large model.[2] It outputs 1024-dimensional vectors and excels with noisy, real-world data (think typos, mixed formatting, messy web scrapes). Pricing is $0.10/1M tokens with a generous free tier of 100 API calls per minute for trial keys. Cohere is particularly strong for enterprise RAG deployments where data quality varies.
Voyage AI
Anthropic's recommended embedding provider. Voyage-3-large scores 66.8 on MTEB with 1536 dimensions at $0.12/1M tokens. They also offer domain-specific models: voyage-code-2 for code search and voyage-law-2 for legal documents. Free tier available with 200 RPM. If you're already in the Anthropic ecosystem (Claude, etc.), Voyage is the natural pairing.[3]
Google Gemini
Gemini-embedding-001 recently took the #1 spot on MTEB with a score of 68.3 and 3072 dimensions — significantly ahead of the competition. Available through Vertex AI with a free tier. At ~$0.004/1K tokens, it's also among the cheapest cloud options. Supports 100+ languages. The catch: you're in the Google Cloud ecosystem.[2]
Amazon Bedrock
Titan Text Embeddings V2 offers 1024 dimensions with 8192 token context. Pricing is ~$0.02/1M tokens. The main advantage is seamless integration with the AWS ecosystem — if you're already on AWS with S3, Lambda, and SageMaker, Titan embeddings slot right in. Quality is decent but below the leaders on MTEB benchmarks.
Cloud Pricing Comparison
| Provider | Model | MTEB Score | Dimensions | Price / 1M Tokens | Free Tier |
|---|---|---|---|---|---|
| Google | gemini-embedding-001 | 68.3 | 3072 | ~$0.004/1K | ✅ Vertex AI |
| Voyage AI | voyage-3-large | 66.8 | 1536 | $0.12 | ✅ 200 RPM |
| Cohere | embed-v4 | 65.2 | 1024 | $0.10 | ✅ 100 calls/min |
| OpenAI | text-embedding-3-large | 64.6 | 3072 | $0.13 | ❌ |
| OpenAI | text-embedding-3-small | 62.3 | 1536 | $0.02 | ❌ |
| Amazon | Titan Text V2 | ~60 | 1024 | ~$0.02 | ❌ |
Local Embedding Models
Here's where things get exciting. The open-source embedding ecosystem has exploded. Several local models now rival or exceed OpenAI's quality — and they're completely free.
The Top Local Models
Qwen3-Embedding (Best Overall Open-Source)
Alibaba's Qwen team released Qwen3-Embedding in 2025, and it immediately shot to the top of the MTEB leaderboard with a score of 70.58 (multilingual). Available in 0.6B, 4B, and 8B parameter variants. The 0.6B version runs on modest hardware while still delivering excellent quality. Apache 2.0 license, up to 4096 dimensions (the 0.6B variant outputs 1024), outstanding multilingual support.[2]
nomic-embed-text
The community favorite for Ollama users. Nomic-embed-text v1.5 scores 59.4 on MTEB with 768 dimensions and 8192 token context. It outperforms OpenAI's text-embedding-ada-002 (the previous generation) and is competitive with text-embedding-3-small. The killer feature: it runs effortlessly via ollama pull nomic-embed-text — one command and you're generating embeddings locally. Apache 2.0 license. Supports Matryoshka dimensionality.[4]
BGE-M3 (BAAI General Embedding)
Beijing Academy of AI's BGE family is a powerhouse. BGE-M3 scores 63.0 on MTEB with 1024 dimensions, supports 100+ languages, and handles up to 8192 tokens. It supports three retrieval methods simultaneously: dense, sparse, and multi-vector. The smaller variants (bge-small-en-v1.5, bge-base-en-v1.5, bge-large-en-v1.5) are great for English-only use cases. MIT license.[5]
all-MiniLM-L6-v2
The classic choice from sentence-transformers. At only 22M parameters and 384 dimensions, it's blazingly fast — easily 10,000+ embeddings per second on a CPU. MTEB score of 56.3 makes it the weakest on this list, but for prototyping, small datasets, or when speed matters more than absolute quality, it's unbeatable. Apache 2.0 license.[6]
E5 (Microsoft)
Microsoft's E5 family (e5-small, e5-base, e5-large, e5-mistral-7b-instruct) spans from tiny to massive. The instruction-tuned e5-mistral-7b-instruct was one of the first LLM-based embedding models and still performs well. The smaller variants offer a good balance of quality and speed. MIT license for most variants.
GTE (Alibaba — General Text Embeddings)
Alibaba's GTE family includes gte-Qwen2-1.5B-instruct and the newer Qwen3-Embedding models. Excellent multilingual support across 100+ languages. The instruction-tuned variants accept custom prompts for different retrieval tasks. Apache 2.0 license.
mxbai-embed-large
From Mixedbread AI, this model delivers excellent quality in a relatively compact package. 1024 dimensions, strong performance on retrieval tasks. Available through Ollama as mxbai-embed-large. Apache 2.0 license. A great choice when you want quality that approaches the larger models without the memory footprint.
Instructor
Unique approach — you provide an instruction like "Represent the Science document for retrieval" with each embedding request, and the model tailors its embeddings to the task. This lets one model serve multiple embedding strategies (search queries vs documents, different domains). Slightly slower due to the instruction prefix, but more flexible.[7]
jina-embeddings-v2
Jina AI's v2 models support 8192 token context — matching cloud APIs. Available in small (33M params) and base (137M params) variants. Good for long documents where you don't want aggressive chunking. Apache 2.0 license. Jina also offers v3 with task-specific adapters.
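Whether you need an 8192-token model or can live with a small context window comes down to chunking. A naive chunker sketch, using word count as a crude stand-in for tokens (an assumption; real tokenizers count differently):

```python
def chunk_words(text, max_words=180, overlap=30):
    """Split text into overlapping word windows (word count as a rough token proxy)."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

sample = " ".join(f"word{i}" for i in range(400))
chunks = chunk_words(sample)  # 180-word windows, 30 words of overlap
print(len(chunks), len(chunks[0].split()))  # 3 180
```

The overlap keeps sentences that straddle a boundary retrievable from both chunks; production pipelines usually chunk on sentence or section boundaries instead of raw word counts.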
Local Model Comparison
| Model | MTEB Score | Dimensions | Parameters | Context | Speed (CPU) | RAM Needed |
|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | ~66 | 1024 | 0.6B | 32K | Medium | ~2 GB |
| BGE-M3 | 63.0 | 1024 | 568M | 8192 | Medium | ~2 GB |
| nomic-embed-text v1.5 | 59.4 | 768 | 137M | 8192 | Fast | ~600 MB |
| mxbai-embed-large | ~61 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| bge-large-en-v1.5 | ~60 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| e5-large-v2 | ~59 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| all-MiniLM-L6-v2 | 56.3 | 384 | 22M | 256 | Very Fast | ~90 MB |
| jina-embeddings-v2-base | ~58 | 768 | 137M | 8192 | Fast | ~600 MB |
How to Run Embeddings Locally
Five battle-tested methods, from easiest to most flexible.
Method 1: Ollama (Easiest)
If you already have Ollama installed, this is a one-liner:[8]
# Pull the model (one time)
ollama pull nomic-embed-text
# Generate embeddings via API
curl http://localhost:11434/api/embeddings \
-d '{"model": "nomic-embed-text", "prompt": "The cat sat on the mat"}'
# Response: {"embedding": [0.023, -0.156, 0.891, ...]}
Ollama also supports mxbai-embed-large, all-minilm, and snowflake-arctic-embed. Switch models by changing the model name — the API is identical. Python usage:
import requests
def get_embedding(text, model="nomic-embed-text"):
response = requests.post("http://localhost:11434/api/embeddings",
json={"model": model, "prompt": text})
return response.json()["embedding"]
embedding = get_embedding("Hello world")
print(f"Dimensions: {len(embedding)}") # 768
Method 2: sentence-transformers (Most Popular Python Library)
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# Load model (downloads on first use, cached after)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Single text
embedding = model.encode("The cat sat on the mat")
print(embedding.shape) # (1024,)
# Batch processing (much faster)
texts = ["First document", "Second document", "Third document"]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(embeddings.shape) # (3, 1024)
sentence-transformers supports virtually every model on HuggingFace. It handles GPU acceleration automatically when CUDA is available.[6]
Method 3: HuggingFace Transformers (Full Control)
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
def get_embeddings(texts):
encoded = tokenizer(texts, padding=True, truncation=True,
max_length=512, return_tensors="pt")
with torch.no_grad():
outputs = model(**encoded)
# Use CLS token embedding, then normalize
embeddings = outputs.last_hidden_state[:, 0]
return F.normalize(embeddings, p=2, dim=1)
embeddings = get_embeddings(["Hello world", "Hi there"])
similarity = torch.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}") # ~0.85
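The CLS-token pooling above is what BGE models expect, but many others (all-MiniLM, E5, nomic-embed) use mean pooling instead: average the token embeddings, masking out padding. A framework-agnostic sketch with NumPy, with shapes mirroring last_hidden_state and attention_mask:

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions.
    last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)."""
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]  # (batch, seq, 1)
    summed = (np.asarray(last_hidden_state, dtype=float) * mask).sum(axis=1)
    counts = mask.sum(axis=1)  # number of real (non-padding) tokens per sequence
    return summed / np.clip(counts, 1e-9, None)

# 1 sequence, 3 tokens (last one is padding), hidden size 2
hidden = [[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]]
mask = [[1, 1, 0]]
print(mean_pool(hidden, mask))  # [[2. 3.]] (padding token excluded)
```

Using the wrong pooling strategy for a model silently degrades quality, so check the model card before swapping models in the Method 3 code.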
Method 4: llama.cpp (Quantized, Low Memory)
For running on machines with limited RAM. GGUF quantized models can cut memory usage by 50–75%:
# Build llama.cpp with embedding support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Download a GGUF embedding model
# (check HuggingFace for GGUF versions of your preferred model)
# Generate embeddings
./embedding -m nomic-embed-text-v1.5.Q4_K_M.gguf \
-p "The cat sat on the mat" --embd-normalize 2
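The 50–75% figure follows from bits per weight: fp16 stores 16 bits per parameter, while Q4_K_M averages roughly 4.5 (an approximation; GGUF block overhead varies by quant type). A back-of-envelope sketch:

```python
def model_memory_gb(params_millions, bits_per_weight):
    """Rough weight-memory estimate; ignores activations and runtime overhead."""
    return params_millions * 1e6 * bits_per_weight / 8 / 1e9

fp16 = model_memory_gb(137, 16)   # nomic-embed-text (137M params) at fp16
q4 = model_memory_gb(137, 4.5)    # Q4_K_M at ~4.5 bits/weight (assumed average)
print(f"fp16: {fp16:.2f} GB, Q4_K_M: {q4:.2f} GB")  # fp16: 0.27 GB, Q4_K_M: 0.08 GB
```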
Method 5: FastEmbed (Optimized for Speed)
Qdrant's FastEmbed library uses ONNX Runtime for optimized inference — often 2–3x faster than sentence-transformers on CPU:[9]
pip install fastembed
from fastembed import TextEmbedding
# Default: BAAI/bge-small-en-v1.5
model = TextEmbedding()
# Or specify a model
model = TextEmbedding("BAAI/bge-large-en-v1.5")
embeddings = list(model.embed(["Hello world", "Hi there"]))
print(len(embeddings[0])) # 1024
The Complete Comparison Matrix
| Feature | OpenAI | Cohere | Local (Ollama) | Local (sentence-transformers) |
|---|---|---|---|---|
| Cost | $0.02–$0.13/1M tokens | $0.10/1M (free tier) | Free forever | Free forever |
| Speed | Fast (network-bound) | Fast (network-bound) | Fast (GPU) / Medium (CPU) | Fast (GPU) / Medium (CPU) |
| Privacy | ⚠️ Data leaves machine | ⚠️ Data leaves machine | ✅ 100% local | ✅ 100% local |
| Quality (MTEB) | 62–65 | 65 | 56–63 (model-dependent) | 56–70 (model-dependent) |
| Setup | API key | API key | ollama pull | pip install |
| Offline | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| GPU Required | No (cloud) | No (cloud) | Optional | Optional |
| Batch Processing | Up to 2048 inputs | Up to 96 inputs | Sequential | Configurable batches |
| Model Choice | 2 models | 1 model | ~5 models | 100s of models |
Quality vs Cost Analysis
Let's address the elephant in the room: are local models actually good enough?
The MTEB Numbers
The MTEB (Massive Text Embedding Benchmark) evaluates models across 8 categories: classification, clustering, pair classification, reranking, retrieval, semantic similarity, summarization, and bitext mining.[10]
Here's what the data shows:
- Qwen3-Embedding-0.6B (free, local) scores ~66 — beating OpenAI text-embedding-3-large (64.6)
- BGE-M3 (free, local) scores 63.0 — nearly matching OpenAI's large model
- nomic-embed-text v1.5 (free, local) scores 59.4 — slightly below OpenAI's small model (62.3) but beating the previous ada-002
- The best local model (Qwen3-Embedding-8B at 70.58) actually exceeds every cloud API on this list, including Google's gemini-embedding-001 (68.3)
When Cloud Still Wins
- Zero setup — If you need embeddings in 5 minutes with no local infrastructure
- Massive scale — If you're embedding billions of documents and don't want to manage GPU clusters
- Cutting-edge quality, zero ops — Google's gemini-embedding-001 (68.3) outscores every local model you can run without serious GPU hardware; only the large Qwen3-Embedding variants beat it
- Specialized domains — Voyage AI's code and legal models are hard to match locally
When Local Wins
- Privacy — Medical records, legal documents, proprietary code. Data never leaves your machine.
- Cost at scale — Embedding 10M documents costs $130+ with OpenAI. Local: $0.
- Offline — Air-gapped environments, unreliable internet, edge devices.
- Latency — No network round-trip. With a GPU, local embeddings can be faster than API calls.
- Control — Fine-tune on your domain, quantize for your hardware, customize everything.
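The cost-at-scale point is easy to sanity-check. Assuming an average chunk of ~650 tokens (our assumption; your chunking will vary), the arithmetic for 10M documents at text-embedding-3-small pricing works out like this:

```python
docs = 10_000_000
avg_tokens_per_doc = 650          # assumed average chunk size
price_per_million_tokens = 0.02   # text-embedding-3-small

total_tokens = docs * avg_tokens_per_doc
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"${cost:,.2f}")  # $130.00 (text-embedding-3-large at $0.13 would be $845.00)
```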
Migration Guide: OpenAI → Local
Ready to make the switch? Here's the step-by-step process.
Step 1: Choose Your Local Model
🟢 Quick & Easy: nomic-embed-text via Ollama
Best for: Getting started fast, Ollama users, moderate quality needs
ollama pull nomic-embed-text — done in 60 seconds
🔵 Best Balance: BGE-M3 via sentence-transformers
Best for: Production use, multilingual, high quality
pip install sentence-transformers + model download (~2 GB)
🟣 Maximum Quality: Qwen3-Embedding
Best for: When you need to beat cloud APIs, have GPU available
0.6B version runs on CPU; 4B and 8B need GPU
Step 2: Re-embed Your Data
This is the critical step. You must re-embed all existing data with your new model. You cannot mix embeddings from different models — they exist in different vector spaces.
# Migration script example
from sentence_transformers import SentenceTransformer
import chromadb
# Load new model
model = SentenceTransformer("BAAI/bge-m3")
# Connect to your existing ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
old_collection = client.get_collection("my_documents")
# Create new collection with new embedding function
new_collection = client.create_collection(
name="my_documents_v2",
metadata={"embedding_model": "bge-m3"}
)
# Get all documents
results = old_collection.get(include=["documents", "metadatas"])
# Re-embed in batches
batch_size = 100
for i in range(0, len(results["documents"]), batch_size):
batch_docs = results["documents"][i:i+batch_size]
batch_meta = results["metadatas"][i:i+batch_size]
batch_ids = results["ids"][i:i+batch_size]
embeddings = model.encode(batch_docs).tolist()
new_collection.add(
documents=batch_docs,
embeddings=embeddings,
metadatas=batch_meta,
ids=batch_ids
)
    print(f"Migrated {min(i + batch_size, len(results['documents']))}/{len(results['documents'])}")
# Swap: delete old, rename new
client.delete_collection("my_documents")
# Note: ChromaDB doesn't support rename — create with original name instead
Step 3: Watch Out For
- Dimension differences — OpenAI text-embedding-3-small outputs 1536 dims; nomic-embed-text outputs 768. Your vector database index needs to match.
- Normalization — Some models output normalized vectors (unit length), others don't. BGE models expect you to normalize. Check your model's docs.
- Query prefixes — Some models (BGE, E5, Instructor) expect a prefix like "Represent this sentence:" for queries vs documents. Missing this drops quality significantly.
- Context length — OpenAI handles 8191 tokens. all-MiniLM only handles 256. Match your chunking strategy to your model.
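The query-prefix pitfall deserves a concrete example. The E5 family documents "query: " and "passage: " prefixes, and BGE v1.5 recommends an instruction prefix for queries only; the tiny wrapper below is our own sketch, not part of either library:

```python
# Documented prefixes for the E5 family; BGE's English query instruction shown for contrast.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def e5_format(text, kind="passage"):
    """Prepend the role prefix E5 models were trained with."""
    return E5_PREFIXES[kind] + text

print(e5_format("How do I fix a slow computer?", kind="query"))
# query: How do I fix a slow computer?
```

Apply the query prefix only at search time and the passage prefix at indexing time; mixing them up (or omitting them) is one of the most common causes of "local embeddings seem worse" reports.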
Our Setup
At ThinkSmart.Life, we currently use OpenAI text-embedding-3-small for our vector database powering the AI agent's knowledge base (built with ChromaDB — see our MCP + RAG + Vector DB guide).
Here's our cost reality:
- We embed ~50,000 chunks across research articles, documentation, and web scrapes
- At $0.02/1M tokens, that costs us roughly $2–5/month in embedding costs
- Not bank-breaking, but it adds up — and every text we embed goes through OpenAI's servers
Our migration plan: once we have the GPU rig set up, we're switching to BGE-M3 or Qwen3-Embedding-0.6B running locally via sentence-transformers. ChromaDB makes this straightforward — you can swap the embedding function with a few lines of code:
import chromadb
from chromadb.utils import embedding_functions
# Before: OpenAI
# ef = embedding_functions.OpenAIEmbeddingFunction(
# api_key="sk-...", model_name="text-embedding-3-small")
# After: Local via sentence-transformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="BAAI/bge-m3"
)
collection = client.get_or_create_collection(
name="knowledge_base",
embedding_function=ef
)
What the Community Says
The developer community has been rapidly adopting local embeddings. Here's what we found:
- Hacker News discussions consistently show developers moving to local models for privacy and cost reasons. Projects like VectorDB-CLI demonstrate the growing ecosystem of tools built around local embedding models.
- The MTEB leaderboard on HuggingFace has become the de facto standard for comparing models, with new submissions constantly pushing the state of the art.[10]
- Ollama's embedding support has made local embeddings accessible to developers who previously found the Python ML ecosystem intimidating. "One command to pull, one API call to embed" is a compelling story.
- FastEmbed by Qdrant is gaining traction as the "production-ready" alternative to sentence-transformers, with ONNX optimization delivering 2–3x speed improvements on CPU.
- The Matryoshka approach (variable dimensionality from a single embedding) is being adopted by both cloud and local models, letting you trade off quality vs storage/speed without retraining.
References
1. OpenAI Embeddings Documentation — platform.openai.com/docs/guides/embeddings
2. MTEB Leaderboard & Model Scores — Ailog RAG: Best Embedding Models 2025
3. Voyage AI Embedding Models — voyageai.com
4. Nomic Embed: A Truly Open Embedding Model — nomic.ai/blog; Tiger Data RAG Evaluation — tigerdata.com
5. BGE Models — huggingface.co/BAAI/bge-m3
6. sentence-transformers Documentation — sbert.net
7. Instructor Embedding — instructor-embedding.github.io
8. Ollama Embedding Models — ollama.com/blog/embedding-models
9. FastEmbed by Qdrant — github.com/qdrant/fastembed
10. MTEB: Massive Text Embedding Benchmark — huggingface.co/spaces/mteb/leaderboard
11. Modal: Top Embedding Models on the MTEB Leaderboard — modal.com/blog