
Introduction

If you're building anything with AI — RAG pipelines, semantic search, recommendation systems, document clustering — you need vector embeddings. And if you're like most developers, you're probably using OpenAI's embedding API to generate them.

That works. But it means every piece of text you embed gets sent to OpenAI's servers, you pay per token, you need an internet connection, and you're locked into their ecosystem. What if you could generate embeddings that are 90–95% as good — completely free, running on your own machine, with zero data leaving your network?

That's exactly what this guide covers. We'll compare every major cloud embedding API against the best open-source local models, show you how to run them, and give you a step-by-step migration path from OpenAI to local embeddings.

💡 The Bottom Line: For most RAG and search use cases, local embedding models like nomic-embed-text or BGE-M3 deliver results that are practically indistinguishable from OpenAI — at zero cost and with complete privacy. The quality gap has narrowed dramatically since 2024.

What Are Vector Embeddings?

Vector embeddings are numerical representations of text that capture semantic meaning. Instead of treating words as arbitrary symbols, embedding models convert text into dense arrays of floating-point numbers (typically 384 to 4096 dimensions) where similar meanings are close together in the vector space.

Think of it like coordinates on a map. "King" and "Queen" would be close together. "King" and "Banana" would be far apart. But unlike a 2D map, these embeddings exist in hundreds or thousands of dimensions, capturing nuances like formality, sentiment, topic, and context simultaneously.

Why Embeddings Matter

"The cat sat on the mat"
        ↓ Embedding Model
[0.023, -0.156, 0.891, 0.442, -0.033, ...]  ← 768 dimensions

"A feline rested on the rug"
        ↓ Same Model
[0.021, -0.149, 0.887, 0.445, -0.031, ...]  ← Very similar vector!

Cosine Similarity: 0.97 → The model "understands" they mean the same thing
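That similarity number is just the cosine of the angle between the two vectors. Here's a minimal sketch of the computation in plain Python (the vectors are made-up toy values, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
cat_mat = [0.023, -0.156, 0.891, 0.442, -0.033]
feline_rug = [0.021, -0.149, 0.887, 0.445, -0.031]
banana = [-0.712, 0.524, -0.102, 0.009, 0.631]

print(cosine_similarity(cat_mat, feline_rug))  # close to 1.0
print(cosine_similarity(cat_mat, banana))      # much lower
```

In practice you never compute this by hand — every vector database does it for you — but it's the operation underpinning everything else in this guide.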

Cloud Embedding APIs

Cloud APIs are the easiest way to get started. Send text, get vectors back. No GPU needed, no model downloads, no dependency management. But you pay per token, your data leaves your machine, and you need an internet connection.

OpenAI

The most popular choice. OpenAI offers two current embedding models:[1]

- text-embedding-3-small: 1536 dimensions, $0.02/1M tokens
- text-embedding-3-large: 3072 dimensions, $0.13/1M tokens

OpenAI's embeddings support an 8,191-token context and work well across languages. The API is dead simple, and they integrate with every vector database on the planet. The downside: every text you embed goes through their servers.
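Under the hood it's a single POST to the /v1/embeddings endpoint. A minimal sketch using requests (actually sending it requires a real key in OPENAI_API_KEY; the model name defaults to text-embedding-3-small as an assumption):

```python
import os
import requests

def build_request(texts, model="text-embedding-3-small"):
    """Build the JSON body for OpenAI's /v1/embeddings endpoint."""
    return {"model": model, "input": texts}

def embed(texts, model="text-embedding-3-small"):
    resp = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=build_request(texts, model),
        timeout=30,
    )
    resp.raise_for_status()
    # One embedding per input, returned in order
    return [item["embedding"] for item in resp.json()["data"]]

# embed(["The cat sat on the mat"])  # requires a valid API key
```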

Cohere

embed-v4 is Cohere's latest, scoring 65.2 on MTEB — actually beating OpenAI's large model.[2] It outputs 1024-dimensional vectors and excels with noisy, real-world data (think typos, mixed formatting, messy web scrapes). Pricing is $0.10/1M tokens with a generous free tier of 100 API calls per minute for trial keys. Cohere is particularly strong for enterprise RAG deployments where data quality varies.

Voyage AI

Anthropic's recommended embedding provider. Voyage-3-large scores 66.8 on MTEB with 1536 dimensions at $0.12/1M tokens. They also offer domain-specific models: voyage-code-2 for code search and voyage-law-2 for legal documents. Free tier available with 200 RPM. If you're already in the Anthropic ecosystem (Claude, etc.), Voyage is the natural pairing.[3]

Google

Gemini-embedding-001 recently took the #1 spot on MTEB with a score of 68.3 and 3072 dimensions — significantly ahead of the competition. Available through Vertex AI with a free tier. At ~$0.004/1K tokens, it's also among the cheapest cloud options. Supports 100+ languages. The catch: you're in the Google Cloud ecosystem.[2]

Amazon Bedrock

Titan Text Embeddings V2 offers 1024 dimensions with 8192 token context. Pricing is ~$0.02/1M tokens. The main advantage is seamless integration with the AWS ecosystem — if you're already on AWS with S3, Lambda, and SageMaker, Titan embeddings slot right in. Quality is decent but below the leaders on MTEB benchmarks.

Cloud Pricing Comparison

| Provider | Model | MTEB Score | Dimensions | Price / 1M Tokens | Free Tier |
|---|---|---|---|---|---|
| Google | gemini-embedding-001 | 68.3 | 3072 | ~$0.004 / 1K | ✅ Vertex AI |
| Voyage AI | voyage-3-large | 66.8 | 1536 | $0.12 | ✅ 200 RPM |
| Cohere | embed-v4 | 65.2 | 1024 | $0.10 | ✅ 100 calls/min |
| OpenAI | text-embedding-3-large | 64.6 | 3072 | $0.13 | ❌ |
| OpenAI | text-embedding-3-small | 62.3 | 1536 | $0.02 | ❌ |
| Amazon | Titan Text V2 | ~60 | 1024 | ~$0.02 | ❌ |
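To put these rates in perspective, the per-corpus cost is simple arithmetic on the per-1M-token prices above:

```python
def embedding_cost(total_tokens, price_per_1m_tokens):
    """Cost in dollars to embed a corpus at a given per-1M-token rate."""
    return total_tokens / 1_000_000 * price_per_1m_tokens

# Example: a 50M-token corpus (very roughly 10,000 long documents)
corpus_tokens = 50_000_000
print(embedding_cost(corpus_tokens, 0.13))  # text-embedding-3-large: $6.50
print(embedding_cost(corpus_tokens, 0.02))  # text-embedding-3-small: $1.00
print(embedding_cost(corpus_tokens, 0.00))  # any local model: $0.00
```

One-time embedding of even large corpora is cheap; the cost that adds up is continuous re-embedding of new and updated content, plus embedding every query at search time.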

Local Embedding Models

Here's where things get exciting. The open-source embedding ecosystem has exploded. Several local models now rival or exceed OpenAI's quality — and they're completely free.

The Top Local Models

Qwen3-Embedding (Best Overall Open-Source)

Alibaba's Qwen team released Qwen3-Embedding in 2025, and it immediately shot to the top of the MTEB leaderboard with a score of 70.58 (multilingual). Available in 0.6B, 4B, and 8B parameter variants. The 0.6B version runs on modest hardware while still delivering excellent quality. Apache 2.0 license, 4096 dimensions, outstanding multilingual support.[2]

nomic-embed-text

The community favorite for Ollama users. Nomic-embed-text v1.5 scores 59.4 on MTEB with 768 dimensions and 8192 token context. It outperforms OpenAI's text-embedding-ada-002 (the previous generation) and is competitive with text-embedding-3-small. The killer feature: it runs effortlessly via ollama pull nomic-embed-text — one command and you're generating embeddings locally. Apache 2.0 license. Supports Matryoshka dimensionality.[4]
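Matryoshka-trained models pack the most important information into the leading dimensions, so you can truncate an embedding to its first k components and re-normalize, trading a little quality for much smaller storage. A minimal sketch of that truncation (the vector here is a toy value, not real model output):

```python
import math

def truncate_matryoshka(embedding, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.4, -0.2, 0.1, 0.8, 0.3, -0.1, 0.05, 0.2]  # pretend 8-dim embedding
small = truncate_matryoshka(full, 4)
print(len(small))                    # 4
print(sum(x * x for x in small))     # ~1.0 (unit length again)
```

This only works well for models explicitly trained with Matryoshka representation learning; truncating an ordinary embedding degrades quality much faster.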

BGE-M3 (BAAI General Embedding)

Beijing Academy of AI's BGE family is a powerhouse. BGE-M3 scores 63.0 on MTEB with 1024 dimensions, supports 100+ languages, and handles up to 8192 tokens. It supports three retrieval methods simultaneously: dense, sparse, and multi-vector. The smaller variants (bge-small-en-v1.5, bge-base-en-v1.5, bge-large-en-v1.5) are great for English-only use cases. MIT license.[5]
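Dense and sparse scores live on different scales, so hybrid retrieval typically fuses them with a weighted sum before ranking. A toy sketch of that fusion (the weights and scores are made up for illustration; BGE-M3's own tooling handles this in practice):

```python
def hybrid_score(dense, sparse, w_dense=0.7, w_sparse=0.3):
    """Weighted fusion of a dense (semantic) and sparse (keyword) score."""
    return w_dense * dense + w_sparse * sparse

# Made-up (dense, sparse) scores for three candidate documents
candidates = {"doc_a": (0.82, 0.10), "doc_b": (0.75, 0.90), "doc_c": (0.60, 0.20)}
ranked = sorted(candidates, key=lambda d: hybrid_score(*candidates[d]), reverse=True)
print(ranked)  # doc_b overtakes doc_a thanks to its strong keyword score
```

The design point: sparse scores rescue exact-keyword matches (product codes, names) that pure semantic similarity can miss.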

all-MiniLM-L6-v2

The classic choice from sentence-transformers. At only 22M parameters and 384 dimensions, it's blazingly fast — easily 10,000+ embeddings per second on a CPU. MTEB score of 56.3 makes it the weakest on this list, but for prototyping, small datasets, or when speed matters more than absolute quality, it's unbeatable. Apache 2.0 license.[6]

E5 (Microsoft)

Microsoft's E5 family (e5-small, e5-base, e5-large, e5-mistral-7b-instruct) spans from tiny to massive. The instruction-tuned e5-mistral-7b-instruct was one of the first LLM-based embedding models and still performs well. The smaller variants offer a good balance of quality and speed. MIT license for most variants.

GTE (Alibaba — General Text Embeddings)

Alibaba's GTE family includes gte-Qwen2-1.5B-instruct and the newer Qwen3-Embedding models. Excellent multilingual support across 100+ languages. The instruction-tuned variants accept custom prompts for different retrieval tasks. Apache 2.0 license.

mxbai-embed-large

From Mixedbread AI, this model delivers excellent quality in a relatively compact package. 1024 dimensions, strong performance on retrieval tasks. Available through Ollama as mxbai-embed-large. Apache 2.0 license. A great choice when you want quality that approaches the larger models without the memory footprint.

Instructor

Unique approach — you provide an instruction like "Represent the Science document for retrieval" with each embedding request, and the model tailors its embeddings to the task. This lets one model serve multiple embedding strategies (search queries vs documents, different domains). Slightly slower due to the instruction prefix, but more flexible.[7]

jina-embeddings-v2

Jina AI's v2 models support 8192 token context — matching cloud APIs. Available in small (33M params) and base (137M params) variants. Good for long documents where you don't want aggressive chunking. Apache 2.0 license. Jina also offers v3 with task-specific adapters.
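To see why an 8192-token window matters: short-context models force you to split long documents into overlapping chunks before embedding. A minimal sliding-window chunker, word-based for simplicity (real pipelines split on tokens, and the sizes here are arbitrary examples):

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of `chunk_size` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
print(len(chunk_words(doc, chunk_size=200, overlap=50)))  # 3 overlapping chunks
```

A model with an 8192-token context can often embed the whole document in one shot, which means fewer vectors to store and no risk of splitting an answer across chunk boundaries.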

Local Model Comparison

| Model | MTEB Score | Dimensions | Parameters | Context | Speed (CPU) | RAM Needed |
|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | ~66 | 4096 | 0.6B | 32K | Medium | ~2 GB |
| BGE-M3 | 63.0 | 1024 | 568M | 8192 | Medium | ~2 GB |
| nomic-embed-text v1.5 | 59.4 | 768 | 137M | 8192 | Fast | ~600 MB |
| mxbai-embed-large | ~61 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| bge-large-en-v1.5 | ~60 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| e5-large-v2 | ~59 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| all-MiniLM-L6-v2 | 56.3 | 384 | 22M | 256 | Very Fast | ~90 MB |
| jina-embeddings-v2-base | ~58 | 768 | 137M | 8192 | Fast | ~600 MB |

How to Run Embeddings Locally

Five battle-tested methods, from easiest to most flexible.

Method 1: Ollama (Easiest)

If you already have Ollama installed, this is a one-liner:[8]

# Pull the model (one time)
ollama pull nomic-embed-text

# Generate embeddings via API
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "The cat sat on the mat"}'

# Response: {"embedding": [0.023, -0.156, 0.891, ...]}

Ollama also supports mxbai-embed-large, all-minilm, and snowflake-arctic-embed. Switch models by changing the model name — the API is identical. Python usage:

import requests

def get_embedding(text, model="nomic-embed-text"):
    response = requests.post("http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text})
    return response.json()["embedding"]

embedding = get_embedding("Hello world")
print(f"Dimensions: {len(embedding)}")  # 768

Method 2: sentence-transformers (Most Popular Python Library)

pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load model (downloads on first use, cached after)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Single text
embedding = model.encode("The cat sat on the mat")
print(embedding.shape)  # (1024,)

# Batch processing (much faster)
texts = ["First document", "Second document", "Third document"]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (3, 1024)

sentence-transformers supports virtually every model on HuggingFace. It handles GPU acceleration automatically when CUDA is available.[6]

Method 3: HuggingFace Transformers (Full Control)

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")

def get_embeddings(texts):
    encoded = tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoded)
    # Use CLS token embedding, then normalize
    embeddings = outputs.last_hidden_state[:, 0]
    return F.normalize(embeddings, p=2, dim=1)

embeddings = get_embeddings(["Hello world", "Hi there"])
similarity = torch.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")  # ~0.85

Method 4: llama.cpp (Quantized, Low Memory)

For running on machines with limited RAM. GGUF quantized models can cut memory usage by 50–75%:

# Build llama.cpp with embedding support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a GGUF embedding model
# (check HuggingFace for GGUF versions of your preferred model)

# Generate embeddings (the binary is named llama-embedding in recent builds)
./embedding -m nomic-embed-text-v1.5.Q4_K_M.gguf \
  -p "The cat sat on the mat" --embd-normalize 2

Method 5: FastEmbed (Optimized for Speed)

Qdrant's FastEmbed library uses ONNX Runtime for optimized inference — often 2–3x faster than sentence-transformers on CPU:[9]

pip install fastembed
from fastembed import TextEmbedding

# Default: BAAI/bge-small-en-v1.5
model = TextEmbedding()

# Or specify a model
model = TextEmbedding("BAAI/bge-large-en-v1.5")

embeddings = list(model.embed(["Hello world", "Hi there"]))
print(len(embeddings[0]))  # 1024

The Complete Comparison Matrix

| Feature | OpenAI | Cohere | Local (Ollama) | Local (sentence-transformers) |
|---|---|---|---|---|
| Cost | $0.02–$0.13/1M tokens | $0.10/1M (free tier) | Free forever | Free forever |
| Speed | Fast (network-bound) | Fast (network-bound) | Fast (GPU) / Medium (CPU) | Fast (GPU) / Medium (CPU) |
| Privacy | ⚠️ Data leaves machine | ⚠️ Data leaves machine | ✅ 100% local | ✅ 100% local |
| Quality (MTEB) | 62–65 | 65 | 56–63 (model-dependent) | 56–70 (model-dependent) |
| Setup | API key | API key | ollama pull | pip install |
| Offline | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| GPU Required | No (cloud) | No (cloud) | Optional | Optional |
| Batch Processing | Up to 2048 inputs | Up to 96 inputs | Sequential | Configurable batches |
| Model Choice | 2 models | 1 model | ~5 models | 100s of models |

Quality vs Cost Analysis

Let's address the elephant in the room: are local models actually good enough?

The MTEB Numbers

The MTEB (Massive Text Embedding Benchmark) evaluates models across 8 categories: classification, clustering, pair classification, reranking, retrieval, semantic similarity, summarization, and bitext mining.[10]

Here's what the data shows:

🎯 Key Insight: The gap between local and cloud embeddings has almost disappeared. In a real-world RAG evaluation by Tiger Data, OpenAI's large model achieved 80.5% accuracy while nomic-embed-text hit 71% and BGE-large reached 71.5%. That's a meaningful but narrowing gap — and for many applications, the difference is imperceptible to end users.[4]

When Cloud Still Wins

- You need the absolute top of the quality leaderboard (gemini-embedding-001's 68.3 MTEB) without managing any hardware
- You have no GPU and need high throughput at scale
- You want zero operational overhead: no model downloads, updates, or dependency management

When Local Wins

- Privacy: no text ever leaves your network
- Cost: zero per-token fees, no matter the volume
- Offline or air-gapped environments
- No API keys, rate limits, or service outages to plan around

Migration Guide: OpenAI → Local

Ready to make the switch? Here's the step-by-step process.

Step 1: Choose Your Local Model

🟢 Quick & Easy: nomic-embed-text via Ollama

Best for: Getting started fast, Ollama users, moderate quality needs

ollama pull nomic-embed-text — done in 60 seconds

🔵 Best Balance: BGE-M3 via sentence-transformers

Best for: Production use, multilingual, high quality

pip install sentence-transformers + model download (~2 GB)

🟣 Maximum Quality: Qwen3-Embedding

Best for: When you need to beat cloud APIs, have GPU available

0.6B version runs on CPU; 4B and 8B need GPU

Step 2: Re-embed Your Data

This is the critical step. You must re-embed all existing data with your new model. You cannot mix embeddings from different models — they exist in different vector spaces.

# Migration script example
from sentence_transformers import SentenceTransformer
import chromadb

# Load new model
model = SentenceTransformer("BAAI/bge-m3")

# Connect to your existing ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
old_collection = client.get_collection("my_documents")

# Create new collection with new embedding function
new_collection = client.create_collection(
    name="my_documents_v2",
    metadata={"embedding_model": "bge-m3"}
)

# Get all documents
results = old_collection.get(include=["documents", "metadatas"])

# Re-embed in batches
batch_size = 100
for i in range(0, len(results["documents"]), batch_size):
    batch_docs = results["documents"][i:i+batch_size]
    batch_meta = results["metadatas"][i:i+batch_size]
    batch_ids = results["ids"][i:i+batch_size]

    embeddings = model.encode(batch_docs).tolist()

    new_collection.add(
        documents=batch_docs,
        embeddings=embeddings,
        metadatas=batch_meta,
        ids=batch_ids
    )
    print(f"Migrated {min(i + batch_size, len(results['documents']))}/{len(results['documents'])}")

# Swap: delete old, rename new
client.delete_collection("my_documents")
# Note: ChromaDB doesn't support rename — create with original name instead
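Before deleting the old collection, it's worth a quick sanity check that nearest-neighbor search behaves sensibly on the new embeddings. A minimal brute-force cosine top-k, shown here on synthetic vectors (in practice you'd encode a few known query/document pairs with the new model and confirm the expected documents rank first):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, doc_vectors, k=2):
    """Brute-force cosine top-k; assumes all vectors are L2-normalized."""
    scores = {
        doc_id: sum(q * d for q, d in zip(query, vec))
        for doc_id, vec in doc_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Synthetic stand-ins for re-embedded documents
docs = {
    "cats": normalize([0.9, 0.1, 0.0]),
    "dogs": normalize([0.7, 0.3, 0.1]),
    "tax_law": normalize([0.0, 0.2, 0.9]),
}
query = normalize([0.8, 0.2, 0.05])
print(top_k(query, docs))  # expect the animal documents to rank first
```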

Step 3: Watch Out For

- Dimension changes: your vector index must match the new model's output size (e.g., 1536 → 1024), so collections and schemas need updating
- Query prefixes: several local models (E5, BGE, nomic-embed-text) expect instruction prefixes such as "query:" or "search_document:" and lose accuracy without them; check the model card
- Normalization: cosine similarity assumes unit-length vectors; some libraries normalize by default and others don't
- Cutover: keep serving the old collection until re-embedding finishes, then swap; never mix vectors from two models in one index

Our Setup

At ThinkSmart.Life, we currently use OpenAI text-embedding-3-small for our vector database powering the AI agent's knowledge base (built with ChromaDB — see our MCP + RAG + Vector DB guide).

Here's our cost reality:

Our migration plan: once we have the GPU rig set up, we're switching to BGE-M3 or Qwen3-Embedding-0.6B running locally via sentence-transformers. ChromaDB makes this straightforward — you can swap the embedding function with a few lines of code:

import chromadb
from chromadb.utils import embedding_functions

# Before: OpenAI
# ef = embedding_functions.OpenAIEmbeddingFunction(
#     api_key="sk-...", model_name="text-embedding-3-small")

# After: Local via sentence-transformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-m3"
)

collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=ef
)

💡 The Real Win: Switching to local embeddings doesn't just save money — it eliminates a dependency. No API keys to manage, no rate limits to worry about, no service outages to plan around. Your embedding pipeline becomes a self-contained, portable piece of infrastructure that runs anywhere.

What the Community Says

The developer community has been rapidly adopting local embeddings. Here's what we found:

⚠️ One Caveat: Don't chase the MTEB leaderboard blindly. The overall score blends 8 different task categories. A model that tops the leaderboard might underperform on retrieval (the task that matters most for RAG) while excelling at classification or clustering. Always check the specific benchmark that matches your use case.[11]

References

  1. OpenAI Embeddings Documentation — platform.openai.com/docs/guides/embeddings
  2. MTEB Leaderboard & Model Scores — Ailog RAG: Best Embedding Models 2025
  3. Voyage AI Embedding Models — voyageai.com
  4. Nomic Embed: A Truly Open Embedding Model — nomic.ai/blog; Tiger Data RAG Evaluation — tigerdata.com
  5. BGE Models — huggingface.co/BAAI/bge-m3
  6. sentence-transformers Documentation — sbert.net
  7. Instructor Embedding — instructor-embedding.github.io
  8. Ollama Embedding Models — ollama.com/blog/embedding-models
  9. FastEmbed by Qdrant — github.com/qdrant/fastembed
  10. MTEB: Massive Text Embedding Benchmark — huggingface.co/spaces/mteb/leaderboard
  11. Modal: Top Embedding Models on the MTEB Leaderboard — modal.com/blog