Introduction
If you're building anything with AI — RAG pipelines, semantic search, recommendation systems, document clustering — you need vector embeddings. And if you're like most developers, you're probably using OpenAI's embedding API to generate them.
That works. But it means every piece of text you embed gets sent to OpenAI's servers, you pay per token, you need an internet connection, and you're locked into their ecosystem. What if you could generate embeddings that are 90–95% as good — completely free, running on your own machine, with zero data leaving your network?
That's exactly what this guide covers. We'll compare every major cloud embedding API against the best open-source local models, show you how to run them, and give you a step-by-step migration path from OpenAI to local embeddings.
What Are Vector Embeddings?
Vector embeddings are numerical representations of text that capture semantic meaning. Instead of treating words as arbitrary symbols, embedding models convert text into dense arrays of floating-point numbers (typically 384 to 4096 dimensions) where similar meanings are close together in the vector space.
Think of it like coordinates on a map. "King" and "Queen" would be close together. "King" and "Banana" would be far apart. But unlike a 2D map, these embeddings exist in hundreds or thousands of dimensions, capturing nuances like formality, sentiment, topic, and context simultaneously.
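"Close together" is usually measured with cosine similarity: the cosine of the angle between two vectors. A minimal sketch using made-up 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
king = [0.9, 0.8, 0.1, 0.2]
queen = [0.85, 0.82, 0.15, 0.18]
banana = [0.1, 0.05, 0.9, 0.85]

print(cosine_similarity(king, queen))   # close to 1.0
print(cosine_similarity(king, banana))  # much lower
```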
Why Embeddings Matter
- RAG (Retrieval-Augmented Generation) — Find the most relevant documents to inject into an LLM's context window. This is the #1 use case driving the embedding boom.
- Semantic Search — Search by meaning, not just keywords. "How to fix a slow computer" finds results about "performance optimization" even if those exact words don't appear.
- Recommendations — Find similar products, articles, or users by comparing their embedding vectors.
- Clustering & Classification — Group documents by topic, detect duplicates, categorize support tickets — all without labeled training data.
- Anomaly Detection — Identify outliers in a dataset by finding vectors far from any cluster.
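All five use cases reduce to the same primitive: embed everything, then rank by vector similarity. A minimal top-k retrieval sketch with NumPy, with toy vectors standing in for real model output:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k most similar documents by cosine similarity."""
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Normalize so the dot product equals cosine similarity
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = docs @ q
    return np.argsort(scores)[::-1][:k]

doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], doc_vecs, k=2))  # [0 1]
```

Real systems delegate this ranking to a vector database with an approximate-nearest-neighbor index, but the math is exactly this.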
Cloud Embedding APIs
Cloud APIs are the easiest way to get started. Send text, get vectors back. No GPU needed, no model downloads, no dependency management. But you pay per token, your data leaves your machine, and you need an internet connection.
OpenAI
The most popular choice. OpenAI offers two current embedding models:[1]
- text-embedding-3-small — 1536 dimensions, $0.02/1M tokens. Good for most use cases. MTEB score ~62.3.
- text-embedding-3-large — 3072 dimensions (configurable down to 256), $0.13/1M tokens. Higher quality. MTEB score ~64.6. Supports Matryoshka dimensionality reduction.
OpenAI's embeddings support 8191 token context length and work well across languages. The API is dead simple, and they integrate with every vector database on the planet. The downside: every text you embed goes through their servers.
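Matryoshka reduction, mentioned above, means the model is trained so that a prefix of the vector is itself a usable embedding: you keep the first N dimensions and re-normalize. A model-free sketch of the operation (the 3072-dim random vector stands in for a real embedding):

```python
import numpy as np

def shorten(embedding, dims):
    """Keep the first `dims` values of a Matryoshka embedding and re-normalize to unit length."""
    v = np.asarray(embedding, dtype=float)[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a 3072-dim embedding
short = shorten(full, 256)
print(short.shape)                             # (256,)
print(round(float(np.linalg.norm(short)), 6))  # 1.0
```

With the real API you'd pass the dimensions parameter instead and let OpenAI do the truncation server-side; the storage savings are the same either way.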
Cohere
embed-v4 is Cohere's latest, scoring 65.2 on MTEB — actually beating OpenAI's large model.[2] It outputs 1024-dimensional vectors and excels with noisy, real-world data (think typos, mixed formatting, messy web scrapes). Pricing is $0.10/1M tokens with a generous free tier of 100 API calls per minute for trial keys. Cohere is particularly strong for enterprise RAG deployments where data quality varies.
Voyage AI
Anthropic's recommended embedding provider. Voyage-3-large scores 66.8 on MTEB with 1536 dimensions at $0.12/1M tokens. They also offer domain-specific models: voyage-code-2 for code search and voyage-law-2 for legal documents. Free tier available with 200 RPM. If you're already in the Anthropic ecosystem (Claude, etc.), Voyage is the natural pairing.[3]
Google Gemini
Gemini-embedding-001 recently took the #1 spot on MTEB with a score of 68.3 and 3072 dimensions — significantly ahead of the competition. Available through Vertex AI with a free tier. At ~$0.004/1K tokens, it's also among the cheapest cloud options. Supports 100+ languages. The catch: you're in the Google Cloud ecosystem.[2]
Amazon Bedrock
Titan Text Embeddings V2 offers 1024 dimensions with 8192 token context. Pricing is ~$0.02/1M tokens. The main advantage is seamless integration with the AWS ecosystem — if you're already on AWS with S3, Lambda, and SageMaker, Titan embeddings slot right in. Quality is decent but below the leaders on MTEB benchmarks.
Cloud Pricing Comparison
| Provider | Model | MTEB Score | Dimensions | Price / 1M Tokens | Free Tier |
|---|---|---|---|---|---|
| Google | gemini-embedding-001 | 68.3 | 3072 | ~$0.004/1K | ✅ Vertex AI |
| Voyage AI | voyage-3-large | 66.8 | 1536 | $0.12 | ✅ 200 RPM |
| Cohere | embed-v4 | 65.2 | 1024 | $0.10 | ✅ 100 calls/min |
| OpenAI | text-embedding-3-large | 64.6 | 3072 | $0.13 | ❌ |
| OpenAI | text-embedding-3-small | 62.3 | 1536 | $0.02 | ❌ |
| Amazon | Titan Text V2 | ~60 | 1024 | ~$0.02 | ❌ |
Local Embedding Models
Here's where things get exciting. The open-source embedding ecosystem has exploded. Several local models now rival or exceed OpenAI's quality — and they're completely free.
The Top Local Models
Qwen3-Embedding (Best Overall Open-Source)
Alibaba's Qwen team released Qwen3-Embedding in 2025, and it immediately shot to the top of the MTEB leaderboard with a score of 70.58 (multilingual). Available in 0.6B, 4B, and 8B parameter variants. The 0.6B version runs on modest hardware while still delivering excellent quality. Apache 2.0 license, up to 4096 dimensions (the 0.6B variant outputs 1024), outstanding multilingual support.[2]
nomic-embed-text
The community favorite for Ollama users. Nomic-embed-text v1.5 scores 59.4 on MTEB with 768 dimensions and 8192 token context. It outperforms OpenAI's text-embedding-ada-002 (the previous generation) and is competitive with text-embedding-3-small. The killer feature: it runs effortlessly via ollama pull nomic-embed-text — one command and you're generating embeddings locally. Apache 2.0 license. Supports Matryoshka dimensionality.[4]
BGE-M3 (BAAI General Embedding)
Beijing Academy of AI's BGE family is a powerhouse. BGE-M3 scores 63.0 on MTEB with 1024 dimensions, supports 100+ languages, and handles up to 8192 tokens. It supports three retrieval methods simultaneously: dense, sparse, and multi-vector. The smaller variants (bge-small-en-v1.5, bge-base-en-v1.5, bge-large-en-v1.5) are great for English-only use cases. MIT license.[5]
all-MiniLM-L6-v2
The classic choice from sentence-transformers. At only 22M parameters and 384 dimensions, it's blazingly fast — easily 10,000+ embeddings per second on a CPU. MTEB score of 56.3 makes it the weakest on this list, but for prototyping, small datasets, or when speed matters more than absolute quality, it's unbeatable. Apache 2.0 license.[6]
E5 (Microsoft)
Microsoft's E5 family (e5-small, e5-base, e5-large, e5-mistral-7b-instruct) spans from tiny to massive. The instruction-tuned e5-mistral-7b-instruct was one of the first LLM-based embedding models and still performs well. The smaller variants offer a good balance of quality and speed. MIT license for most variants.
GTE (Alibaba — General Text Embeddings)
Alibaba's GTE family includes gte-Qwen2-1.5B-instruct and the newer Qwen3-Embedding models. Excellent multilingual support across 100+ languages. The instruction-tuned variants accept custom prompts for different retrieval tasks. Apache 2.0 license.
mxbai-embed-large
From Mixedbread AI, this model delivers excellent quality in a relatively compact package. 1024 dimensions, strong performance on retrieval tasks. Available through Ollama as mxbai-embed-large. Apache 2.0 license. A great choice when you want quality that approaches the larger models without the memory footprint.
Instructor
Unique approach — you provide an instruction like "Represent the Science document for retrieval" with each embedding request, and the model tailors its embeddings to the task. This lets one model serve multiple embedding strategies (search queries vs documents, different domains). Slightly slower due to the instruction prefix, but more flexible.[7]
jina-embeddings-v2
Jina AI's v2 models support 8192 token context — matching cloud APIs. Available in small (33M params) and base (137M params) variants. Good for long documents where you don't want aggressive chunking. Apache 2.0 license. Jina also offers v3 with task-specific adapters.
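Whether you need an 8192-token model or can live with a small context window comes down to chunking. A naive chunker sketch, using word count as a crude stand-in for tokens (an assumption; real tokenizers count differently):

```python
def chunk_words(text, max_words=180, overlap=30):
    """Split text into overlapping word windows (word count as a rough token proxy)."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

sample = " ".join(f"word{i}" for i in range(400))
chunks = chunk_words(sample)  # 180-word windows, 30 words of overlap
print(len(chunks), len(chunks[0].split()))  # 3 180
```

The overlap keeps sentences that straddle a boundary retrievable from both chunks; production pipelines usually chunk on sentence or section boundaries instead of raw word counts.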
Local Model Comparison
| Model | MTEB Score | Dimensions | Parameters | Context | Speed (CPU) | RAM Needed |
|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | ~66 | 1024 | 0.6B | 32K | Medium | ~2 GB |
| BGE-M3 | 63.0 | 1024 | 568M | 8192 | Medium | ~2 GB |
| nomic-embed-text v1.5 | 59.4 | 768 | 137M | 8192 | Fast | ~600 MB |
| mxbai-embed-large | ~61 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| bge-large-en-v1.5 | ~60 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| e5-large-v2 | ~59 | 1024 | 335M | 512 | Medium | ~1.3 GB |
| all-MiniLM-L6-v2 | 56.3 | 384 | 22M | 256 | Very Fast | ~90 MB |
| jina-embeddings-v2-base | ~58 | 768 | 137M | 8192 | Fast | ~600 MB |
How to Run Embeddings Locally
Five battle-tested methods, from easiest to most flexible.
Method 1: Ollama (Easiest)
If you already have Ollama installed, this is a one-liner:[8]
# Pull the model (one time)
ollama pull nomic-embed-text
# Generate embeddings via API
curl http://localhost:11434/api/embeddings \
-d '{"model": "nomic-embed-text", "prompt": "The cat sat on the mat"}'
# Response: {"embedding": [0.023, -0.156, 0.891, ...]}
Ollama also supports mxbai-embed-large, all-minilm, and snowflake-arctic-embed. Switch models by changing the model name — the API is identical. Python usage:
import requests
def get_embedding(text, model="nomic-embed-text"):
response = requests.post("http://localhost:11434/api/embeddings",
json={"model": model, "prompt": text})
return response.json()["embedding"]
embedding = get_embedding("Hello world")
print(f"Dimensions: {len(embedding)}") # 768
Method 2: sentence-transformers (Most Popular Python Library)
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# Load model (downloads on first use, cached after)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Single text
embedding = model.encode("The cat sat on the mat")
print(embedding.shape) # (1024,)
# Batch processing (much faster)
texts = ["First document", "Second document", "Third document"]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(embeddings.shape) # (3, 1024)
sentence-transformers supports virtually every model on HuggingFace. It handles GPU acceleration automatically when CUDA is available.[6]
Method 3: HuggingFace Transformers (Full Control)
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
def get_embeddings(texts):
encoded = tokenizer(texts, padding=True, truncation=True,
max_length=512, return_tensors="pt")
with torch.no_grad():
outputs = model(**encoded)
# Use CLS token embedding, then normalize
embeddings = outputs.last_hidden_state[:, 0]
return F.normalize(embeddings, p=2, dim=1)
embeddings = get_embeddings(["Hello world", "Hi there"])
similarity = torch.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}") # ~0.85
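The CLS-token pooling above is what BGE models expect, but many others (all-MiniLM, E5, nomic-embed) use mean pooling instead: average the token embeddings, masking out padding. A framework-agnostic sketch with NumPy, with shapes mirroring last_hidden_state and attention_mask:

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions.
    last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)."""
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]  # (batch, seq, 1)
    summed = (np.asarray(last_hidden_state, dtype=float) * mask).sum(axis=1)
    counts = mask.sum(axis=1)  # number of real (non-padding) tokens per sequence
    return summed / np.clip(counts, 1e-9, None)

# 1 sequence, 3 tokens (last one is padding), hidden size 2
hidden = [[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]]
mask = [[1, 1, 0]]
print(mean_pool(hidden, mask))  # [[2. 3.]] (padding token excluded)
```

Using the wrong pooling strategy for a model silently degrades quality, so check the model card before swapping models in the Method 3 code.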
Method 4: llama.cpp (Quantized, Low Memory)
For running on machines with limited RAM. GGUF quantized models can cut memory usage by 50–75%:
# Build llama.cpp with embedding support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Download a GGUF embedding model
# (check HuggingFace for GGUF versions of your preferred model)
# Generate embeddings
./embedding -m nomic-embed-text-v1.5.Q4_K_M.gguf \
-p "The cat sat on the mat" --embd-normalize 2
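The 50–75% figure follows from bits per weight: fp16 stores 16 bits per parameter, while Q4_K_M averages roughly 4.5 (an approximation; GGUF block overhead varies by quant type). A back-of-envelope sketch:

```python
def model_memory_gb(params_millions, bits_per_weight):
    """Rough weight-memory estimate; ignores activations and runtime overhead."""
    return params_millions * 1e6 * bits_per_weight / 8 / 1e9

fp16 = model_memory_gb(137, 16)   # nomic-embed-text (137M params) at fp16
q4 = model_memory_gb(137, 4.5)    # Q4_K_M at ~4.5 bits/weight (assumed average)
print(f"fp16: {fp16:.2f} GB, Q4_K_M: {q4:.2f} GB")  # fp16: 0.27 GB, Q4_K_M: 0.08 GB
```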
Method 5: FastEmbed (Optimized for Speed)
Qdrant's FastEmbed library uses ONNX Runtime for optimized inference — often 2–3x faster than sentence-transformers on CPU:[9]
pip install fastembed
from fastembed import TextEmbedding
# Default: BAAI/bge-small-en-v1.5
model = TextEmbedding()
# Or specify a model
model = TextEmbedding("BAAI/bge-large-en-v1.5")
embeddings = list(model.embed(["Hello world", "Hi there"]))
print(len(embeddings[0])) # 1024
The Complete Comparison Matrix
| Feature | OpenAI | Cohere | Local (Ollama) | Local (sentence-transformers) |
|---|---|---|---|---|
| Cost | $0.02–$0.13/1M tokens | $0.10/1M (free tier) | Free forever | Free forever |
| Speed | Fast (network-bound) | Fast (network-bound) | Fast (GPU) / Medium (CPU) | Fast (GPU) / Medium (CPU) |
| Privacy | ⚠️ Data leaves machine | ⚠️ Data leaves machine | ✅ 100% local | ✅ 100% local |
| Quality (MTEB) | 62–65 | 65 | 56–63 (model-dependent) | 56–70 (model-dependent) |
| Setup | API key | API key | ollama pull | pip install |
| Offline | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| GPU Required | No (cloud) | No (cloud) | Optional | Optional |
| Batch Processing | Up to 2048 inputs | Up to 96 inputs | Sequential | Configurable batches |
| Model Choice | 2 models | 1 model | ~5 models | 100s of models |
Quality vs Cost Analysis
Let's address the elephant in the room: are local models actually good enough?
The MTEB Numbers
The MTEB (Massive Text Embedding Benchmark) evaluates models across 8 categories: classification, clustering, pair classification, reranking, retrieval, semantic similarity, summarization, and bitext mining.[10]
Here's what the data shows:
- Qwen3-Embedding-0.6B (free, local) scores ~66 — beating OpenAI text-embedding-3-large (64.6)
- BGE-M3 (free, local) scores 63.0 — nearly matching OpenAI's large model
- nomic-embed-text v1.5 (free, local) scores 59.4 — slightly below OpenAI's small model (62.3) but beating the previous ada-002
- The best local model (Qwen3-Embedding-8B at 70.58) actually exceeds every cloud API on this list, including Google's gemini-embedding-001 (68.3)
When Cloud Still Wins
- Zero setup — If you need embeddings in 5 minutes with no local infrastructure
- Massive scale — If you're embedding billions of documents and don't want to manage GPU clusters
- Cutting-edge quality, zero ops — Google's gemini-embedding-001 (68.3) outscores every local model you can run without serious GPU hardware; only the large Qwen3-Embedding variants beat it
- Specialized domains — Voyage AI's code and legal models are hard to match locally
When Local Wins
- Privacy — Medical records, legal documents, proprietary code. Data never leaves your machine.
- Cost at scale — Embedding 10M documents costs $130+ with OpenAI. Local: $0.
- Offline — Air-gapped environments, unreliable internet, edge devices.
- Latency — No network round-trip. With a GPU, local embeddings can be faster than API calls.
- Control — Fine-tune on your domain, quantize for your hardware, customize everything.
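The cost-at-scale point is easy to sanity-check. Assuming an average chunk of ~650 tokens (our assumption; your chunking will vary), the arithmetic for 10M documents at text-embedding-3-small pricing works out like this:

```python
docs = 10_000_000
avg_tokens_per_doc = 650          # assumed average chunk size
price_per_million_tokens = 0.02   # text-embedding-3-small

total_tokens = docs * avg_tokens_per_doc
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"${cost:,.2f}")  # $130.00 (text-embedding-3-large at $0.13 would be $845.00)
```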
Migration Guide: OpenAI → Local
Ready to make the switch? Here's the step-by-step process.
Step 1: Choose Your Local Model
🟢 Quick & Easy: nomic-embed-text via Ollama
Best for: Getting started fast, Ollama users, moderate quality needs
ollama pull nomic-embed-text — done in 60 seconds
🔵 Best Balance: BGE-M3 via sentence-transformers
Best for: Production use, multilingual, high quality
pip install sentence-transformers + model download (~2 GB)
🟣 Maximum Quality: Qwen3-Embedding
Best for: When you need to beat cloud APIs, have GPU available
0.6B version runs on CPU; 4B and 8B need GPU
Step 2: Re-embed Your Data
This is the critical step. You must re-embed all existing data with your new model. You cannot mix embeddings from different models — they exist in different vector spaces.
# Migration script example
from sentence_transformers import SentenceTransformer
import chromadb
# Load new model
model = SentenceTransformer("BAAI/bge-m3")
# Connect to your existing ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
old_collection = client.get_collection("my_documents")
# Create new collection with new embedding function
new_collection = client.create_collection(
name="my_documents_v2",
metadata={"embedding_model": "bge-m3"}
)
# Get all documents
results = old_collection.get(include=["documents", "metadatas"])
# Re-embed in batches
batch_size = 100
for i in range(0, len(results["documents"]), batch_size):
batch_docs = results["documents"][i:i+batch_size]
batch_meta = results["metadatas"][i:i+batch_size]
batch_ids = results["ids"][i:i+batch_size]
embeddings = model.encode(batch_docs).tolist()
new_collection.add(
documents=batch_docs,
embeddings=embeddings,
metadatas=batch_meta,
ids=batch_ids
)
    print(f"Migrated {min(i + batch_size, len(results['documents']))}/{len(results['documents'])}")
# Swap: delete old, rename new
client.delete_collection("my_documents")
# Note: ChromaDB doesn't support rename — create with original name instead
Step 3: Watch Out For
- Dimension differences — OpenAI text-embedding-3-small outputs 1536 dims; nomic-embed-text outputs 768. Your vector database index needs to match.
- Normalization — Some models output normalized vectors (unit length), others don't. BGE models expect you to normalize. Check your model's docs.
- Query prefixes — Some models (BGE, E5, Instructor) expect a prefix like "Represent this sentence:" for queries vs documents. Missing this drops quality significantly.
- Context length — OpenAI handles 8191 tokens. all-MiniLM only handles 256. Match your chunking strategy to your model.
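The query-prefix pitfall deserves a concrete example. The E5 family documents "query: " and "passage: " prefixes, and BGE v1.5 recommends an instruction prefix for queries only; the tiny wrapper below is our own sketch, not part of either library:

```python
# Documented prefixes for the E5 family; BGE's English query instruction shown for contrast.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def e5_format(text, kind="passage"):
    """Prepend the role prefix E5 models were trained with."""
    return E5_PREFIXES[kind] + text

print(e5_format("How do I fix a slow computer?", kind="query"))
# query: How do I fix a slow computer?
```

Apply the query prefix only at search time and the passage prefix at indexing time; mixing them up (or omitting them) is one of the most common causes of "local embeddings seem worse" reports.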
Our Setup
At ThinkSmart.Life, we currently use OpenAI text-embedding-3-small for our vector database powering the AI agent's knowledge base (built with ChromaDB — see our MCP + RAG + Vector DB guide).
Here's our cost reality:
- We embed ~50,000 chunks across research articles, documentation, and web scrapes
- At $0.02/1M tokens, that costs us roughly $2–5/month in embedding costs
- Not bank-breaking, but it adds up — and every text we embed goes through OpenAI's servers
Our migration plan: once we have the GPU rig set up, we're switching to BGE-M3 or Qwen3-Embedding-0.6B running locally via sentence-transformers. ChromaDB makes this straightforward — you can swap the embedding function with a few lines of code:
import chromadb
from chromadb.utils import embedding_functions
# Before: OpenAI
# ef = embedding_functions.OpenAIEmbeddingFunction(
# api_key="sk-...", model_name="text-embedding-3-small")
# After: Local via sentence-transformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="BAAI/bge-m3"
)
collection = client.get_or_create_collection(
name="knowledge_base",
embedding_function=ef
)
What the Community Says
The developer community has been rapidly adopting local embeddings. Here's what we found:
- Hacker News discussions consistently show developers moving to local models for privacy and cost reasons. Projects like VectorDB-CLI demonstrate the growing ecosystem of tools built around local embedding models.
- The MTEB leaderboard on HuggingFace has become the de facto standard for comparing models, with new submissions constantly pushing the state of the art.[10]
- Ollama's embedding support has made local embeddings accessible to developers who previously found the Python ML ecosystem intimidating. "One command to pull, one API call to embed" is a compelling story.
- FastEmbed by Qdrant is gaining traction as the "production-ready" alternative to sentence-transformers, with ONNX optimization delivering 2–3x speed improvements on CPU.
- The Matryoshka approach (variable dimensionality from a single embedding) is being adopted by both cloud and local models, letting you trade off quality vs storage/speed without retraining.
References
1. OpenAI Embeddings Documentation — platform.openai.com/docs/guides/embeddings
2. MTEB Leaderboard & Model Scores — Ailog RAG: Best Embedding Models 2025
3. Voyage AI Embedding Models — voyageai.com
4. Nomic Embed: A Truly Open Embedding Model — nomic.ai/blog; Tiger Data RAG Evaluation — tigerdata.com
5. BGE Models — huggingface.co/BAAI/bge-m3
6. sentence-transformers Documentation — sbert.net
7. Instructor Embedding — instructor-embedding.github.io
8. Ollama Embedding Models — ollama.com/blog/embedding-models
9. FastEmbed by Qdrant — github.com/qdrant/fastembed
10. MTEB: Massive Text Embedding Benchmark — huggingface.co/spaces/mteb/leaderboard
11. Modal: Top Embedding Models on the MTEB Leaderboard — modal.com/blog