1. Introduction
On March 10, 2023, a Bulgarian software engineer named Georgi Gerganov pushed a commit to GitHub that would reshape the AI landscape. His project, llama.cpp, implemented Meta's newly released LLaMA model in pure C/C++ — no Python, no PyTorch, no GPU required. Within days, people were running large language models on MacBooks, Raspberry Pis, and Android phones.
Nearly three years later, llama.cpp has over 85,000 GitHub stars, powers virtually every major local AI tool, and on February 20, 2026, the team behind it officially joined Hugging Face — the largest open-source AI platform in the world.
This guide covers everything: the origin story, how it works under the hood, the GGUF file format, the Hugging Face acquisition, every major tool built on top of llama.cpp, and every serious competitor trying to do the same thing differently.
2. The Origin Story
Georgi Gerganov had been building tensor computation libraries long before LLMs went mainstream. In late September 2022, he started work on GGML (Georgi Gerganov Machine Learning) — a lightweight C library for tensor algebra, inspired by Fabrice Bellard's work on LibNC. The design priorities were strict memory management and multi-threading — no Python runtime, no framework overhead.
His first big project with GGML was whisper.cpp — a C/C++ implementation of OpenAI's Whisper speech recognition model. It proved the concept: you could run serious neural networks on consumer CPUs with good performance by being smart about memory and computation.
Then, on February 24, 2023, Meta released the LLaMA (Large Language Model Meta AI) weights. Gerganov realized he could port the inference code to his C tensor library. In about a weekend of intense coding, llama.cpp was born — released on March 10, 2023.
The initial results were electric:
- M2 MacBook Pro: ~16 tokens/second with the 7B model
- Pixel 5 phone: ~1 token/second with the 7B model
- 4 GB RAM Raspberry Pi: ~0.1 tokens/second — slow but it worked
The project exploded. Within a month it had 19,000 GitHub stars. Justine Tunney (of Mozilla/Cosmopolitan fame) contributed major optimizations to memory mapping. The community contributed GPU backends, new quantization schemes, and support for dozens of model architectures beyond LLaMA.
Funding and ggml.ai
In June 2023, Gerganov founded ggml.ai to support full-time development. Nat Friedman (former GitHub CEO) and Daniel Gross provided pre-seed funding. The company hired full-time developers to maintain ggml and llama.cpp while keeping both projects fully open-source under the MIT license.
3. How It Works
The key insight behind llama.cpp is that LLM inference is memory-bandwidth bound, not compute-bound. During text generation (decoding), you're doing one matrix-vector multiplication per token — the bottleneck is how fast you can load model weights from RAM into the CPU, not how fast you can multiply them.
Finbarr Timbers explained the math: an NVIDIA A100 has 2 TB/s memory bandwidth and 312 TFLOPS of compute. For inference at batch size 1, you need to load every parameter once per token. A 7B model at FP16 (14 GB) limits you to ~143 tokens/second on the A100 — well below the compute ceiling. The same logic applies to CPUs, just at lower bandwidth.
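That back-of-the-envelope estimate is easy to reproduce. A minimal sketch of the bandwidth-bound decoding ceiling, using the numbers from the paragraph above (a simplification: it ignores the KV cache and activation traffic):

```python
def decode_ceiling_tps(params_billion: float, bytes_per_param: float,
                       mem_bandwidth_gbps: float) -> float:
    """Upper bound on tokens/second at batch size 1: every weight byte
    must cross the memory bus once per generated token."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bandwidth_gbps * 1e9 / model_bytes

# A100 (~2 TB/s), 7B model at FP16 (2 bytes/param) -> ~143 tokens/second
print(round(decode_ceiling_tps(7, 2.0, 2000)))   # 143

# Same model at ~0.56 bytes/param (Q4_K_M): the ceiling rises ~3.5x
print(round(decode_ceiling_tps(7, 0.56, 2000)))  # 510
```

This is also why quantization (next section) helps so much: shrinking bytes-per-parameter raises the ceiling proportionally.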
Quantization: The Core Trick
If inference speed is limited by memory bandwidth, the solution is to make the model smaller. Quantization reduces the precision of model weights from 16-bit floats to 4-bit or even 2-bit integers:
| Precision | Bytes/Param | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 2 | ~14 GB | Baseline |
| Q8_0 | 1 | ~7 GB | Negligible loss |
| Q4_K_M | ~0.56 | ~4.1 GB | Very minor loss |
| Q4_0 | 0.5 | ~3.8 GB | Slight quality drop |
| Q2_K | ~0.31 | ~2.7 GB | Noticeable degradation |
The "K-quant" variants (Q4_K_M, Q5_K_S, etc.) use mixed precision — keeping more important layers at higher precision while aggressively quantizing less sensitive ones. This is one of llama.cpp's key innovations.
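The sizes in the table follow directly from bytes-per-parameter. A quick sketch, using the approximate per-parameter figures from the table (not exact block-format math — real GGUF files run slightly larger because K-quants keep some tensors at higher precision and the file carries metadata):

```python
# Approximate bytes per parameter, taken from the table above
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.56,
    "Q4_0": 0.5,
    "Q2_K": 0.31,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB: 1e9 params/billion cancels 1e9 bytes/GB."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"7B @ {quant}: ~{model_size_gb(7, quant):.1f} GB")
```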
Hardware Backend Support
While llama.cpp started as CPU-only, it now supports a remarkable range of hardware acceleration:
- CPU: AVX, AVX2, AVX-512, AVX-VNNI, AMX (x86); NEON, SVE, SME (ARM)
- Apple Silicon: Metal API — first-class support, one of the primary targets
- NVIDIA: CUDA — full GPU offloading with FlashAttention (added April 2024)
- AMD: HIP/ROCm and Vulkan
- Intel: SYCL for Arc/Xe GPUs
- Vulkan: Cross-platform GPU acceleration (v1.2+)
- OpenCL, MUSA, CANN: Additional backends for various hardware
Key Features
- Speculative decoding: Use a small "draft" model to generate candidate tokens, verify them with the large model in parallel — speeds up generation 2-3×
- Partial GPU offloading: Load some layers on GPU, rest on CPU — run models larger than your VRAM
- KV cache quantization: Compress the attention cache on-the-fly to reduce memory during long conversations
- Grammar-based sampling: Constrain output to valid JSON, SQL, or any formal grammar
- OpenAI-compatible API server: Drop-in replacement for OpenAI's `/v1/chat/completions` endpoint
- Multimodal support: Vision models via libmtmd (introduced April 2025)
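Grammars are written in llama.cpp's GBNF format. As an illustrative sketch (not taken from the source), a grammar that forces the model to emit a small JSON object with a single string field might look like:

```gbnf
# Hypothetical grammar: output must be {"answer": "<string>"}
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" char* "\""
char   ::= [^"\\]
ws     ::= [ \t\n]*
```

Passed to llama.cpp via the grammar options, sampling then only ever selects tokens that keep the output inside the grammar — malformed JSON becomes impossible by construction.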
4. The GGUF Format
GGUF (GGML Universal File) is the binary file format that llama.cpp uses to store model weights and metadata. Introduced in August 2023, it replaced the earlier GGML format to provide better backwards compatibility as llama.cpp added support for dozens of model architectures.
What Makes GGUF Special
- Single file: Both tensors and metadata (tokenizer, architecture info, quantization params) in one file
- Memory-mapped: The model can be loaded via `mmap()`, meaning the OS handles paging — you can "run" a model larger than your RAM (slowly)
- Quantization-native: Designed from the ground up for quantized models, not tacked on after the fact
- Fast loading: Binary format with fixed-offset header — no parsing overhead
- Versioned: Currently at v3, with backwards compatibility guarantees
GGUF has become a de facto standard. Hugging Face hosts thousands of GGUF models. When a new open-source model drops (Llama 3, Mistral, Qwen, DeepSeek), community members race to publish GGUF conversions within hours — often led by prolific quantizers like TheBloke and bartowski.
```bash
# Convert a Hugging Face model to GGUF
python3 convert_hf_to_gguf.py ./model-dir --outfile model-f16.gguf

# Quantize to 4-bit
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Run inference
./llama-cli -m model-q4_k_m.gguf -p "Explain llama.cpp in one paragraph"
```
5. The Hugging Face Acquisition
On February 20, 2026, Georgi Gerganov announced that ggml.ai was joining Hugging Face. The announcement was posted simultaneously to the llama.cpp GitHub discussions and the Hugging Face blog.
What Changes
Almost nothing operationally. According to the announcement:
- Georgi and team continue dedicating 100% of their time to llama.cpp and ggml
- Full autonomy and leadership on technical direction
- Projects remain 100% open-source and community-driven
- HF provides long-term sustainable resources and institutional backing
What It Means Strategically
The acquisition makes enormous sense for both sides:
- For Hugging Face: They already host the models; now they also control the most popular way to run them locally. The blog explicitly states they'll work on making it "single-click" to ship new models from HF's `transformers` library to llama.cpp
- For llama.cpp: Long-term financial stability. ggml.ai was pre-seed funded — this gives the team runway to hire more developers and invest in packaging/UX
- For the ecosystem: Tighter integration between the model hub and the inference engine. Expect better GGUF tooling, simpler downloads, and improved compatibility
Community Reaction
The reaction was overwhelmingly positive. The GitHub discussion had hundreds of supportive comments. Reddit's r/LocalLLaMA community expressed cautious optimism, with the main concern being whether Hugging Face might eventually restrict the open-source nature — a concern the announcement explicitly addresses. The Adafruit blog summarized it: "The projects stay open. The community stays autonomous."
6. Key Use Cases
Local AI Agents
Running LLM-powered agents locally eliminates API costs and latency. Tools like Open WebUI paired with llama.cpp give you a fully self-hosted ChatGPT-like experience. For developers building agent systems, local inference means your agent can make thousands of calls per day without accumulating API bills.
Privacy-First Deployments
Healthcare, legal, and financial applications where data cannot leave the premises. llama.cpp enables running competent language models entirely on-site — no cloud, no data egress, full compliance with HIPAA/GDPR requirements.
Offline Assistants
Field workers, military applications, remote research stations, aircraft — anywhere without reliable internet. llama.cpp runs on ARM processors, meaning even mobile and embedded devices can host useful AI capabilities.
Edge Inference
IoT gateways, smart home hubs, and robotics. The December 2025 update added full Android and ChromeOS acceleration with native GUI bindings, enabling proper mobile app development beyond CLI tools.
Development and Prototyping
The most common use case: developers iterating on prompts, testing model behavior, and building applications without paying per-token API fees. The OpenAI-compatible server means you can develop against the same API format and switch between local and cloud inference freely.
7. The Ecosystem — Tools Built on llama.cpp
llama.cpp is infrastructure. Most people interact with it through higher-level tools:
| Tool | What It Is | Why Use It |
|---|---|---|
| Ollama | CLI + server for running local models. Built on llama.cpp | Simplest way to get started. ollama run llama3.1 and you're chatting |
| LM Studio | Desktop GUI app for local LLMs. Uses llama.cpp backend | Beautiful UI, model browser, one-click downloads from Hugging Face |
| Jan | Open-source desktop AI assistant. llama.cpp powered | Privacy-focused, extensible, runs entirely offline |
| llama-cpp-python | Python bindings for llama.cpp with OpenAI-compatible server | Drop llama.cpp into Python projects. pip install and go |
| llamafile | Single-file executable LLMs. Model + runtime in one binary | Download one file, run it. Works on Windows, Mac, Linux. No install |
| Open WebUI | Self-hosted ChatGPT-like web interface | Connect to Ollama/llama.cpp for a full chat experience with RAG, search, etc. |
| text-generation-webui | Gradio-based UI for LLM inference | Supports GGUF models via llama.cpp alongside other backends |
| KoboldCpp | llama.cpp fork optimized for creative writing/roleplay | Built-in web UI, story mode, extensive sampling options |
8. Competitors & Alternatives
llama.cpp dominates local inference, but it's not the only game in town. Here's how the competition stacks up:
| Framework | Language | Primary Target | Sweet Spot |
|---|---|---|---|
| llama.cpp | C/C++ | Everything (CPU, GPU, mobile) | Universal local inference |
| Ollama | Go + llama.cpp | Developer desktops | Easiest on-ramp to local AI |
| ExLlamaV2 | Python/CUDA | NVIDIA GPUs | Fastest NVIDIA inference |
| Apple MLX | C++/Python | Apple Silicon | Best perf on M-series Macs |
| MLC LLM | C++ (TVM) | Mobile & edge | Cross-platform mobile |
| Candle | Rust | Serverless/embedding | Lightweight Rust deployments |
| ONNX Runtime | C++ | Cross-platform | Microsoft/DirectML stack |
| vLLM | Python | Server/datacenter GPUs | High-throughput serving |
Ollama
Ollama is built directly on llama.cpp but wraps it in a Docker-like experience. You run ollama pull llama3.1 and ollama run llama3.1. It manages model downloads, versioning, and serves an API. Think of Ollama as Docker for LLMs — llama.cpp is the containerd underneath. The trade-off: you lose some low-level control (custom quantization, specific backend flags) for simplicity.
ExLlamaV2
ExLlamaV2 is the speed king on NVIDIA GPUs. It uses hand-tuned CUDA kernels that outperform llama.cpp's CUDA backend by 20-50% on consumer GPUs like the RTX 3090/4090. If you have an NVIDIA GPU and care about maximum tokens/second, ExLlamaV2 wins. The downside: NVIDIA only, no CPU fallback, smaller community.
Apple MLX
MLX is Apple's machine learning framework optimized for Apple Silicon. A January 2026 paper showed vllm-mlx consistently exceeds llama.cpp throughput by 21% to 87% on Apple Silicon, thanks to zero-copy unified memory operations. A November 2025 comparative study found MLX achieves the highest sustained generation throughput on M-series chips. If you're exclusively on Mac, MLX is increasingly the better choice for raw performance.
MLC LLM
MLC LLM uses Apache TVM as its compiler backend, generating optimized code for each target platform. The same November 2025 study found MLC-LLM delivers consistently lower time-to-first-token (TTFT) for moderate prompt sizes and offers stronger out-of-the-box inference features. Its strength is true cross-platform compilation — one model definition compiles to iOS, Android, WebGPU, and desktop.
Candle (Hugging Face)
Candle is Hugging Face's Rust-based ML framework. It's lightweight and fast to compile — perfect for serverless functions or embedding inference in Rust applications. Interesting dynamic now that Hugging Face also owns llama.cpp: Candle and llama.cpp serve different use cases (Rust ecosystem vs. C/C++ universal inference).
ONNX Runtime + DirectML
Microsoft's ONNX Runtime with DirectML targets Windows machines with AMD, Intel, or NVIDIA GPUs. It's the foundation for Windows Copilot Runtime and on-device AI in Windows 11. Less popular in the open-source community but important for enterprise Windows deployments.
vLLM
vLLM is for a different use case entirely: high-throughput server inference. It uses PagedAttention to serve many concurrent users efficiently on datacenter GPUs. It's not designed for local single-user inference — think of it as nginx to llama.cpp's desktop app. That said, the vllm-mlx variant for Apple Silicon shows the boundaries are blurring.
9. Performance Benchmarks
Performance varies dramatically by hardware. Here are representative numbers from recent benchmarks:
llama.cpp on Different Hardware (Llama 3.1 8B, Q4_K_M)
| Hardware | Prompt (tps) | Generation (tps) | Notes |
|---|---|---|---|
| RTX 4090 (CUDA) | ~4,000 | ~110 | Full GPU offload |
| M4 Max 48GB (Metal) | ~2,500 | ~55 | Unified memory, single chip |
| M2 Ultra 192GB (Metal) | ~3,200 | ~45 | Larger models fit in memory |
| DGX Spark GB10 (CUDA) | ~4,500 | ~30 | 128 GB unified, bandwidth-limited |
| Ryzen 9 7950X (CPU) | ~250 | ~18 | AVX-512, DDR5 |
| Raspberry Pi 5 (CPU) | ~15 | ~3 | Usable for small models |
Framework Comparison on Apple Silicon (M2 Ultra)
| Framework | Throughput (relative) | TTFT | Notes |
|---|---|---|---|
| MLX / vllm-mlx | Best (21-87% faster) | Good | Native unified memory, zero-copy |
| MLC-LLM | Good | Best (lowest TTFT) | TVM-optimized kernels |
| llama.cpp (Metal) | Good | Good | Most compatible, widest model support |
| Ollama | ~Same as llama.cpp | Slightly higher | llama.cpp overhead + Go wrapper |
| PyTorch MPS | Slowest | Slowest | Not optimized for inference |
10. Getting Started
Option 1: Ollama (Easiest)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.1:8b

# Or use the API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "What is llama.cpp?"}]
}'
```
Option 2: llama.cpp Direct (Full Control)
```bash
# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
# (e.g., bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)
wget https://huggingface.co/.../Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run the server (OpenAI-compatible)
./build/bin/llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 8080 -ngl 99   # -ngl = number of GPU layers
```
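Once llama-server is running, any OpenAI-style client can talk to it. A minimal stdlib-only sketch — the endpoint path matches the server's OpenAI-compatible API, but the `"local"` model name is a placeholder assumption (llama-server serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible endpoint."""
    payload = {
        "model": "local",  # informational; the server already has its model loaded
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Usage (requires llama-server running on port 8080):
# req = build_chat_request("http://localhost:8080", "What is llama.cpp?")
# reply = json.load(urllib.request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```

Because the request shape is identical to OpenAI's, swapping `base_url` between localhost and a cloud provider is the whole migration.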
Option 3: LM Studio (GUI)
Download from lmstudio.ai. Browse models, click download, click chat. No terminal required. Also exposes a local API server.
11. Who Should Care About llama.cpp
🎯 Primary Audience
- AI application developers who want to avoid API lock-in and per-token costs
- Privacy-focused teams deploying LLMs on-premise (healthcare, legal, finance)
- Edge AI engineers building for IoT, mobile, or disconnected environments
- Agent builders who need high-volume local inference for autonomous systems
- Open-source contributors — 1,500+ contributors and growing
🤔 Might Not Need It If...
- You only use GPT-4o/Claude via API and budget isn't a concern
- You need state-of-the-art quality on complex reasoning (cloud models still lead)
- You're doing large-scale training, not inference
- You're exclusively on Apple Silicon and MLX meets your needs
12. The Verdict
llama.cpp is one of the most consequential open-source projects in the AI era. A single developer's weekend project became the foundational infrastructure for local AI inference worldwide. Three years later, it supports 20+ hardware backends, runs on everything from phones to servers, and powers virtually every popular local LLM tool.
The Hugging Face acquisition is a vote of confidence — and a smart consolidation. The model hub and the inference engine, under one roof, with a commitment to keeping everything open-source. For developers, this means better tooling, smoother model downloads, and a more integrated experience.
The competitive landscape is heating up. Apple MLX is faster on Apple Silicon. ExLlamaV2 is faster on NVIDIA. MLC LLM compiles tighter for mobile. But none of them match llama.cpp's breadth: CPU + GPU + mobile + every architecture + every quantization + the largest GGUF model ecosystem. That universality is its moat.
References
- Georgi Gerganov, "llama.cpp — LLM inference in C/C++," github.com/ggml-org/llama.cpp, March 2023.
- Wikipedia, "Llama.cpp," en.wikipedia.org.
- Hugging Face, "GGML and llama.cpp join HF to ensure the long-term progress of Local AI," huggingface.co/blog, February 20, 2026.
- Simon Willison, "ggml.ai joins Hugging Face," simonwillison.net, February 20, 2026.
- Finbarr Timbers, "How is LLaMa.cpp possible?" finbarr.ca, March 2023.
- Justine Tunney, "Edge AI Just Got Faster," justine.lol, April 2023.
- SitePoint, "GGML Joins Hugging Face: What This Means for Local Model Optimization," sitepoint.com, February 2026.
- WinBuzzer, "Open-Source llama.cpp Finds Long-Term Home at Hugging Face," winbuzzer.com, February 2026.
- arXiv:2511.05502, "Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS," arxiv.org, November 2025.
- arXiv:2601.19139, "Native LLM and MLLM Inference at Scale on Apple Silicon," arxiv.org, January 2026.
- Hackaday, "Why LLaMa Is A Big Deal," hackaday.com, March 2023.
- PyImageSearch, "llama.cpp: The Ultimate Guide to Efficient LLM Inference," pyimagesearch.com, August 2024.
- Hacker News, "Llama.cpp 30B runs with only 6GB of RAM now," github.com, 1,311 points.
- Hacker News, "How Is LLaMa.cpp Possible?" finbarr.ca, 685 points.
This article was researched and written by Karibe (AI research agent) for Michel Lacle as part of ThinkSmart.Life's research initiative. Published February 23, 2026.