
1. Introduction

On March 10, 2023, a Bulgarian software engineer named Georgi Gerganov pushed a commit to GitHub that would reshape the AI landscape. His project, llama.cpp, implemented Meta's newly released LLaMA model in pure C/C++ — no Python, no PyTorch, no GPU required. Within days, people were running large language models on MacBooks, Raspberry Pis, and Android phones.

Nearly three years later, llama.cpp has over 85,000 GitHub stars, powers virtually every major local AI tool, and on February 20, 2026, the team behind it officially joined Hugging Face — the largest open-source AI platform in the world.

This guide covers everything: the origin story, how it works under the hood, the GGUF file format, the Hugging Face acquisition, every major tool built on top of llama.cpp, and every serious competitor trying to do the same thing differently.

2. The Origin Story

Georgi Gerganov had been building tensor computation libraries long before LLMs went mainstream. In late September 2022, he started work on GGML (Georgi Gerganov Machine Learning) — a lightweight C library for tensor algebra, inspired by Fabrice Bellard's work on LibNC. The design priorities were strict memory management and multi-threading — no Python runtime, no framework overhead.

His first big project with GGML was whisper.cpp — a C/C++ implementation of OpenAI's Whisper speech recognition model. It proved the concept: you could run serious neural networks on consumer CPUs with good performance by being smart about memory and computation.

Then, on February 24, 2023, Meta released the LLaMA (Large Language Model Meta AI) weights. Gerganov realized he could port the inference code to his C tensor library. In about a weekend of intense coding, llama.cpp was born — released on March 10, 2023.

The initial results were electric, and the project exploded: within a month it had 19,000 GitHub stars. Justine Tunney (of Mozilla/Cosmopolitan fame) contributed major memory-mapping optimizations. The community added GPU backends, new quantization schemes, and support for dozens of model architectures beyond LLaMA.

🔑 Why It Mattered
Before llama.cpp, running an LLM locally required Python, PyTorch, CUDA, and an expensive NVIDIA GPU. Gerganov showed that with careful engineering — 4-bit quantization and optimized C code — you could run a 7B parameter model on hardware most developers already owned. He democratized local AI inference overnight.

Funding and ggml.ai

In June 2023, Gerganov founded ggml.ai to support full-time development. Nat Friedman (former GitHub CEO) and Daniel Gross provided pre-seed funding. The company hired full-time developers to maintain ggml and llama.cpp while keeping both projects fully open-source under the MIT license.

3. How It Works

The key insight behind llama.cpp is that LLM inference is memory-bandwidth bound, not compute-bound. During text generation (decoding), each new token requires streaming every model weight through a series of matrix-vector multiplications; the bottleneck is how fast you can load those weights from RAM into the CPU, not how fast you can multiply them.

Finbarr Timbers explained the math: an NVIDIA A100 has 2 TB/s memory bandwidth and 312 TFLOPS of compute. For inference at batch size 1, you need to load every parameter once per token. A 7B model at FP16 (14 GB) limits you to ~143 tokens/second on the A100 — well below the compute ceiling. The same logic applies to CPUs, just at lower bandwidth.

Quantization: The Core Trick

If inference speed is limited by memory bandwidth, the solution is to make the model smaller. Quantization reduces the precision of model weights from 16-bit floats to 4-bit or even 2-bit integers:

Precision | Bytes/Param | 7B Model Size | Quality Impact
FP16 | 2 | ~14 GB | Baseline
Q8_0 | 1 | ~7 GB | Negligible loss
Q4_K_M | ~0.56 | ~4.1 GB | Very minor loss
Q4_0 | 0.5 | ~3.8 GB | Slight quality drop
Q2_K | ~0.31 | ~2.7 GB | Noticeable degradation

The "K-quant" variants (Q4_K_M, Q5_K_S, etc.) use mixed precision — keeping more important layers at higher precision while aggressively quantizing less sensitive ones. This is one of llama.cpp's key innovations.
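The sizes in the table follow directly from bytes-per-parameter. A quick sketch of the arithmetic (the bytes/param figures are the approximate averages from the table; real GGUF files come out somewhat larger because quantized blocks also store scale factors, and some tensors stay at higher precision):

```python
# Approximate model footprint: parameter count * average bytes per parameter.
# K-quant figures are effective averages across mixed-precision layers.
QUANT_BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.56,
    "Q4_0": 0.5,
    "Q2_K": 0.31,
}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Rough on-disk size in GB for a model at a given quantization."""
    return params_billions * QUANT_BYTES_PER_PARAM[quant]

for quant in QUANT_BYTES_PER_PARAM:
    print(f"7B @ {quant}: ~{model_size_gb(7, quant):.1f} GB")
```

The same arithmetic explains why a 70B model that is hopeless at FP16 on consumer hardware (~140 GB) becomes plausible at Q4 (~35-40 GB).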

Hardware Backend Support

While llama.cpp started as CPU-only, it now supports a remarkable range of hardware acceleration: CUDA (NVIDIA), Metal (Apple Silicon), Vulkan, SYCL (Intel), HIP/ROCm (AMD), and optimized CPU paths using AVX on x86 and NEON on ARM, among others.

4. The GGUF Format

GGUF (GGML Universal File) is the binary file format that llama.cpp uses to store model weights and metadata. Introduced in August 2023, it replaced the earlier GGML format to provide better backwards compatibility as llama.cpp added support for dozens of model architectures.

What Makes GGUF Special

GGUF packs model weights and all metadata (architecture, tokenizer, quantization details) into a single memory-mappable file with an extensible key-value header, so new model types can be added without breaking older readers. It has become a de facto standard: Hugging Face hosts thousands of GGUF models, and when a new open-source model drops (Llama 3, Mistral, Qwen, DeepSeek), community members race to publish GGUF conversions within hours — often led by prolific quantizers like TheBloke and bartowski.

# Convert a Hugging Face model to GGUF
python3 convert_hf_to_gguf.py ./model-dir --outfile model-f16.gguf

# Quantize to 4-bit
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Run inference
./llama-cli -m model-q4_k_m.gguf -p "Explain llama.cpp in one paragraph"
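Every GGUF file opens with a small fixed little-endian preamble: the magic bytes GGUF, a format version, a tensor count, and a metadata key-value count. A minimal sketch of parsing that preamble, run here against a synthetic in-memory header rather than a real model file:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF preamble: magic, version, counts."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # '<' = little-endian; I = uint32 version; Q = uint64 counts
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Synthetic header standing in for a real model file
header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(header))
# {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```

The key-value metadata that follows this preamble is what makes the format self-describing: a reader can discover the architecture and tokenizer without any sidecar files.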

5. The Hugging Face Acquisition

On February 20, 2026, Georgi Gerganov announced that ggml.ai was joining Hugging Face. The announcement was posted simultaneously on the llama.cpp GitHub discussions and the Hugging Face blog.

What Changes

Almost nothing changes operationally. According to the announcement, ggml and llama.cpp remain fully open-source under the MIT license, development continues on GitHub as before, and the team keeps its autonomy.

What It Means Strategically

The acquisition makes enormous sense for both sides: Hugging Face hosts thousands of GGUF models, and llama.cpp is the engine most local tools use to run them — the model hub and the inference engine now sit under one roof.

🔑 Simon Willison's Take
"It's hard to overstate the impact Georgi Gerganov has had on the local model space. Back in March 2023 his release of llama.cpp made it possible for anyone to run LLMs on their own hardware." He called the HF acquisition a natural fit — the model hub and the inference engine, together.

Community Reaction

The reaction was overwhelmingly positive. The GitHub discussion had hundreds of supportive comments. Reddit's r/LocalLLaMA community expressed cautious optimism, with the main concern being whether Hugging Face might eventually restrict the open-source nature — a concern the announcement explicitly addresses. The Adafruit blog summarized it: "The projects stay open. The community stays autonomous."

6. Key Use Cases

Local AI Agents

Running LLM-powered agents locally eliminates API costs and latency. Tools like Open WebUI paired with llama.cpp give you a fully self-hosted ChatGPT-like experience. For developers building agent systems, local inference means your agent can make thousands of calls per day without accumulating API bills.

Privacy-First Deployments

Healthcare, legal, and financial applications often cannot let data leave the premises. llama.cpp enables running competent language models entirely on-site: no cloud, no data egress, and a far simpler path to HIPAA/GDPR compliance.

Offline Assistants

Field workers, military applications, remote research stations, aircraft — anywhere without reliable internet. llama.cpp runs on ARM processors, meaning even mobile and embedded devices can host useful AI capabilities.

Edge Inference

IoT gateways, smart home hubs, and robotics. The December 2025 update added full Android and ChromeOS acceleration with native GUI bindings, enabling proper mobile app development beyond CLI tools.

Development and Prototyping

The most common use case: developers iterating on prompts, testing model behavior, and building applications without paying per-token API fees. The OpenAI-compatible server means you can develop against the same API format and switch between local and cloud inference freely.
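Because llama-server speaks the OpenAI chat-completions wire format, switching between local and cloud inference is essentially a base-URL change. A sketch of the request shape — the localhost port and model name below are placeholders, not fixed values:

```python
import json

# The same JSON body works against api.openai.com and a local llama-server;
# only the base URL (and whether a real API key is needed) differs.
LOCAL_BASE_URL = "http://localhost:8080/v1"  # hypothetical local endpoint

def chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = chat_request("local-model", "What is llama.cpp?")
url = f"{LOCAL_BASE_URL}/chat/completions"
print(url)
print(json.dumps(payload))
```

The practical upshot: any client library that lets you override the base URL (the official OpenAI SDKs do) can target a local llama-server without code changes.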

7. The Ecosystem — Tools Built on llama.cpp

llama.cpp is infrastructure. Most people interact with it through higher-level tools:

Tool | What It Is | Why Use It
Ollama | CLI + server for running local models; built on llama.cpp | Simplest way to get started: ollama run llama3.1 and you're chatting
LM Studio | Desktop GUI app for local LLMs; uses the llama.cpp backend | Beautiful UI, model browser, one-click downloads from Hugging Face
Jan | Open-source desktop AI assistant powered by llama.cpp | Privacy-focused, extensible, runs entirely offline
llama-cpp-python | Python bindings for llama.cpp with an OpenAI-compatible server | Drop llama.cpp into Python projects; pip install and go
llamafile | Single-file executable LLMs: model + runtime in one binary | Download one file and run it on Windows, Mac, or Linux; no install
Open WebUI | Self-hosted ChatGPT-like web interface | Connect to Ollama/llama.cpp for a full chat experience with RAG, search, etc.
text-generation-webui | Gradio-based UI for LLM inference | Supports GGUF models via llama.cpp alongside other backends
KoboldCpp | llama.cpp fork optimized for creative writing/roleplay | Built-in web UI, story mode, extensive sampling options
✅ The GGUF Effect
Because all these tools speak GGUF, you can download a model once and use it everywhere. A GGUF file from Hugging Face works in Ollama, LM Studio, Jan, llamafile, and raw llama.cpp interchangeably. No other format has this level of cross-tool compatibility.

8. Competitors & Alternatives

llama.cpp dominates local inference, but it's not the only game in town. Here's how the competition stacks up:

Framework | Language | Primary Target | Sweet Spot
llama.cpp | C/C++ | Everything (CPU, GPU, mobile) | Universal local inference
Ollama | Go + llama.cpp | Developer desktops | Easiest on-ramp to local AI
ExLlamaV2 | Python/CUDA | NVIDIA GPUs | Fastest NVIDIA inference
Apple MLX | C++/Python | Apple Silicon | Best perf on M-series Macs
MLC LLM | C++ (TVM) | Mobile & edge | Cross-platform mobile
Candle | Rust | Serverless/embedding | Lightweight Rust deployments
ONNX Runtime | C++ | Cross-platform | Microsoft/DirectML stack
vLLM | Python | Server/datacenter GPUs | High-throughput serving

Ollama

Ollama is built directly on llama.cpp but wraps it in a Docker-like experience. You run ollama pull llama3.1 and ollama run llama3.1. It manages model downloads, versioning, and serves an API. Think of Ollama as Docker for LLMs — llama.cpp is the containerd underneath. The trade-off: you lose some low-level control (custom quantization, specific backend flags) for simplicity.

ExLlamaV2

ExLlamaV2 is the speed king on NVIDIA GPUs. It uses hand-tuned CUDA kernels that outperform llama.cpp's CUDA backend by 20-50% on consumer GPUs like the RTX 3090/4090. If you have an NVIDIA GPU and care about maximum tokens/second, ExLlamaV2 wins. The downside: NVIDIA only, no CPU fallback, smaller community.

Apple MLX

MLX is Apple's machine learning framework optimized for Apple Silicon. A January 2026 paper showed vllm-mlx consistently exceeds llama.cpp throughput by 21% to 87% on Apple Silicon, thanks to zero-copy unified memory operations. A November 2025 comparative study found MLX achieves the highest sustained generation throughput on M-series chips. If you're exclusively on Mac, MLX is increasingly the better choice for raw performance.

MLC LLM

MLC LLM uses Apache TVM as its compiler backend, generating optimized code for each target platform. The same November 2025 study found MLC-LLM delivers consistently lower time-to-first-token (TTFT) for moderate prompt sizes and offers stronger out-of-the-box inference features. Its strength is true cross-platform compilation — one model definition compiles to iOS, Android, WebGPU, and desktop.

Candle (Hugging Face)

Candle is Hugging Face's Rust-based ML framework. It's lightweight and fast to compile — perfect for serverless functions or embedding inference in Rust applications. Interesting dynamic now that Hugging Face also owns llama.cpp: Candle and llama.cpp serve different use cases (Rust ecosystem vs. C/C++ universal inference).

ONNX Runtime + DirectML

Microsoft's ONNX Runtime with DirectML targets Windows machines with AMD, Intel, or NVIDIA GPUs. It's the foundation for Windows Copilot Runtime and on-device AI in Windows 11. Less popular in the open-source community but important for enterprise Windows deployments.

vLLM

vLLM is for a different use case entirely: high-throughput server inference. It uses PagedAttention to serve many concurrent users efficiently on datacenter GPUs. It's not designed for local single-user inference — think of it as nginx to llama.cpp's desktop app. That said, the vllm-mlx variant for Apple Silicon shows the boundaries are blurring.

9. Performance Benchmarks

Performance varies dramatically by hardware. Here are representative numbers from recent benchmarks:

llama.cpp on Different Hardware (Llama 3.1 8B, Q4_K_M)

Hardware | Prompt (tps) | Generation (tps) | Notes
RTX 4090 (CUDA) | ~4,000 | ~110 | Full GPU offload
M4 Max 48GB (Metal) | ~2,500 | ~55 | Unified memory, single chip
M2 Ultra 192GB (Metal) | ~3,200 | ~45 | Larger models fit in memory
DGX Spark GB10 (CUDA) | ~4,500 | ~30 | 128 GB unified, bandwidth-limited
Ryzen 9 7950X (CPU) | ~250 | ~18 | AVX-512, DDR5
Raspberry Pi 5 (CPU) | ~15 | ~3 | Usable for small models

Framework Comparison on Apple Silicon (M2 Ultra)

Framework | Throughput (relative) | TTFT | Notes
MLX / vllm-mlx | Best (21-87% faster) | Good | Native unified memory, zero-copy
MLC-LLM | Good | Best (lowest TTFT) | TVM-optimized kernels
llama.cpp (Metal) | Good | Good | Most compatible, widest model support
Ollama | ~Same as llama.cpp | Slightly higher | llama.cpp overhead + Go wrapper
PyTorch MPS | Slowest | Slowest | Not optimized for inference
⚠️ Benchmark Caveats
Performance numbers vary significantly based on model architecture, quantization level, context length, batch size, and driver versions. The comparison above uses data from the November 2025 arXiv paper (2511.05502) and the January 2026 vllm-mlx paper. Always benchmark on your specific hardware and workload.

10. Getting Started

Option 1: Ollama (Easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.1:8b

# Or use the API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "What is llama.cpp?"}]
}'

Option 2: llama.cpp Direct (Full Control)

# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
# (e.g., bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)
wget https://huggingface.co/.../Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run the server (OpenAI-compatible)
./build/bin/llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 8080 -ngl 99  # -ngl = number of GPU layers

Option 3: LM Studio (GUI)

Download from lmstudio.ai. Browse models, click download, click chat. No terminal required. Also exposes a local API server.

11. Who Should Care About llama.cpp

🎯 Primary Audience

  • AI application developers who want to avoid API lock-in and per-token costs
  • Privacy-focused teams deploying LLMs on-premise (healthcare, legal, finance)
  • Edge AI engineers building for IoT, mobile, or disconnected environments
  • Agent builders who need high-volume local inference for autonomous systems
  • Open-source contributors — 1,500+ contributors and growing

🤔 Might Not Need It If...

  • You only use GPT-4o/Claude via API and budget isn't a concern
  • You need state-of-the-art quality on complex reasoning (cloud models still lead)
  • You're doing large-scale training, not inference
  • You're exclusively on Apple Silicon and MLX meets your needs

12. The Verdict

llama.cpp is one of the most consequential open-source projects in the AI era. A single developer's weekend project became the foundational infrastructure for local AI inference worldwide. Three years later, it supports 20+ hardware backends, runs on everything from phones to servers, and powers virtually every popular local LLM tool.

The Hugging Face acquisition is a vote of confidence — and a smart consolidation. The model hub and the inference engine, under one roof, with a commitment to keeping everything open-source. For developers, this means better tooling, smoother model downloads, and a more integrated experience.

The competitive landscape is heating up. Apple MLX is faster on Apple Silicon. ExLlamaV2 is faster on NVIDIA. MLC LLM compiles tighter for mobile. But none of them match llama.cpp's breadth: CPU + GPU + mobile + every architecture + every quantization + the largest GGUF model ecosystem. That universality is its moat.

✅ Bottom Line
If you're building anything that involves running language models locally — whether it's a chatbot, an agent, an offline assistant, or a privacy-first enterprise deployment — llama.cpp is the foundation you're building on, directly or indirectly. Understanding it isn't optional anymore. It's infrastructure.

References

  1. Georgi Gerganov, "llama.cpp — LLM inference in C/C++," github.com/ggml-org/llama.cpp, March 2023.
  2. Wikipedia, "Llama.cpp," en.wikipedia.org.
  3. Hugging Face, "GGML and llama.cpp join HF to ensure the long-term progress of Local AI," huggingface.co/blog, February 20, 2026.
  4. Simon Willison, "ggml.ai joins Hugging Face," simonwillison.net, February 20, 2026.
  5. Finbarr Timbers, "How is LLaMa.cpp possible?" finbarr.ca, March 2023.
  6. Justine Tunney, "Edge AI Just Got Faster," justine.lol, April 2023.
  7. SitePoint, "GGML Joins Hugging Face: What This Means for Local Model Optimization," sitepoint.com, February 2026.
  8. WinBuzzer, "Open-Source llama.cpp Finds Long-Term Home at Hugging Face," winbuzzer.com, February 2026.
  9. arXiv:2511.05502, "Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS," arxiv.org, November 2025.
  10. arXiv:2601.19139, "Native LLM and MLLM Inference at Scale on Apple Silicon," arxiv.org, January 2026.
  11. Hackaday, "Why LLaMa Is A Big Deal," hackaday.com, March 2023.
  12. PyImageSearch, "llama.cpp: The Ultimate Guide to Efficient LLM Inference," pyimagesearch.com, August 2024.
  13. Hacker News, "Llama.cpp 30B runs with only 6GB of RAM now," github.com, 1,311 points.
  14. Hacker News, "How Is LLaMa.cpp Possible?" finbarr.ca, 685 points.


This article was researched and written by Karibe (AI research agent) for Michel Lacle as part of ThinkSmart.Life's research initiative. Published February 23, 2026.
