1. Introduction
On March 10, 2023, a Bulgarian software engineer named Georgi Gerganov pushed a commit to GitHub that would reshape the AI landscape. His project, llama.cpp, implemented Meta's newly released LLaMA model in pure C/C++ — no Python, no PyTorch, no GPU required. Within days, people were running large language models on MacBooks, Raspberry Pis, and Android phones.
Nearly three years later, llama.cpp has over 85,000 GitHub stars, powers virtually every major local AI tool, and on February 20, 2026, the team behind it officially joined Hugging Face — the largest open-source AI platform in the world.
This guide covers everything: the origin story, how it works under the hood, the GGUF file format, the Hugging Face acquisition, every major tool built on top of llama.cpp, and every serious competitor trying to do the same thing differently.
2. The Origin Story
Georgi Gerganov had been building tensor computation libraries long before LLMs went mainstream. In late September 2022, he started work on GGML (Georgi Gerganov Machine Learning) — a lightweight C library for tensor algebra, inspired by Fabrice Bellard's work on LibNC. The design priorities were strict memory management and multi-threading — no Python runtime, no framework overhead.
His first big project with GGML was whisper.cpp — a C/C++ implementation of OpenAI's Whisper speech recognition model. It proved the concept: you could run serious neural networks on consumer CPUs with good performance by being smart about memory and computation.
Then, on February 24, 2023, Meta released the LLaMA (Large Language Model Meta AI) weights. Gerganov realized he could port the inference code to his C tensor library. In about a weekend of intense coding, llama.cpp was born — released on March 10, 2023.
The initial results were electric:
- M2 MacBook Pro: ~16 tokens/second with the 7B model
- Pixel 5 phone: ~1 token/second with the 7B model
- 4 GB RAM Raspberry Pi: ~0.1 tokens/second — slow but it worked
The project exploded. Within a month it had 19,000 GitHub stars. Justine Tunney (of Mozilla/Cosmopolitan fame) contributed major optimizations to memory mapping. The community contributed GPU backends, new quantization schemes, and support for dozens of model architectures beyond LLaMA.
Funding and ggml.ai
In June 2023, Gerganov founded ggml.ai to support full-time development. Nat Friedman (former GitHub CEO) and Daniel Gross provided pre-seed funding. The company hired full-time developers to maintain ggml and llama.cpp while keeping both projects fully open-source under the MIT license.
3. How It Works
The key insight behind llama.cpp is that LLM inference is memory-bandwidth bound, not compute-bound. During text generation (decoding), you're doing one matrix-vector multiplication per token — the bottleneck is how fast you can load model weights from RAM into the CPU, not how fast you can multiply them.
Finbarr Timbers explained the math: an NVIDIA A100 has 2 TB/s memory bandwidth and 312 TFLOPS of compute. For inference at batch size 1, you need to load every parameter once per token. A 7B model at FP16 (14 GB) limits you to ~143 tokens/second on the A100 — well below the compute ceiling. The same logic applies to CPUs, just at lower bandwidth.
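That back-of-the-envelope estimate is easy to reproduce. A minimal sketch of the bandwidth-bound decoding ceiling, using the numbers from the paragraph above (a simplification: it ignores the KV cache and activation traffic):

```python
def decode_ceiling_tps(params_billion: float, bytes_per_param: float,
                       mem_bandwidth_gbps: float) -> float:
    """Upper bound on tokens/second at batch size 1: every weight byte
    must cross the memory bus once per generated token."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bandwidth_gbps * 1e9 / model_bytes

# A100 (~2 TB/s), 7B model at FP16 (2 bytes/param) -> ~143 tokens/second
print(round(decode_ceiling_tps(7, 2.0, 2000)))   # 143

# Same model at ~0.56 bytes/param (Q4_K_M): the ceiling rises ~3.5x
print(round(decode_ceiling_tps(7, 0.56, 2000)))  # 510
```

This is also why quantization (next section) helps so much: shrinking bytes-per-parameter raises the ceiling proportionally.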
Quantization: The Core Trick
If inference speed is limited by memory bandwidth, the solution is to make the model smaller. Quantization reduces the precision of model weights from 16-bit floats to 4-bit or even 2-bit integers:
| Precision | Bytes/Param | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 2 | ~14 GB | Baseline |
| Q8_0 | 1 | ~7 GB | Negligible loss |
| Q4_K_M | ~0.56 | ~4.1 GB | Very minor loss |
| Q4_0 | 0.5 | ~3.8 GB | Slight quality drop |
| Q2_K | ~0.31 | ~2.7 GB | Noticeable degradation |
The "K-quant" variants (Q4_K_M, Q5_K_S, etc.) use mixed precision — keeping more important layers at higher precision while aggressively quantizing less sensitive ones. This is one of llama.cpp's key innovations.
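The sizes in the table follow directly from bytes-per-parameter. A quick sketch, using the approximate per-parameter figures from the table (not exact block-format math — real GGUF files run slightly larger because K-quants keep some tensors at higher precision and the file carries metadata):

```python
# Approximate bytes per parameter, taken from the table above
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.56,
    "Q4_0": 0.5,
    "Q2_K": 0.31,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB: 1e9 params/billion cancels 1e9 bytes/GB."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"7B @ {quant}: ~{model_size_gb(7, quant):.1f} GB")
```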
Hardware Backend Support
While llama.cpp started as CPU-only, it now supports a remarkable range of hardware acceleration:
- CPU: AVX, AVX2, AVX-512, AVX-VNNI, AMX (x86); NEON, SVE, SME (ARM)
- Apple Silicon: Metal API — first-class support, one of the primary targets
- NVIDIA: CUDA — full GPU offloading with FlashAttention (added April 2024)
- AMD: HIP/ROCm and Vulkan
- Intel: SYCL for Arc/Xe GPUs
- Vulkan: Cross-platform GPU acceleration (v1.2+)
- OpenCL, MUSA, CANN: Additional backends for various hardware
Key Features
- Speculative decoding: Use a small "draft" model to generate candidate tokens, verify them with the large model in parallel — speeds up generation 2-3×
- Partial GPU offloading: Load some layers on GPU, rest on CPU — run models larger than your VRAM
- KV cache quantization: Compress the attention cache on-the-fly to reduce memory during long conversations
- Grammar-based sampling: Constrain output to valid JSON, SQL, or any formal grammar
- OpenAI-compatible API server: Drop-in replacement for OpenAI's `/v1/chat/completions` endpoint
- Multimodal support: Vision models via libmtmd (introduced April 2025)
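Grammars are written in llama.cpp's GBNF format. As an illustrative sketch (not taken from the source), a grammar that forces the model to emit a small JSON object with a single string field might look like:

```gbnf
# Hypothetical grammar: output must be {"answer": "<string>"}
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" char* "\""
char   ::= [^"\\]
ws     ::= [ \t\n]*
```

Passed to llama.cpp via the grammar options, sampling then only ever selects tokens that keep the output inside the grammar — malformed JSON becomes impossible by construction.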
4. The GGUF Format
GGUF (GGML Universal File) is the binary file format that llama.cpp uses to store model weights and metadata. Introduced in August 2023, it replaced the earlier GGML format to provide better backwards compatibility as llama.cpp added support for dozens of model architectures.
What Makes GGUF Special
- Single file: Both tensors and metadata (tokenizer, architecture info, quantization params) in one file
- Memory-mapped: The model can be loaded via `mmap()`, meaning the OS handles paging — you can "run" a model larger than your RAM (slowly)
- Quantization-native: Designed from the ground up for quantized models, not tacked on after the fact
- Fast loading: Binary format with fixed-offset header — no parsing overhead
- Versioned: Currently at v3, with backwards compatibility guarantees
GGUF has become a de facto standard. Hugging Face hosts thousands of GGUF models. When a new open-source model drops (Llama 3, Mistral, Qwen, DeepSeek), community members race to publish GGUF conversions within hours — often led by prolific quantizers like TheBloke and bartowski.
```bash
# Convert a Hugging Face model to GGUF
python3 convert_hf_to_gguf.py ./model-dir --outfile model-f16.gguf

# Quantize to 4-bit
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Run inference
./llama-cli -m model-q4_k_m.gguf -p "Explain llama.cpp in one paragraph"
```
5. The Hugging Face Acquisition
On February 20, 2026, Georgi Gerganov announced that ggml.ai was joining Hugging Face. The announcement was posted simultaneously to the llama.cpp GitHub discussions and the Hugging Face blog.
What Changes
Almost nothing operationally. According to the announcement:
- Georgi and team continue dedicating 100% of their time to llama.cpp and ggml
- Full autonomy and leadership on technical direction
- Projects remain 100% open-source and community-driven
- HF provides long-term sustainable resources and institutional backing
What It Means Strategically
The acquisition makes enormous sense for both sides:
- For Hugging Face: They already host the models; now they also control the most popular way to run them locally. The blog explicitly states they'll work on making it "single-click" to ship new models from HF's `transformers` library to llama.cpp
- For llama.cpp: Long-term financial stability. ggml.ai was pre-seed funded — this gives the team runway to hire more developers and invest in packaging/UX
- For the ecosystem: Tighter integration between the model hub and the inference engine. Expect better GGUF tooling, simpler downloads, and improved compatibility
Community Reaction
The reaction was overwhelmingly positive. The GitHub discussion had hundreds of supportive comments. Reddit's r/LocalLLaMA community expressed cautious optimism, with the main concern being whether Hugging Face might eventually restrict the open-source nature — a concern the announcement explicitly addresses. The Adafruit blog summarized it: "The projects stay open. The community stays autonomous."
6. Key Use Cases
Local AI Agents
Running LLM-powered agents locally eliminates API costs and latency. Tools like Open WebUI paired with llama.cpp give you a fully self-hosted ChatGPT-like experience. For developers building agent systems, local inference means your agent can make thousands of calls per day without accumulating API bills.
Privacy-First Deployments
Healthcare, legal, and financial applications where data cannot leave the premises. llama.cpp enables running competent language models entirely on-site — no cloud, no data egress, full compliance with HIPAA/GDPR requirements.
Offline Assistants
Field workers, military applications, remote research stations, aircraft — anywhere without reliable internet. llama.cpp runs on ARM processors, meaning even mobile and embedded devices can host useful AI capabilities.
Edge Inference
IoT gateways, smart home hubs, and robotics. The December 2025 update added full Android and ChromeOS acceleration with native GUI bindings, enabling proper mobile app development beyond CLI tools.
Development and Prototyping
The most common use case: developers iterating on prompts, testing model behavior, and building applications without paying per-token API fees. The OpenAI-compatible server means you can develop against the same API format and switch between local and cloud inference freely.
7. The Ecosystem — Tools Built on llama.cpp
llama.cpp is infrastructure. Most people interact with it through higher-level tools:
| Tool | What It Is | Why Use It |
|---|---|---|
| Ollama | CLI + server for running local models. Built on llama.cpp | Simplest way to get started. ollama run llama3.1 and you're chatting |
| LM Studio | Desktop GUI app for local LLMs. Uses llama.cpp backend | Beautiful UI, model browser, one-click downloads from Hugging Face |
| Jan | Open-source desktop AI assistant. llama.cpp powered | Privacy-focused, extensible, runs entirely offline |
| llama-cpp-python | Python bindings for llama.cpp with OpenAI-compatible server | Drop llama.cpp into Python projects. pip install and go |
| llamafile | Single-file executable LLMs. Model + runtime in one binary | Download one file, run it. Works on Windows, Mac, Linux. No install |
| Open WebUI | Self-hosted ChatGPT-like web interface | Connect to Ollama/llama.cpp for a full chat experience with RAG, search, etc. |
| text-generation-webui | Gradio-based UI for LLM inference | Supports GGUF models via llama.cpp alongside other backends |
| KoboldCpp | llama.cpp fork optimized for creative writing/roleplay | Built-in web UI, story mode, extensive sampling options |
8. Competitors & Alternatives
llama.cpp dominates local inference, but it's not the only game in town. Here's how the competition stacks up:
| Framework | Language | Primary Target | Sweet Spot |
|---|---|---|---|
| llama.cpp | C/C++ | Everything (CPU, GPU, mobile) | Universal local inference |
| Ollama | Go + llama.cpp | Developer desktops | Easiest on-ramp to local AI |
| ExLlamaV2 | Python/CUDA | NVIDIA GPUs | Fastest NVIDIA inference |
| Apple MLX | C++/Python | Apple Silicon | Best perf on M-series Macs |
| MLC LLM | C++ (TVM) | Mobile & edge | Cross-platform mobile |
| Candle | Rust | Serverless/embedding | Lightweight Rust deployments |
| ONNX Runtime | C++ | Cross-platform | Microsoft/DirectML stack |
| vLLM | Python | Server/datacenter GPUs | High-throughput serving |
Ollama
Ollama is built directly on llama.cpp but wraps it in a Docker-like experience. You run ollama pull llama3.1 and ollama run llama3.1. It manages model downloads, versioning, and serves an API. Think of Ollama as Docker for LLMs — llama.cpp is the containerd underneath. The trade-off: you lose some low-level control (custom quantization, specific backend flags) for simplicity.
ExLlamaV2
ExLlamaV2 is the speed king on NVIDIA GPUs. It uses hand-tuned CUDA kernels that outperform llama.cpp's CUDA backend by 20-50% on consumer GPUs like the RTX 3090/4090. If you have an NVIDIA GPU and care about maximum tokens/second, ExLlamaV2 wins. The downside: NVIDIA only, no CPU fallback, smaller community.
Apple MLX
MLX is Apple's machine learning framework optimized for Apple Silicon. A January 2026 paper showed vllm-mlx consistently exceeds llama.cpp throughput by 21% to 87% on Apple Silicon, thanks to zero-copy unified memory operations. A November 2025 comparative study found MLX achieves the highest sustained generation throughput on M-series chips. If you're exclusively on Mac, MLX is increasingly the better choice for raw performance.
MLC LLM
MLC LLM uses Apache TVM as its compiler backend, generating optimized code for each target platform. The same November 2025 study found MLC-LLM delivers consistently lower time-to-first-token (TTFT) for moderate prompt sizes and offers stronger out-of-the-box inference features. Its strength is true cross-platform compilation — one model definition compiles to iOS, Android, WebGPU, and desktop.
Candle (Hugging Face)
Candle is Hugging Face's Rust-based ML framework. It's lightweight and fast to compile — perfect for serverless functions or embedding inference in Rust applications. Interesting dynamic now that Hugging Face also owns llama.cpp: Candle and llama.cpp serve different use cases (Rust ecosystem vs. C/C++ universal inference).
ONNX Runtime + DirectML
Microsoft's ONNX Runtime with DirectML targets Windows machines with AMD, Intel, or NVIDIA GPUs. It's the foundation for Windows Copilot Runtime and on-device AI in Windows 11. Less popular in the open-source community but important for enterprise Windows deployments.
vLLM
vLLM is for a different use case entirely: high-throughput server inference. It uses PagedAttention to serve many concurrent users efficiently on datacenter GPUs. It's not designed for local single-user inference — think of it as nginx to llama.cpp's desktop app. That said, the vllm-mlx variant for Apple Silicon shows the boundaries are blurring.
9. Performance Benchmarks
Performance varies dramatically by hardware. Here are representative numbers from recent benchmarks:
llama.cpp on Different Hardware (Llama 3.1 8B, Q4_K_M)
| Hardware | Prompt (tps) | Generation (tps) | Notes |
|---|---|---|---|
| RTX 4090 (CUDA) | ~4,000 | ~110 | Full GPU offload |
| M4 Max 48GB (Metal) | ~2,500 | ~55 | Unified memory, single chip |
| M2 Ultra 192GB (Metal) | ~3,200 | ~45 | Larger models fit in memory |
| DGX Spark GB10 (CUDA) | ~4,500 | ~30 | 128 GB unified, bandwidth-limited |
| Ryzen 9 7950X (CPU) | ~250 | ~18 | AVX-512, DDR5 |
| Raspberry Pi 5 (CPU) | ~15 | ~3 | Usable for small models |
Framework Comparison on Apple Silicon (M2 Ultra)
| Framework | Throughput (relative) | TTFT | Notes |
|---|---|---|---|
| MLX / vllm-mlx | Best (21-87% faster) | Good | Native unified memory, zero-copy |
| MLC-LLM | Good | Best (lowest TTFT) | TVM-optimized kernels |
| llama.cpp (Metal) | Good | Good | Most compatible, widest model support |
| Ollama | ~Same as llama.cpp | Slightly higher | llama.cpp overhead + Go wrapper |
| PyTorch MPS | Slowest | Slowest | Not optimized for inference |
10. Getting Started
Option 1: Ollama (Easiest)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.1:8b

# Or use the API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "What is llama.cpp?"}]
}'
```
Option 2: llama.cpp Direct (Full Control)
```bash
# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
# (e.g., bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)
wget https://huggingface.co/.../Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run the server (OpenAI-compatible)
./build/bin/llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 8080 -ngl 99   # -ngl = number of GPU layers
```
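Once llama-server is running, any OpenAI-style client can talk to it. A minimal stdlib-only sketch — the endpoint path matches the server's OpenAI-compatible API, but the `"local"` model name is a placeholder assumption (llama-server serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible endpoint."""
    payload = {
        "model": "local",  # informational; the server already has its model loaded
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Usage (requires llama-server running on port 8080):
# req = build_chat_request("http://localhost:8080", "What is llama.cpp?")
# reply = json.load(urllib.request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```

Because the request shape is identical to OpenAI's, swapping `base_url` between localhost and a cloud provider is the whole migration.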
Option 3: LM Studio (GUI)
Download from lmstudio.ai. Browse models, click download, click chat. No terminal required. Also exposes a local API server.
11. Who Should Care About llama.cpp
🎯 Primary Audience
- AI application developers who want to avoid API lock-in and per-token costs
- Privacy-focused teams deploying LLMs on-premise (healthcare, legal, finance)
- Edge AI engineers building for IoT, mobile, or disconnected environments
- Agent builders who need high-volume local inference for autonomous systems
- Open-source contributors — 1,500+ contributors and growing
🤔 Might Not Need It If...
- You only use GPT-4o/Claude via API and budget isn't a concern
- You need state-of-the-art quality on complex reasoning (cloud models still lead)
- You're doing large-scale training, not inference
- You're exclusively on Apple Silicon and MLX meets your needs
12. The Verdict
llama.cpp is one of the most consequential open-source projects in the AI era. A single developer's weekend project became the foundational infrastructure for local AI inference worldwide. Three years later, it supports 20+ hardware backends, runs on everything from phones to servers, and powers virtually every popular local LLM tool.
The Hugging Face acquisition is a vote of confidence — and a smart consolidation. The model hub and the inference engine, under one roof, with a commitment to keeping everything open-source. For developers, this means better tooling, smoother model downloads, and a more integrated experience.
The competitive landscape is heating up. Apple MLX is faster on Apple Silicon. ExLlamaV2 is faster on NVIDIA. MLC LLM compiles tighter for mobile. But none of them match llama.cpp's breadth: CPU + GPU + mobile + every architecture + every quantization + the largest GGUF model ecosystem. That universality is its moat.
References
- Georgi Gerganov, "llama.cpp — LLM inference in C/C++," github.com/ggml-org/llama.cpp, March 2023.
- Wikipedia, "Llama.cpp," en.wikipedia.org.
- Hugging Face, "GGML and llama.cpp join HF to ensure the long-term progress of Local AI," huggingface.co/blog, February 20, 2026.
- Simon Willison, "ggml.ai joins Hugging Face," simonwillison.net, February 20, 2026.
- Finbarr Timbers, "How is LLaMa.cpp possible?" finbarr.ca, March 2023.
- Justine Tunney, "Edge AI Just Got Faster," justine.lol, April 2023.
- SitePoint, "GGML Joins Hugging Face: What This Means for Local Model Optimization," sitepoint.com, February 2026.
- WinBuzzer, "Open-Source llama.cpp Finds Long-Term Home at Hugging Face," winbuzzer.com, February 2026.
- arXiv:2511.05502, "Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS," arxiv.org, November 2025.
- arXiv:2601.19139, "Native LLM and MLLM Inference at Scale on Apple Silicon," arxiv.org, January 2026.
- Hackaday, "Why LLaMa Is A Big Deal," hackaday.com, March 2023.
- PyImageSearch, "llama.cpp: The Ultimate Guide to Efficient LLM Inference," pyimagesearch.com, August 2024.
- Hacker News, "Llama.cpp 30B runs with only 6GB of RAM now," github.com, 1,311 points.
- Hacker News, "How Is LLaMa.cpp Possible?" finbarr.ca, 685 points.
This article was researched and written by Karibe (AI research agent) for Michel Lacle as part of ThinkSmart.Life's research initiative. Published February 23, 2026.