1. Why Quantization Matters
If you're building a local AI rig — maybe you just picked up an RTX 3090 — you've probably noticed a frustrating gap between the models you want to run and the models that actually fit on your GPU. The cutting-edge open models like Llama 3.1 70B, DeepSeek V3, or Mixtral 8x22B are staggeringly large. Even at FP16 half precision, a 70-billion-parameter model needs 140 GB of memory just to load. Your RTX 3090 has 24 GB.
Quantization is the bridge. It's the single most important technique that makes local AI practical on consumer hardware. Without it, you'd need $15,000+ worth of enterprise GPUs. With it, you can run powerful models on a single graphics card you bought for under $800.
This guide starts from absolute zero — no ML background required — and goes deep enough that you'll understand exactly what's happening to those numbers and why it works so well.
2. What Are Model Weights?
An AI language model is, at its core, a gigantic spreadsheet of numbers. These numbers are called weights (or parameters), and they encode everything the model learned during training — its vocabulary, its understanding of grammar, its knowledge of history, its ability to write code.
When someone says "Llama 3 70B," the "70B" means 70 billion weights. Each weight is a decimal number like 0.00347 or -1.28456. During inference (when the model generates text), these numbers are multiplied together in enormous matrix operations — billions of multiply-and-add operations for every single token the model produces.
How Weights Are Stored
Computers represent decimal numbers using floating-point formats. The standard formats are:
- FP32 (32-bit float) — 4 bytes per weight. The "full precision" standard. Extremely accurate, but huge.
- FP16 (16-bit float) — 2 bytes per weight. Half precision. What most models are distributed in today.
- BF16 (Brain Float 16) — 2 bytes per weight. Same size as FP16 but with a wider range. Google invented it specifically for ML training.
Think of it like image formats. FP32 is like a RAW photo from a DSLR — maximum quality, massive file. FP16 is like a high-quality TIFF — half the size, virtually identical to the eye. These are the "full precision" formats that models are trained in.
3. The Memory Problem
Here's the math that makes quantization essential:
| Model | Parameters | FP32 Size | FP16 Size | RTX 3090 (24 GB) |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion | 32 GB | 16 GB | ✅ Fits in FP16 |
| Mistral 7B | 7.3 billion | 29 GB | 14.6 GB | ✅ Fits in FP16 |
| Llama 3.1 70B | 70 billion | 280 GB | 140 GB | ❌ 5.8× too large |
| Mixtral 8x22B | 141 billion | 564 GB | 282 GB | ❌ 11.8× too large |
| DeepSeek V3 | 671 billion | 2,684 GB | 1,342 GB | ❌ 55.9× too large |
The formula is simple: bytes = parameters × bytes_per_weight. At FP16 (2 bytes each), a 70B model is 70 × 2 = 140 GB. Your GPU has 24 GB. The model literally doesn't fit.
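This back-of-envelope math is easy to script. A minimal Python sketch (the function name is ours, not from any library):

```python
# Rule of thumb from the text: bytes = parameters × bytes_per_weight.
# Parameter counts below match the table above.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-storage size in GB (ignores KV cache and overhead)."""
    return params_billions * bits_per_weight / 8  # billions of bytes -> GB

for name, params in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("DeepSeek V3", 671)]:
    fp16 = model_size_gb(params, 16)
    print(f"{name}: FP16 = {fp16:.0f} GB, fits in 24 GB: {fp16 <= 24}")
```

The same function works for any quant level: pass `4.8` for Q4_K_M, `2.7` for Q2_K, and so on.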
4. What Quantization Does
Quantization reduces the numerical precision of each weight. Instead of storing every weight as a 16-bit float (2 bytes), you store it as an 8-bit integer (1 byte) or even a 4-bit integer (0.5 bytes). The model gets smaller proportionally.
The JPEG analogy: It's like compressing a photo from RAW to JPEG. You lose some data — but the image still looks great and takes a fraction of the space. A good JPEG at 90% quality is nearly indistinguishable from the RAW file, but it's 10× smaller. Quantization works the same way for AI models.
| Format | Bits per Weight | 70B Model Size | Fits on 24 GB? | Quality Loss |
|---|---|---|---|---|
| FP16 | 16 | 140 GB | ❌ No | None (baseline) |
| INT8 / Q8_0 | 8 | 70 GB | ❌ No | Virtually none |
| Q5_K_M | ~5.5 | ~48 GB | ❌ No | Very low |
| Q4_K_M | ~4.8 | ~42 GB | ⚠️ With offloading | Low |
| INT4 / Q4_0 | 4 | ~35 GB | ⚠️ With offloading | Moderate |
| Q2_K | ~2.7 | ~24 GB | ✅ Barely | Significant |
The magic is that modern quantization techniques lose almost nothing. The flagship recommendation for llama.cpp — Q4_K_M — stores weights in about 4.8 bits per weight and adds only +0.0535 perplexity compared to FP16 on a 7B model. That's roughly a 70% size reduction (13.0 GB → 3.8 GB) with a quality loss so small you can't detect it in conversation.
5. Number Formats Explained
To understand quantization, you need to know how computers represent numbers. Here's every format you'll encounter:
Floating-Point Formats (Training & Full Precision)
| Format | Bits | Sign | Exponent | Mantissa | Range | Use Case |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ±3.4 × 10³⁸ | Training, full precision inference |
| FP16 | 16 | 1 | 5 | 10 | ±65,504 | Model distribution, GPU inference |
| BF16 | 16 | 1 | 8 | 7 | ±3.4 × 10³⁸ | Training (same range as FP32, less precision) |
FP32 is the gold standard — 23 bits of mantissa (precision) and 8 bits of exponent (range). It can represent incredibly precise values across a huge range. FP16 cuts both in half: less precision and a much smaller range (max value ~65,000). BF16 is a clever compromise — it keeps FP32's exponent (same range) but reduces precision, making it ideal for training where range matters more than precision.
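BF16's relationship to FP32 is easy to see in code: a BF16 value is literally an FP32 value with the low 16 bits dropped, keeping 1 sign + 8 exponent + 7 mantissa bits. A sketch using plain truncation (real converters round to nearest; function names are ours):

```python
import struct

def float_to_bits(x: float) -> int:
    """FP32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 by keeping the top 16 bits
    (1 sign + 8 exponent + 7 mantissa) and zeroing the rest."""
    bits = float_to_bits(x) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 0.00347
print(to_bf16(x))       # close to x, but only ~2-3 significant digits survive
print(to_bf16(3.0e38))  # huge values stay representable: FP32's exponent range
```

Doing the same trick with FP16 is impossible by truncation alone, because FP16 has a different exponent width — which is exactly why BF16 is so convenient for hardware.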
Integer Formats (Quantized)
| Format | Bits | Values | Typical Use |
|---|---|---|---|
| INT8 | 8 | 256 distinct levels | High-quality quantization, almost lossless |
| INT4 | 4 | 16 distinct levels | The sweet spot for local AI |
| INT2 | 2 | 4 distinct levels | Extreme compression, significant quality loss |
| NF4 | 4 | 16 levels (non-uniform) | BitsAndBytes — levels match normal distribution |
The key difference: floating-point formats have variable precision (more precise near zero, less precise for large values), while integer formats have uniform steps. Going from FP16's 65,536 possible values to INT4's 16 values sounds catastrophic — but it works because quantization algorithms are smart about how they map values.
How Quantization Mapping Works
The simplest quantization is absmax quantization: find the maximum absolute value in a group of weights, divide all weights by that value to normalize them to [-1, 1], then multiply by 127 (for INT8) to map them to integers. To use the weights, you reverse the process (dequantize). The error comes from rounding — multiple float values map to the same integer.
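Here is absmax INT8 quantization in a few lines of plain Python (toy weights; real implementations vectorize this over tensors):

```python
def quantize_absmax_int8(weights):
    """Absmax quantization as described above: divide by the max absolute
    value, then scale to [-127, 127] and round to integers."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(qweights, scale):
    """Reverse the mapping; the rounding error is baked in."""
    return [q * scale for q in qweights]

w = [0.00347, -1.28456, 0.5, -0.9]
q, scale = quantize_absmax_int8(w)
w_hat = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(w, w_hat)]
print(q)                          # small integers in [-127, 127]
print(max(errors) <= scale / 2)   # error is at most half a quantization step
```

Note that the largest-magnitude weight (-1.28456) round-trips exactly — it defines the scale — while every other weight picks up a rounding error of at most `scale / 2`.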
Modern methods go further. Group quantization divides weights into small groups (32-128 weights each) and computes a separate scale factor per group, reducing error. K-quants (used in GGUF) use mixed precision — more important layers get higher precision (Q6_K) while less critical layers get lower precision (Q4_K). This is why Q4_K_M outperforms plain Q4_0 despite being similar size.
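The effect of per-group scales is easy to demonstrate: a single outlier weight stretches a global scale so far that small weights all collapse to zero. A toy sketch (group size 4 for readability; real GGUF K-quants use larger groups plus mixed precision, which this does not show):

```python
def absmax_q8(ws):
    """Quantize to INT8 with one shared scale, then dequantize."""
    scale = max(abs(w) for w in ws) / 127.0
    return [round(w / scale) * scale for w in ws]

def grouped_q8(ws, group_size):
    """Same, but with a separate scale per group of weights."""
    out = []
    for i in range(0, len(ws), group_size):
        out.extend(absmax_q8(ws[i:i + group_size]))
    return out

# Four tiny weights plus four large ones: the outliers dominate a global scale.
w = [0.01, -0.02, 0.015, 0.005, 8.0, -7.5, 6.0, -5.0]
err = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(err(w, absmax_q8(w)))      # global scale: the tiny weights all round to 0
print(err(w, grouped_q8(w, 4)))  # per-group scales: much smaller total error
```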
6. GGUF & llama.cpp Quantization Levels
llama.cpp is the most popular tool for running quantized models locally. It uses the GGUF file format (often expanded as "GPT-Generated Unified Format"), which stores the quantized weights along with metadata (tokenizer, architecture info, etc.) in a single file.
GGUF replaced the older GGML format in August 2023, adding better metadata support and forward compatibility. It's now the universal standard for local AI — supported by llama.cpp, Ollama, LM Studio, Jan, and more.
The Complete GGUF Quantization Table
Here's every GGUF quantization level for a 7B parameter model, from llama.cpp's official benchmarks:
| Quant Type | Size (7B) | PPL Increase | Quality | Recommendation |
|---|---|---|---|---|
| F32 | 26.00 GB | +0.0000 | Lossless | ❌ Not recommended — too large |
| F16 | 13.00 GB | ~+0.0000 | Virtually lossless | ❌ Not recommended — too large |
| Q8_0 | 6.70 GB | +0.0004 | Indistinguishable from F16 | Use if you have VRAM to spare |
| Q6_K | 5.15 GB | +0.0044 | Extremely low loss | Best quality-per-bit |
| Q5_K_M | 4.45 GB | +0.0142 | Very low loss | ⭐ Recommended |
| Q5_K_S | 4.33 GB | +0.0353 | Low loss | ⭐ Recommended |
| Q4_K_M | 3.80 GB | +0.0535 | Balanced | ⭐ Recommended — best all-rounder |
| Q4_K_S | 3.56 GB | +0.1149 | Noticeable loss | OK for tight VRAM |
| Q3_K_L | 3.35 GB | +0.1803 | Substantial loss | Only if needed |
| Q3_K_M | 3.06 GB | +0.2437 | High quality loss | ⚠️ Noticeable degradation |
| Q3_K_S | 2.75 GB | +0.5505 | Very high loss | ⚠️ Not recommended |
| Q2_K | 2.67 GB | +0.8698 | Extreme loss | ❌ Last resort |
PPL (perplexity) measures how "surprised" the model is by text. Lower is better. The increase is relative to unquantized F16. A +0.05 increase is barely detectable; +0.25 starts to show in reasoning quality; +0.50 and above is clearly degraded.
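Concretely, perplexity is just the exponentiated average negative log-likelihood per token:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood.
    token_logprobs: natural-log probabilities the model assigned to each
    actual next token in the evaluation text (toy values below)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that always gives the correct token probability 0.5 has PPL ~2:
# it is, on average, as "surprised" as a fair coin flip.
print(perplexity([math.log(0.5)] * 10))
```

So a +0.05 shift on a baseline of ~6 means the quantized model is almost exactly as good at predicting held-out text as the original.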
What the Letters Mean
- Q = Quantized, followed by the bit depth (Q4 = ~4 bits, Q5 = ~5 bits)
- K = K-quant (mixed precision — smarter allocation across layers)
- _S / _M / _L = Small / Medium / Large — controls how many layers get higher precision. _M uses Q6_K for half of attention and feed-forward weights; _S uses Q4_K everywhere; _L uses Q6_K for more layers
7. GPTQ vs AWQ vs EXL2
GGUF isn't the only quantization game in town. There are three major GPU-focused quantization methods, each with different tradeoffs:
| Method | Format | Target | Speed | Quality | Best For |
|---|---|---|---|---|---|
| GGUF | .gguf | CPU + GPU | Good (great with GPU offload) | Excellent (K-quants) | Local inference, mixed CPU/GPU, Ollama, LM Studio |
| GPTQ | .safetensors | GPU only | Fast (with Marlin kernel) | Good | GPU-only inference, vLLM, TGI servers |
| AWQ | .safetensors | GPU only | Very fast (with Marlin) | Very good | Production serving, best speed-quality ratio on GPU |
| EXL2 | .safetensors | GPU only | Fastest | Excellent | Maximum inference speed, ExLlamaV2 |
GPTQ (GPT Quantization)
Created by Frantar et al. (2022), GPTQ was one of the first practical post-training quantization methods for LLMs. It works by quantizing weights one layer at a time, using a small calibration dataset (typically 128 samples of text) to minimize the quantization error. The key innovation is using the inverse Hessian matrix to determine which weights are most important and should be quantized more carefully.
GPTQ models are GPU-only and require CUDA. They shine when paired with the Marlin kernel — a highly optimized CUDA kernel that can achieve 741 tok/s (compared to 276 tok/s without it, based on JarvisLabs benchmarks). Without Marlin, GPTQ can actually be slower than FP16.
AWQ (Activation-Aware Weight Quantization)
AWQ, developed by Lin et al. (2023) at MIT, takes a different approach. Instead of looking at the weights themselves, it analyzes the activations — the intermediate values during inference — to determine which weights matter most. Weights that produce large activations are kept at higher precision.
AWQ tends to retain slightly better quality than GPTQ at the same bit width, especially for instruction-following and chat tasks. Like GPTQ, it benefits enormously from optimized kernels (Marlin) and is primarily GPU-focused.
EXL2 (ExLlamaV2)
ExLlamaV2, created by turboderp, is designed purely for maximum GPU inference speed. Its EXL2 format supports variable bit-width quantization — you can quantize to any target bits-per-weight (e.g., 3.5, 4.25, 5.0) and the algorithm automatically distributes precision across layers based on their sensitivity.
EXL2 is often the fastest option for single-GPU inference and produces excellent quality, but it has a smaller ecosystem (primarily used through the ExLlamaV2 library or TabbyAPI).
When to Use Which
🏠 Use GGUF When...
- You're using Ollama, LM Studio, or Jan
- You want to split between CPU and GPU (partial offload)
- You're running on Mac (Apple Silicon)
- You want the simplest setup
- You need to run on CPU-only systems
🖥️ Use GPTQ/AWQ/EXL2 When...
- The entire model fits in GPU VRAM
- You want maximum inference speed
- You're running a production API server (vLLM, TGI)
- You have NVIDIA GPU with CUDA support
- You're willing to use more specialized tools
8. BitsAndBytes & Emerging Methods
BitsAndBytes (bnb)
BitsAndBytes, created by Tim Dettmers, is the go-to library for on-the-fly quantization in the HuggingFace ecosystem. Unlike GPTQ/AWQ/GGUF which pre-quantize the model into a file, BitsAndBytes quantizes at load time — you pass a config flag and the model loads in 4-bit or 8-bit directly.
It uses NF4 (4-bit NormalFloat), a clever data type where the 16 quantization levels are spaced according to a normal distribution. Since neural network weights follow a roughly normal distribution, this means common values (near zero) get higher precision and rare extreme values get less. BitsAndBytes also supports double quantization — quantizing the quantization constants themselves, saving additional memory.
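The level-placement idea can be sketched with the standard library. These are evenly spaced quantiles of a standard normal — not the exact NF4 constants from the QLoRA paper, which are constructed slightly differently — but they show why the levels cluster near zero:

```python
from statistics import NormalDist

# 16 levels at evenly spaced quantiles of a standard normal,
# rescaled so the endpoints land exactly on [-1, 1].
nd = NormalDist()
quantiles = [(i + 0.5) / 16 for i in range(16)]
levels = [nd.inv_cdf(q) for q in quantiles]
levels = [v / max(abs(v) for v in levels) for v in levels]

def quantize_nf(x, levels):
    """Snap x (assumed pre-scaled into [-1, 1]) to the nearest level.
    A real implementation stores only the 4-bit index of that level."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))

print([round(v, 3) for v in levels])  # dense near 0, sparse at the tails
```

Because most weights sit near zero, those weights land in the densely packed central levels and keep far more precision than uniform INT4 would give them.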
Pros: Zero-effort quantization (just add a config flag), no calibration data needed, works with any HuggingFace model, supports QLoRA for fine-tuning. Cons: Slower than pre-quantized formats (GPTQ/AWQ with Marlin kernels), GPU-only.
EETQ (Easy and Efficient Transformer Quantization)
EETQ provides INT8 weight-only quantization with no calibration required. It's simpler than GPTQ/AWQ but limited to 8-bit. Useful when you want a quick quality-preserving size reduction without the complexity of calibrated quantization.
HQQ (Half-Quadratic Quantization)
HQQ is a newer method that achieves calibration-free quantization by optimizing weights directly. It's faster to quantize than GPTQ/AWQ (no calibration dataset needed) and achieves competitive quality at 4-bit. It's gaining traction as an alternative when you need to quantize models quickly.
Importance Matrix (imatrix) Quantization
A recent innovation in the GGUF ecosystem. An importance matrix is computed by running calibration data through the model to identify which weights contribute most to the output. This matrix is then used during quantization to allocate more precision to important weights. GGUF models quantized with imatrix (look for "imatrix" in the filename on HuggingFace) tend to be noticeably better at lower bit widths (Q3, Q2) compared to standard quantization.
9. Real Benchmarks: Quality vs. Speed vs. VRAM
Here's what actually happens when you quantize a model. These numbers are based on community benchmarks from the LocalLLaMA community, llama.cpp's official tests, and JarvisLabs' vLLM benchmarks:
Quality (Perplexity) — 7B LLaMA-class model
| Quant | BPW | Size | PPL | PPL vs F16 | Quality Verdict |
|---|---|---|---|---|---|
| F16 | 16.0 | 13.0 GB | baseline | +0.000 | Reference |
| Q8_0 | 8.0 | 6.7 GB | ≈ baseline | +0.0004 | Identical in practice |
| Q6_K | 6.6 | 5.2 GB | ≈ baseline | +0.004 | Virtually identical |
| Q5_K_M | 5.7 | 4.5 GB | slight ↑ | +0.014 | No detectable difference |
| Q4_K_M | 4.8 | 3.8 GB | slight ↑ | +0.054 | Negligible — the sweet spot |
| Q3_K_M | 3.9 | 3.1 GB | moderate ↑ | +0.244 | Noticeable in complex reasoning |
| Q2_K | 2.7 | 2.7 GB | large ↑ | +0.870 | Clearly degraded |
Speed — Tokens per Second (Single RTX 3090)
Lower-bit quants are generally faster because they require less memory bandwidth — which is the primary bottleneck during inference. Typical single-GPU generation speeds on an RTX 3090 for a 7B model:
- Q4_K_M: ~80-120 tok/s
- Q5_K_M: ~70-100 tok/s
- Q8_0: ~50-80 tok/s
- F16: ~40-60 tok/s
For larger models that require CPU offloading (partial GPU), speeds drop dramatically — a 70B Q4_K_M with half the layers on CPU might give 5-15 tok/s. This is why fitting the entire model in VRAM matters so much.
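A crude bandwidth model reproduces the shape of these numbers: every generated token reads (roughly) the entire weight file once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The 0.6 efficiency factor below is our assumption, not a benchmark:

```python
def est_tokens_per_sec(model_size_gb, bandwidth_gbs, efficiency=0.6):
    """Bandwidth-bound generation-speed estimate: each token streams all
    weights through the GPU once. `efficiency` is a fudge factor for
    kernel and scheduling overhead (an assumption, not a measurement)."""
    return bandwidth_gbs / model_size_gb * efficiency

# RTX 3090 memory bandwidth is ~936 GB/s (per NVIDIA's spec sheet).
for name, size_gb in [("Q4_K_M", 3.8), ("Q8_0", 6.7), ("F16", 13.0)]:
    print(f"{name}: ~{est_tokens_per_sec(size_gb, 936):.0f} tok/s ceiling")
```

This is also why CPU offloading is so punishing: the offloaded layers stream over PCIe and system RAM at a small fraction of 936 GB/s, and the slowest link sets the pace.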
Quality vs. Method (4-bit Comparison)
| Method | Format | Calibration? | Quality (MMLU) | Speed |
|---|---|---|---|---|
| GGUF Q4_K_M | .gguf | No (imatrix optional) | ~95-97% of F16 | Good |
| AWQ 4-bit | .safetensors | Yes | ~95% of F16 | Very fast (w/ Marlin) |
| GPTQ 4-bit | .safetensors | Yes | ~90-93% of F16 | Fast (w/ Marlin) |
| EXL2 4.0bpw | .safetensors | Yes | ~95-97% of F16 | Fastest |
| BnB NF4 | in-memory | No | ~93-95% of F16 | Moderate |
10. What Fits on a 24 GB RTX 3090?
This is the practical section. You've got 24 GB of VRAM. Here's exactly what you can run:
| Model | Quant | Size | Fits in 24 GB? | Speed | Quality |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q8_0 | 8.5 GB | ✅ Easily | ~60-80 tok/s | ★★★★★ |
| Llama 3.1 8B | Q4_K_M | 4.9 GB | ✅ Easily | ~90-120 tok/s | ★★★★½ |
| Mistral Nemo 12B | Q5_K_M | 8.7 GB | ✅ Easily | ~50-70 tok/s | ★★★★★ |
| Qwen 2.5 14B | Q4_K_M | 8.9 GB | ✅ Yes | ~40-60 tok/s | ★★★★½ |
| Codestral 22B | Q4_K_M | 13.2 GB | ✅ Yes | ~30-45 tok/s | ★★★★½ |
| Llama 3.3 70B | Q2_K | ~25 GB | ⚠️ Tight (needs offload) | ~5-10 tok/s | ★★★ |
| Llama 3.3 70B | Q4_K_M | ~42 GB | ❌ No (needs ~2× 3090) | — | — |
| DeepSeek-R1 (distill) 14B | Q4_K_M | ~8.9 GB | ✅ Yes | ~40-55 tok/s | ★★★★½ |
The Sweet Spots for 24 GB
🎯 Best Quality
- 7-8B models at Q8_0 or Q6_K — near-lossless quality
- 12-14B models at Q5_K_M — excellent quality, fits comfortably
- 22B models at Q4_K_M — great balance
🚀 Maximum Capability
- 32-34B models at Q3_K_M to Q4_K_S — pushing the limits
- 70B models at Q2_K with offloading — slow but possible
- MoE models (Mixtral 8x7B at ~Q3) — all experts must be loaded in memory, but only 2 of 8 run per token, so it generates at roughly the speed of a 13B model
11. How to Choose Your Quantization
Here's a simple decision tree:
- Calculate model size at your target quant. Rule of thumb: size_GB ≈ parameters_B × bits_per_weight / 8. A 14B model at Q4_K_M (~4.8 bpw) is 14 × 4.8 / 8 ≈ 8.4 GB.
- Add 2-4 GB for KV cache and overhead. So that 8.4 GB model needs ~11-12 GB total VRAM.
- Does it fit? If yes, use the highest-quality quant that fits. If not, either pick a smaller model or a lower quant.
- Pick the right format:
- Using Ollama/LM Studio/Jan? → GGUF (Q4_K_M or Q5_K_M)
- Running vLLM/TGI API server? → AWQ or GPTQ
- Maximum speed, single GPU? → EXL2 via ExLlamaV2
- Fine-tuning with QLoRA? → BitsAndBytes NF4
- Always prefer a smaller model at higher quant over a larger model at lower quant. A 14B Q5_K_M will generally outperform a 70B Q2_K — the extreme compression destroys too much quality.
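The sizing steps of the decision tree fit in a few lines (the function name and the fixed 3 GB overhead — the middle of the 2-4 GB rule of thumb — are ours):

```python
def fits(params_b, bpw, vram_gb=24.0, overhead_gb=3.0):
    """Weight size plus KV-cache/overhead budget vs. available VRAM."""
    size = params_b * bpw / 8 + overhead_gb
    return round(size, 1), size <= vram_gb

print(fits(14, 4.8))   # 14B @ Q4_K_M -> (11.4, True)
print(fits(70, 4.8))   # 70B @ Q4_K_M -> (45.0, False)
```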
12. Getting Started
With Ollama (Easiest)
```sh
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a quantized model (Ollama handles GGUF automatically)
ollama run llama3.1:8b          # Default quant (Q4_K_M)
ollama run llama3.1:8b-q8_0     # Higher quality quant
ollama run qwen2.5:14b          # 14B model, Q4_K_M
ollama run deepseek-r1:14b      # DeepSeek R1 distill
```
With llama.cpp (More Control)
```sh
# Clone and build with CUDA support (recent llama.cpp versions build
# with CMake; the old Makefile has been removed)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Download a GGUF from HuggingFace (e.g., from TheBloke or bartowski)
# Then run:
./build/bin/llama-server -m model-Q4_K_M.gguf -ngl 99 -c 4096
# -ngl 99 = offload all layers to GPU
# -c 4096 = context length
```
With LM Studio (GUI)
Download LM Studio, search for any model, and it shows available GGUF quants with size estimates. Click download, click run. It automatically detects your GPU and offloads as many layers as possible.
Quantizing Your Own Model
```sh
# Using llama.cpp's quantize tool
./llama-quantize input-model-f16.gguf output-Q4_K_M.gguf Q4_K_M

# With importance matrix (better quality at low bits)
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-f16.gguf output-Q4_K_M.gguf Q4_K_M
```
References
- Frantar, E., et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," arXiv:2210.17323, 2022.
- Lin, J., et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," arXiv:2306.00978, 2023.
- Dettmers, T., et al., "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv:2305.14314, 2023.
- ggml-org, "llama.cpp Quantization Types," GitHub Discussion #2094.
- Maarten Grootendorst, "Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)," Substack, 2023.
- turboderp, "ExLlamaV2," GitHub.
- Tim Dettmers, "BitsAndBytes," GitHub.
- JarvisLabs, "The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices," JarvisLabs Docs, 2026.
- Ionio AI, "LLMs on CPU: The Power of Quantization with GGUF, AWQ, & GPTQ," ionio.ai.
- Hardware Corner, "Quantization for Local LLMs: How It Works and Which Formats Fit Your Setup," hardware-corner.net, 2025.
- LocalLLM.in, "The Complete Guide to LLM Quantization," localllm.in, 2025.
- matt-c1, "Llama 3 Quantization Comparison," GitHub.
- NVIDIA, "RTX 3090 Specifications," nvidia.com.
This article was written collaboratively by Michel (human) and Karibe (AI research agent) as part of ThinkSmart.Life's research initiative. Data reflects February 2026 benchmarks and may evolve as new quantization methods emerge.