
1. Why Quantization Matters

If you're building a local AI rig — maybe you just picked up an RTX 3090 — you've probably noticed a frustrating gap between the models you want to run and the models that actually fit on your GPU. The cutting-edge open models like Llama 3.1 70B, DeepSeek V3, or Mixtral 8x22B are staggeringly large. At full precision, a 70-billion-parameter model needs 140 GB of memory just to load. Your RTX 3090 has 24 GB.

Quantization is the bridge. It's the single most important technique that makes local AI practical on consumer hardware. Without it, you'd need $15,000+ worth of enterprise GPUs. With it, you can run powerful models on a single graphics card you bought for under $800.

This guide starts from absolute zero — no ML background required — and goes deep enough that you'll understand exactly what's happening to those numbers and why it works so well.

2. What Are Model Weights?

An AI language model is, at its core, a gigantic spreadsheet of numbers. These numbers are called weights (or parameters), and they encode everything the model learned during training — its vocabulary, its understanding of grammar, its knowledge of history, its ability to write code.

When someone says "Llama 3 70B," the "70B" means 70 billion weights. Each weight is a decimal number like 0.00347 or -1.28456. During inference (when the model generates text), these numbers are multiplied together in enormous matrix operations — billions of multiply-and-add operations for every single token the model produces.

How Weights Are Stored

Computers represent decimal numbers using floating-point formats. The two you'll meet most often are FP32 (32 bits, 4 bytes per weight) and FP16 (16 bits, 2 bytes per weight); Section 5 breaks down exactly how their bits are allocated.

Think of it like image formats. FP32 is like a RAW photo from a DSLR — maximum quality, massive file. FP16 is like a high-quality TIFF — half the size, virtually identical to the eye. These are the "full precision" formats that models are trained in.

3. The Memory Problem

Here's the math that makes quantization essential:

| Model | Parameters | FP32 Size | FP16 Size | RTX 3090 (24 GB) |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion | 32 GB | 16 GB | ✅ Fits in FP16 |
| Mistral 7B | 7.3 billion | 29 GB | 14.6 GB | ✅ Fits in FP16 |
| Llama 3.1 70B | 70 billion | 280 GB | 140 GB | ❌ 5.8× too large |
| Mixtral 8x22B | 141 billion | 564 GB | 282 GB | ❌ 11.8× too large |
| DeepSeek V3 | 671 billion | 2,684 GB | 1,342 GB | ❌ 55.9× too large |

The formula is simple: bytes = parameters × bytes_per_weight. At FP16 (2 bytes each), a 70B model is 70 × 2 = 140 GB. Your GPU has 24 GB. The model literally doesn't fit.
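This arithmetic is worth scripting once. A minimal sketch (the function name is our own):

```python
def model_size_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Weight storage only: excludes KV cache and runtime overhead."""
    return params_billion * bytes_per_weight

print(model_size_gb(70, 2))  # FP16 (2 bytes/weight): 140.0 GB
print(model_size_gb(8, 2))   # an 8B model in FP16: 16.0 GB
```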

⚠️ VRAM is the bottleneck Unlike system RAM, you can't just add more VRAM. The RTX 3090 has 24 GB soldered to the board. That's it. The only way to fit larger models is to make the model smaller — which is exactly what quantization does.

4. What Quantization Does

Quantization reduces the numerical precision of each weight. Instead of storing every weight as a 16-bit float (2 bytes), you store it as an 8-bit integer (1 byte) or even a 4-bit integer (0.5 bytes). The model gets smaller proportionally.

The JPEG analogy: It's like compressing a photo from RAW to JPEG. You lose some data — but the image still looks great and takes a fraction of the space. A good JPEG at 90% quality is nearly indistinguishable from the RAW file, but it's 10× smaller. Quantization works the same way for AI models.

| Format | Bits per Weight | 70B Model Size | Fits on 24 GB? | Quality Loss |
|---|---|---|---|---|
| FP16 | 16 | 140 GB | ❌ No | None (baseline) |
| INT8 / Q8_0 | 8 | 70 GB | ❌ No | Virtually none |
| Q5_K_M | ~5.5 | ~48 GB | ❌ No | Very low |
| Q4_K_M | ~4.8 | ~42 GB | ⚠️ With offloading | Low |
| INT4 / Q4_0 | 4 | ~35 GB | ⚠️ With offloading | Moderate |
| Q2_K | ~2.7 | ~24 GB | ✅ Barely | Significant |

The magic is that modern quantization techniques lose almost nothing. The flagship recommendation for llama.cpp — Q4_K_M — stores weights in about 4.8 bits per weight and adds only +0.0535 perplexity compared to FP16 on a 7B model. That's roughly a 71% size reduction (13.0 GB down to 3.8 GB) with a quality loss so small you can't detect it in conversation.

✅ The key insight Neural network weights are surprisingly redundant. Most weights cluster around zero and don't need high precision. Quantization exploits this statistical property — it gives more precision to the values that matter and less to those that don't.

5. Number Formats Explained

To understand quantization, you need to know how computers represent numbers. Here's every format you'll encounter:

Floating-Point Formats (Training & Full Precision)

| Format | Bits | Sign | Exponent | Mantissa | Range | Use Case |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ±3.4 × 10³⁸ | Training, full-precision inference |
| FP16 | 16 | 1 | 5 | 10 | ±65,504 | Model distribution, GPU inference |
| BF16 | 16 | 1 | 8 | 7 | ±3.4 × 10³⁸ | Training (same range as FP32, less precision) |

FP32 is the gold standard — 23 bits of mantissa (precision) and 8 bits of exponent (range). It can represent incredibly precise values across a huge range. FP16 cuts both in half: less precision and a much smaller range (max value ~65,000). BF16 is a clever compromise — it keeps FP32's exponent (same range) but reduces precision, making it ideal for training where range matters more than precision.
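You can poke at these limits from Python, whose struct module supports the IEEE 754 half-precision 'e' format; this experiment is ours, not part of the article's toolchain:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(65504.0))    # 65504.0 -- the largest finite FP16 value
print(to_fp16(0.00347))    # close to, but not exactly, 0.00347
try:
    struct.pack('<e', 70000.0)   # beyond FP16's 5-bit exponent range
except OverflowError:
    print("70000 overflows FP16")
```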

Integer Formats (Quantized)

| Format | Bits | Values | Typical Use |
|---|---|---|---|
| INT8 | 8 | 256 distinct levels | High-quality quantization, almost lossless |
| INT4 | 4 | 16 distinct levels | The sweet spot for local AI |
| INT2 | 2 | 4 distinct levels | Extreme compression, significant quality loss |
| NF4 | 4 | 16 levels (non-uniform) | BitsAndBytes — levels match normal distribution |

The key difference: floating-point formats have variable precision (more precise near zero, less precise for large values), while integer formats have uniform steps. Going from FP16's 65,536 possible values to INT4's 16 values sounds catastrophic — but it works because quantization algorithms are smart about how they map values.

How Quantization Mapping Works

The simplest quantization is absmax quantization: find the maximum absolute value in a group of weights, divide all weights by that value to normalize them to [-1, 1], then multiply by 127 (for INT8) to map them to integers. To use the weights, you reverse the process (dequantize). The error comes from rounding — multiple float values map to the same integer.

Modern methods go further. Group quantization divides weights into small groups (32-128 weights each) and computes a separate scale factor per group, reducing error. K-quants (used in GGUF) use mixed precision — more important layers get higher precision (Q6_K) while less critical layers get lower precision (Q4_K). This is why Q4_K_M outperforms plain Q4_0 despite being similar size.
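A stdlib-only sketch of both ideas (function names are ours): absmax with one shared scale, then the same thing applied per group of 32 weights:

```python
def absmax_quantize(weights, levels=127):
    """Map floats to signed integers in [-levels, levels] via one shared scale."""
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

def group_quantize(weights, group=32, levels=127):
    """Per-group scales shrink the error when magnitudes vary across the tensor."""
    return [absmax_quantize(weights[i:i + group], levels)
            for i in range(0, len(weights), group)]

w = [0.00347, -1.28456, 0.5, -0.25]
q, s = absmax_quantize(w)
print(dequantize(q, s))  # close to w, but snapped to one of 255 integer levels
```

The rounding step is where the error lives: every dequantized value is within half a scale step of the original.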

6. GGUF & llama.cpp Quantization Levels

llama.cpp is the most popular tool for running quantized models locally. It uses the GGUF (GPT-Generated Unified Format) file format, which stores the quantized weights along with metadata (tokenizer, architecture info, etc.) in a single file.

GGUF replaced the older GGML format in August 2023, adding better metadata support and forward compatibility. It's now the universal standard for local AI — supported by llama.cpp, Ollama, LM Studio, Jan, and more.

The Complete GGUF Quantization Table

Here's every GGUF quantization level for a 7B parameter model, from llama.cpp's official benchmarks:

| Quant Type | Size (7B) | PPL Increase | Quality | Recommendation |
|---|---|---|---|---|
| F32 | 26.00 GB | +0.0000 | Lossless | ❌ Not recommended — too large |
| F16 | 13.00 GB | ~+0.0000 | Virtually lossless | ❌ Not recommended — too large |
| Q8_0 | 6.70 GB | +0.0004 | Indistinguishable from F16 | Use if you have VRAM to spare |
| Q6_K | 5.15 GB | +0.0044 | Extremely low loss | Best quality-per-bit |
| Q5_K_M | 4.45 GB | +0.0142 | Very low loss | ⭐ Recommended |
| Q5_K_S | 4.33 GB | +0.0353 | Low loss | ⭐ Recommended |
| Q4_K_M | 3.80 GB | +0.0535 | Balanced | ⭐ Recommended — best all-rounder |
| Q4_K_S | 3.56 GB | +0.1149 | Noticeable loss | OK for tight VRAM |
| Q3_K_L | 3.35 GB | +0.1803 | Substantial loss | Only if needed |
| Q3_K_M | 3.06 GB | +0.2437 | High quality loss | ⚠️ Noticeable degradation |
| Q3_K_S | 2.75 GB | +0.5505 | Very high loss | ⚠️ Not recommended |
| Q2_K | 2.67 GB | +0.8698 | Extreme loss | ❌ Last resort |

PPL (perplexity) measures how "surprised" the model is by text. Lower is better. The increase is relative to unquantized F16. A +0.05 increase is barely detectable; +0.25 starts to show in reasoning quality; +0.50 and above is clearly degraded.
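Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each true token. A quick sketch:

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model gave each actual next token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- like guessing among 4 tokens
print(perplexity([0.9, 0.8, 0.95]))          # near 1: the model is rarely surprised
```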

What the Letters Mean

In a name like Q4_K_M: the Q-number is the nominal bits per weight (Q4 ≈ 4-bit), the K marks the newer "k-quant" scheme that quantizes weights in small blocks with per-block scales and mixes precisions across tensors, and the trailing S/M/L (Small/Medium/Large) says how many tensors are kept at higher precision — later letters mean slightly larger files and better quality. Legacy names like Q4_0 and Q8_0 use the older, simpler uniform scheme.

💡 The sweet spot: Q4_K_M Q4_K_M is the most-recommended quant for a reason. At 3.80 GB for a 7B model, it's small enough to fit large models on consumer GPUs, and the +0.0535 perplexity increase means you won't notice any quality difference in normal use. It achieves this by using Q6_K (6-bit) for the most important weight matrices (attention value projections and feed-forward second layers) and Q4_K (4-bit) for everything else.

7. GPTQ vs AWQ vs EXL2

GGUF isn't the only quantization game in town. There are three major GPU-focused quantization methods, each with different tradeoffs:

| Method | Format | Target | Speed | Quality | Best For |
|---|---|---|---|---|---|
| GGUF | .gguf | CPU + GPU | Good (great with GPU offload) | Excellent (K-quants) | Local inference, mixed CPU/GPU, Ollama, LM Studio |
| GPTQ | .safetensors | GPU only | Fast (with Marlin kernel) | Good | GPU-only inference, vLLM, TGI servers |
| AWQ | .safetensors | GPU only | Very fast (with Marlin) | Very good | Production serving, best speed-quality ratio on GPU |
| EXL2 | .safetensors | GPU only | Fastest | Excellent | Maximum inference speed, ExLlamaV2 |

GPTQ (GPT Quantization)

Created by Frantar et al. (2022), GPTQ was one of the first practical post-training quantization methods for LLMs. It works by quantizing weights one layer at a time, using a small calibration dataset (typically 128 samples of text) to minimize the quantization error. The key innovation is using the inverse Hessian matrix to determine which weights are most important and should be quantized more carefully.

GPTQ models are GPU-only and require CUDA. They shine when paired with the Marlin kernel — a highly optimized CUDA kernel that can achieve 741 tok/s (compared to 276 tok/s without it, based on JarvisLabs benchmarks). Without Marlin, GPTQ can actually be slower than FP16.

AWQ (Activation-Aware Weight Quantization)

AWQ, developed by Lin et al. (2023) at MIT, takes a different approach. Instead of looking at the weights themselves, it analyzes the activations — the intermediate values during inference — to determine which weights matter most. Weights that produce large activations are kept at higher precision.

AWQ tends to retain slightly better quality than GPTQ at the same bit width, especially for instruction-following and chat tasks. Like GPTQ, it benefits enormously from optimized kernels (Marlin) and is primarily GPU-focused.

EXL2 (ExLlamaV2)

ExLlamaV2, created by turboderp, is designed purely for maximum GPU inference speed. Its EXL2 format supports variable bit-width quantization — you can quantize to any target bits-per-weight (e.g., 3.5, 4.25, 5.0) and the algorithm automatically distributes precision across layers based on their sensitivity.

EXL2 is often the fastest option for single-GPU inference and produces excellent quality, but it has a smaller ecosystem (primarily used through the ExLlamaV2 library or TabbyAPI).

When to Use Which

🏠 Use GGUF When...

  • You're using Ollama, LM Studio, or Jan
  • You want to split between CPU and GPU (partial offload)
  • You're running on Mac (Apple Silicon)
  • You want the simplest setup
  • You need to run on CPU-only systems

🖥️ Use GPTQ/AWQ/EXL2 When...

  • The entire model fits in GPU VRAM
  • You want maximum inference speed
  • You're running a production API server (vLLM, TGI)
  • You have NVIDIA GPU with CUDA support
  • You're willing to use more specialized tools

8. BitsAndBytes & Emerging Methods

BitsAndBytes (bnb)

BitsAndBytes, created by Tim Dettmers, is the go-to library for on-the-fly quantization in the HuggingFace ecosystem. Unlike GPTQ/AWQ/GGUF which pre-quantize the model into a file, BitsAndBytes quantizes at load time — you pass a config flag and the model loads in 4-bit or 8-bit directly.
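In code, that flag is a `BitsAndBytesConfig` passed to `from_pretrained`. A sketch (the model ID is illustrative, and actually running it needs `transformers`, `bitsandbytes`, and a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the scale factors
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb_config
)
```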

It uses NF4 (4-bit NormalFloat), a clever data type where the 16 quantization levels are spaced according to a normal distribution. Since neural network weights follow a roughly normal distribution, this means common values (near zero) get higher precision and rare extreme values get less. BitsAndBytes also supports double quantization — quantizing the quantization constants themselves, saving additional memory.
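To see why non-uniform levels help, here's a stdlib-only experiment; the codebook values are rounded from the QLoRA paper, and the comparison itself is ours:

```python
import random

# Approximate NF4 codebook: 16 levels packed tighter around zero.
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
UNIFORM = [-1.0 + 2 * i / 15 for i in range(16)]  # 16 evenly spaced levels

def mse(weights, codebook):
    """Snap each weight to its nearest level; return the mean squared error."""
    err = 0.0
    for w in weights:
        q = min(codebook, key=lambda c: abs(c - w))
        err += (w - q) ** 2
    return err / len(weights)

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
m = max(abs(x) for x in w)
w = [x / m for x in w]  # absmax-normalize into [-1, 1]

print(mse(w, NF4) < mse(w, UNIFORM))  # True: NF4 fits gaussian weights better
```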

Pros: Zero-effort quantization (just add a config flag), no calibration data needed, works with any HuggingFace model, supports QLoRA for fine-tuning. Cons: Slower than pre-quantized formats (GPTQ/AWQ with Marlin kernels), GPU-only.

EETQ (Easy and Efficient Transformer Quantization)

EETQ provides INT8 weight-only quantization with no calibration required. It's simpler than GPTQ/AWQ but limited to 8-bit. Useful when you want a quick quality-preserving size reduction without the complexity of calibrated quantization.

HQQ (Half-Quadratic Quantization)

HQQ is a newer method that achieves calibration-free quantization by optimizing weights directly. It's faster to quantize than GPTQ/AWQ (no calibration dataset needed) and achieves competitive quality at 4-bit. It's gaining traction as an alternative when you need to quantize models quickly.

Importance Matrix (imatrix) Quantization

A recent innovation in the GGUF ecosystem. An importance matrix is computed by running calibration data through the model to identify which weights contribute most to the output. This matrix is then used during quantization to allocate more precision to important weights. GGUF models quantized with imatrix (look for "imatrix" in the filename on HuggingFace) tend to be noticeably better at lower bit widths (Q3, Q2) compared to standard quantization.

9. Real Benchmarks: Quality vs. Speed vs. VRAM

Here's what actually happens when you quantize a model. These numbers are based on community benchmarks from the LocalLLaMA community, llama.cpp's official tests, and JarvisLabs' vLLM benchmarks:

Quality (Perplexity) — Llama 3 7B

| Quant | BPW | Size | PPL | PPL vs F16 | Quality Verdict |
|---|---|---|---|---|---|
| F16 | 16.0 | 13.0 GB | baseline | +0.000 | Reference |
| Q8_0 | 8.0 | 6.7 GB | ≈ baseline | +0.0004 | Identical in practice |
| Q6_K | 6.6 | 5.2 GB | ≈ baseline | +0.004 | Virtually identical |
| Q5_K_M | 5.7 | 4.5 GB | slight ↑ | +0.014 | No detectable difference |
| Q4_K_M | 4.8 | 3.8 GB | slight ↑ | +0.054 | Negligible — the sweet spot |
| Q3_K_M | 3.9 | 3.1 GB | moderate ↑ | +0.244 | Noticeable in complex reasoning |
| Q2_K | 2.7 | 2.7 GB | large ↑ | +0.870 | Clearly degraded |

Speed — Tokens per Second (Single RTX 3090)

Lower-bit quants are generally faster because they require less memory bandwidth — the primary bottleneck during inference. Every generated token has to stream the entire set of weights through the GPU, so a smaller file means proportionally more tokens per second (the table in Section 10 gives representative RTX 3090 speeds).

For larger models that require CPU offloading (partial GPU), speeds drop dramatically — a 70B Q4_K_M with half the layers on CPU might give 5-15 tok/s. This is why fitting the entire model in VRAM matters so much.
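That bandwidth bottleneck yields a useful back-of-envelope ceiling: each token must read every weight from VRAM once, so tokens/sec can't exceed memory bandwidth divided by model size. A sketch using the 3090's spec-sheet ~936 GB/s:

```python
def tok_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed: every token reads all weights once."""
    return bandwidth_gb_s / model_gb

RTX_3090_BW = 936  # GB/s, per NVIDIA's spec sheet
print(tok_per_sec_ceiling(RTX_3090_BW, 13.0))  # 7B at F16: ~72 tok/s ceiling
print(tok_per_sec_ceiling(RTX_3090_BW, 3.8))   # 7B at Q4_K_M: ~246 tok/s ceiling
```

Real speeds land below these ceilings (compute and KV-cache reads cost too), but the ratio between quants tracks the model-size ratio closely.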

Quality vs. Method (4-bit Comparison)

| Method | Format | Calibration? | Quality (MMLU) | Speed |
|---|---|---|---|---|
| GGUF Q4_K_M | .gguf | No (imatrix optional) | ~95-97% of F16 | Good |
| AWQ 4-bit | .safetensors | Yes | ~95% of F16 | Very fast (w/ Marlin) |
| GPTQ 4-bit | .safetensors | Yes | ~90-93% of F16 | Fast (w/ Marlin) |
| EXL2 4.0bpw | .safetensors | Yes | ~95-97% of F16 | Fastest |
| BnB NF4 | in-memory | No | ~93-95% of F16 | Moderate |

10. What Fits on a 24 GB RTX 3090?

This is the practical section. You've got 24 GB of VRAM. Here's exactly what you can run:

💡 Rule of thumb Reserve ~2-4 GB of VRAM for the KV cache (context window) and inference overhead. So you effectively have ~20-22 GB for model weights.

| Model | Quant | Size | Fits in 24 GB? | Speed | Quality |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q8_0 | 8.5 GB | ✅ Easily | ~60-80 tok/s | ★★★★★ |
| Llama 3.1 8B | Q4_K_M | 4.9 GB | ✅ Easily | ~90-120 tok/s | ★★★★½ |
| Mistral Nemo 12B | Q5_K_M | 8.7 GB | ✅ Easily | ~50-70 tok/s | ★★★★★ |
| Qwen 2.5 14B | Q4_K_M | 8.9 GB | ✅ Yes | ~40-60 tok/s | ★★★★½ |
| Codestral 22B | Q4_K_M | 13.2 GB | ✅ Yes | ~30-45 tok/s | ★★★★½ |
| Llama 3.3 70B | Q2_K | ~25 GB | ⚠️ Tight (needs offload) | ~5-10 tok/s | ★★★ |
| Llama 3.3 70B | Q4_K_M | ~42 GB | ❌ No (needs ~2× 3090) | — | — |
| DeepSeek-R1 (distill) 14B | Q4_K_M | ~8.9 GB | ✅ Yes | ~40-55 tok/s | ★★★★½ |

The Sweet Spots for 24 GB

🎯 Best Quality

  • 7-8B models at Q8_0 or Q6_K — near-lossless quality
  • 12-14B models at Q5_K_M — excellent quality, fits comfortably
  • 22B models at Q4_K_M — great balance

🚀 Maximum Capability

  • 32-34B models at Q3_K_M to Q4_K_S — pushing the limits
  • 70B models at Q2_K with offloading — slow but possible
  • MoE models (e.g. Mixtral 8x7B) — every expert must still fit in memory, but only the active experts run per token, so they generate at the speed of a much smaller model
✅ Recommendation for Michel's RTX 3090 Start with Llama 3.1 8B at Q8_0 or Qwen 2.5 14B at Q4_K_M. Both fit easily in 24 GB with room for large context windows. Use Ollama or LM Studio for the easiest setup — they handle GGUF files natively. As you get comfortable, try the 22B and 32B class models at Q4_K_M to push your GPU.

11. How to Choose Your Quantization

Here's a simple decision tree:

  1. Calculate model size at your target quant. Rule of thumb: size_GB ≈ parameters_B × bits_per_weight / 8. A 14B model at Q4_K_M (~4.8 bpw) = 14 × 4.8 / 8 ≈ 8.4 GB.
  2. Add 2-4 GB for KV cache and overhead. So that 8.4 GB model needs ~11-12 GB total VRAM.
  3. Does it fit? If yes, use the highest quality quant that fits. If not, either pick a smaller model or a lower quant.
  4. Pick the right format:
    • Using Ollama/LM Studio/Jan? → GGUF (Q4_K_M or Q5_K_M)
    • Running vLLM/TGI API server? → AWQ or GPTQ
    • Maximum speed, single GPU? → EXL2 via ExLlamaV2
    • Fine-tuning with QLoRA? → BitsAndBytes NF4
  5. Always prefer a smaller model at higher quant over a larger model at lower quant. A 14B Q5_K_M will generally outperform a 70B Q2_K — the extreme compression destroys too much quality.
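Steps 1-3 above fold into a tiny helper (the names and the 3 GB overhead default are our own):

```python
def vram_needed_gb(params_b: float, bpw: float, overhead_gb: float = 3.0) -> float:
    """Weights (params × bits / 8) plus a KV-cache/overhead allowance."""
    return params_b * bpw / 8 + overhead_gb

def fits(params_b: float, bpw: float, vram_gb: float = 24.0) -> bool:
    return vram_needed_gb(params_b, bpw) <= vram_gb

print(vram_needed_gb(14, 4.8))  # 14B @ Q4_K_M: ~11.4 GB total
print(fits(14, 4.8))            # True on a 24 GB RTX 3090
print(fits(70, 4.8))            # False: ~45 GB needed
```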

12. Getting Started

With Ollama (Easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a quantized model (Ollama handles GGUF automatically)
ollama run llama3.1:8b          # Default quant (Q4_K_M)
ollama run llama3.1:8b-q8_0     # Higher quality quant
ollama run qwen2.5:14b          # 14B model, Q4_K_M
ollama run deepseek-r1:14b      # DeepSeek R1 distill

With llama.cpp (More Control)

# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && make -j GGML_CUDA=1

# Download a GGUF from HuggingFace (e.g., from TheBloke or bartowski)
# Then run:
./llama-server -m model-Q4_K_M.gguf -ngl 99 -c 4096

# -ngl 99 = offload all layers to GPU
# -c 4096 = context length

With LM Studio (GUI)

Download LM Studio, search for any model, and it shows available GGUF quants with size estimates. Click download, click run. It automatically detects your GPU and offloads as many layers as possible.

Quantizing Your Own Model

# Using llama.cpp's quantize tool
./llama-quantize input-model-f16.gguf output-Q4_K_M.gguf Q4_K_M

# With importance matrix (better quality at low bits)
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-f16.gguf output-Q4_K_M.gguf Q4_K_M

References

  1. Frantar, E., et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," arXiv:2210.17323, 2022.
  2. Lin, J., et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," arXiv:2306.00978, 2023.
  3. Dettmers, T., et al., "QLoRA: Efficient Finetuning of Quantized Language Models," arXiv:2305.14314, 2023.
  4. ggml-org, "llama.cpp Quantization Types," GitHub Discussion #2094.
  5. Maarten Grootendorst, "Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)," Substack, 2023.
  6. turboderp, "ExLlamaV2," GitHub.
  7. Tim Dettmers, "BitsAndBytes," GitHub.
  8. JarvisLabs, "The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices," JarvisLabs Docs, 2026.
  9. Ionio AI, "LLMs on CPU: The Power of Quantization with GGUF, AWQ, & GPTQ," ionio.ai.
  10. Hardware Corner, "Quantization for Local LLMs: How It Works and Which Formats Fit Your Setup," hardware-corner.net, 2025.
  11. LocalLLM.in, "The Complete Guide to LLM Quantization," localllm.in, 2025.
  12. matt-c1, "Llama 3 Quantization Comparison," GitHub.
  13. NVIDIA, "RTX 3090 Specifications," nvidia.com.


This article was written collaboratively by Michel (human) and Karibe (AI research agent) as part of ThinkSmart.Life's research initiative. Data reflects February 2026 benchmarks and may evolve as new quantization methods emerge.
