
1. Why Quantization Matters

If you're building a local AI rig — maybe you just picked up an RTX 3090 — you've probably noticed a frustrating gap between the models you want to run and the models that actually fit on your GPU. The cutting-edge open models like Llama 3.1 70B, DeepSeek V3, or Mixtral 8x22B are staggeringly large. At full precision, a 70-billion-parameter model needs 140 GB of memory just to load. Your RTX 3090 has 24 GB.

Quantization is the bridge. It's the single most important technique that makes local AI practical on consumer hardware. Without it, you'd need $15,000+ worth of enterprise GPUs. With it, you can run powerful models on a single graphics card you bought for under $800.

This guide starts from absolute zero — no ML background required — and goes deep enough that you'll understand exactly what's happening to those numbers and why it works so well.

2. What Are Model Weights?

An AI language model is, at its core, a gigantic spreadsheet of numbers. These numbers are called weights (or parameters), and they encode everything the model learned during training — its vocabulary, its understanding of grammar, its knowledge of history, its ability to write code.

When someone says "Llama 3 70B," the "70B" means 70 billion weights. Each weight is a decimal number like 0.00347 or -1.28456. During inference (when the model generates text), these numbers are multiplied together in enormous matrix operations — billions of multiply-and-add operations for every single token the model produces.

How Weights Are Stored

Computers represent decimal numbers using floating-point formats. The two you'll meet most often are FP32 (32 bits, 4 bytes per weight) and FP16 (16 bits, 2 bytes per weight); Section 5 breaks down exactly how their bits are allocated.

Think of it like image formats. FP32 is like a RAW photo from a DSLR — maximum quality, massive file. FP16 is like a high-quality TIFF — half the size, virtually identical to the eye. These are the "full precision" formats that models are trained in.

3. The Memory Problem

Here's the math that makes quantization essential:

| Model | Parameters | FP32 Size | FP16 Size | RTX 3090 (24 GB) |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion | 32 GB | 16 GB | ✅ Fits in FP16 |
| Mistral 7B | 7.3 billion | 29 GB | 14.6 GB | ✅ Fits in FP16 |
| Llama 3.1 70B | 70 billion | 280 GB | 140 GB | ❌ 5.8× too large |
| Mixtral 8x22B | 141 billion | 564 GB | 282 GB | ❌ 11.8× too large |
| DeepSeek V3 | 671 billion | 2,684 GB | 1,342 GB | ❌ 55.9× too large |

The formula is simple: bytes = parameters × bytes_per_weight. At FP16 (2 bytes each), a 70B model is 70 × 2 = 140 GB. Your GPU has 24 GB. The model literally doesn't fit.
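This arithmetic is worth scripting once. A minimal sketch (the function name is our own):

```python
def model_size_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Weight storage only: excludes KV cache and runtime overhead."""
    return params_billion * bytes_per_weight

print(model_size_gb(70, 2))  # FP16 (2 bytes/weight): 140.0 GB
print(model_size_gb(8, 2))   # an 8B model in FP16: 16.0 GB
```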

⚠️ VRAM is the bottleneck Unlike system RAM, you can't just add more VRAM. The RTX 3090 has 24 GB soldered to the board. That's it. The only way to fit larger models is to make the model smaller — which is exactly what quantization does.

4. What Quantization Does

Quantization reduces the numerical precision of each weight. Instead of storing every weight as a 16-bit float (2 bytes), you store it as an 8-bit integer (1 byte) or even a 4-bit integer (0.5 bytes). The model gets smaller proportionally.

The JPEG analogy: It's like compressing a photo from RAW to JPEG. You lose some data — but the image still looks great and takes a fraction of the space. A good JPEG at 90% quality is nearly indistinguishable from the RAW file, but it's 10× smaller. Quantization works the same way for AI models.

| Format | Bits per Weight | 70B Model Size | Fits on 24 GB? | Quality Loss |
|---|---|---|---|---|
| FP16 | 16 | 140 GB | ❌ No | None (baseline) |
| INT8 / Q8_0 | 8 | 70 GB | ❌ No | Virtually none |
| Q5_K_M | ~5.5 | ~48 GB | ❌ No | Very low |
| Q4_K_M | ~4.8 | ~42 GB | ⚠️ With offloading | Low |
| INT4 / Q4_0 | 4 | ~35 GB | ⚠️ With offloading | Moderate |
| Q2_K | ~2.7 | ~24 GB | ✅ Barely | Significant |

The magic is that modern quantization techniques lose almost nothing. The flagship recommendation for llama.cpp — Q4_K_M — stores weights in about 4.8 bits per weight and adds only +0.0535 perplexity compared to FP16 on a 7B model. That's roughly a 71% size reduction (13.0 GB down to 3.8 GB) with a quality loss so small you can't detect it in conversation.

✅ The key insight Neural network weights are surprisingly redundant. Most weights cluster around zero and don't need high precision. Quantization exploits this statistical property — it gives more precision to the values that matter and less to those that don't.

5. Number Formats Explained

To understand quantization, you need to know how computers represent numbers. Here's every format you'll encounter:

Floating-Point Formats (Training & Full Precision)

| Format | Bits | Sign | Exponent | Mantissa | Range | Use Case |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ±3.4 × 10³⁸ | Training, full-precision inference |
| FP16 | 16 | 1 | 5 | 10 | ±65,504 | Model distribution, GPU inference |
| BF16 | 16 | 1 | 8 | 7 | ±3.4 × 10³⁸ | Training (same range as FP32, less precision) |

FP32 is the gold standard — 23 bits of mantissa (precision) and 8 bits of exponent (range). It can represent incredibly precise values across a huge range. FP16 cuts both in half: less precision and a much smaller range (max value ~65,000). BF16 is a clever compromise — it keeps FP32's exponent (same range) but reduces precision, making it ideal for training where range matters more than precision.
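You can poke at these limits from Python, whose struct module supports the IEEE 754 half-precision 'e' format; this experiment is ours, not part of the article's toolchain:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(65504.0))    # 65504.0 -- the largest finite FP16 value
print(to_fp16(0.00347))    # close to, but not exactly, 0.00347
try:
    struct.pack('<e', 70000.0)   # beyond FP16's 5-bit exponent range
except OverflowError:
    print("70000 overflows FP16")
```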

Integer Formats (Quantized)

| Format | Bits | Values | Typical Use |
|---|---|---|---|
| INT8 | 8 | 256 distinct levels | High-quality quantization, almost lossless |
| INT4 | 4 | 16 distinct levels | The sweet spot for local AI |
| INT2 | 2 | 4 distinct levels | Extreme compression, significant quality loss |
| NF4 | 4 | 16 levels (non-uniform) | BitsAndBytes — levels match normal distribution |

The key difference: floating-point formats have variable precision (more precise near zero, less precise for large values), while integer formats have uniform steps. Going from FP16's 65,536 possible values to INT4's 16 values sounds catastrophic — but it works because quantization algorithms are smart about how they map values.

How Quantization Mapping Works

The simplest quantization is absmax quantization: find the maximum absolute value in a group of weights, divide all weights by that value to normalize them to [-1, 1], then multiply by 127 (for INT8) to map them to integers. To use the weights, you reverse the process (dequantize). The error comes from rounding — multiple float values map to the same integer.

Modern methods go further. Group quantization divides weights into small groups (32-128 weights each) and computes a separate scale factor per group, reducing error. K-quants (used in GGUF) use mixed precision — more important layers get higher precision (Q6_K) while less critical layers get lower precision (Q4_K). This is why Q4_K_M outperforms plain Q4_0 despite being similar size.
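A stdlib-only sketch of both ideas (function names are ours): absmax with one shared scale, then the same thing applied per group of 32 weights:

```python
def absmax_quantize(weights, levels=127):
    """Map floats to signed integers in [-levels, levels] via one shared scale."""
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

def group_quantize(weights, group=32, levels=127):
    """Per-group scales shrink the error when magnitudes vary across the tensor."""
    return [absmax_quantize(weights[i:i + group], levels)
            for i in range(0, len(weights), group)]

w = [0.00347, -1.28456, 0.5, -0.25]
q, s = absmax_quantize(w)
print(dequantize(q, s))  # close to w, but snapped to one of 255 integer levels
```

The rounding step is where the error lives: every dequantized value is within half a scale step of the original.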

6. GGUF & llama.cpp Quantization Levels

llama.cpp is the most popular tool for running quantized models locally. It uses the GGUF (GPT-Generated Unified Format) file format, which stores the quantized weights along with metadata (tokenizer, architecture info, etc.) in a single file.

GGUF replaced the older GGML format in August 2023, adding better metadata support and forward compatibility. It's now the universal standard for local AI — supported by llama.cpp, Ollama, LM Studio, Jan, and more.

The Complete GGUF Quantization Table

Here's every GGUF quantization level for a 7B parameter model, from llama.cpp's official benchmarks:

| Quant Type | Size (7B) | PPL Increase | Quality | Recommendation |
|---|---|---|---|---|
| F32 | 26.00 GB | +0.0000 | Lossless | ❌ Not recommended — too large |
| F16 | 13.00 GB | ~+0.0000 | Virtually lossless | ❌ Not recommended — too large |
| Q8_0 | 6.70 GB | +0.0004 | Indistinguishable from F16 | Use if you have VRAM to spare |
| Q6_K | 5.15 GB | +0.0044 | Extremely low loss | Best quality-per-bit |
| Q5_K_M | 4.45 GB | +0.0142 | Very low loss | ⭐ Recommended |
| Q5_K_S | 4.33 GB | +0.0353 | Low loss | ⭐ Recommended |
| Q4_K_M | 3.80 GB | +0.0535 | Balanced | ⭐ Recommended — best all-rounder |
| Q4_K_S | 3.56 GB | +0.1149 | Noticeable loss | OK for tight VRAM |
| Q3_K_L | 3.35 GB | +0.1803 | Substantial loss | Only if needed |
| Q3_K_M | 3.06 GB | +0.2437 | High quality loss | ⚠️ Noticeable degradation |
| Q3_K_S | 2.75 GB | +0.5505 | Very high loss | ⚠️ Not recommended |
| Q2_K | 2.67 GB | +0.8698 | Extreme loss | ❌ Last resort |

PPL (perplexity) measures how "surprised" the model is by text. Lower is better. The increase is relative to unquantized F16. A +0.05 increase is barely detectable; +0.25 starts to show in reasoning quality; +0.50 and above is clearly degraded.
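Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each true token. A quick sketch:

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model gave each actual next token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0 -- like guessing among 4 tokens
print(perplexity([0.9, 0.8, 0.95]))          # near 1: the model is rarely surprised
```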

What the Letters Mean

In a name like Q4_K_M: the Q-number is the nominal bits per weight (Q4 ≈ 4-bit), the K marks the newer "k-quant" scheme that quantizes weights in small blocks with per-block scales and mixes precisions across tensors, and the trailing S/M/L (Small/Medium/Large) says how many tensors are kept at higher precision — later letters mean slightly larger files and better quality. Legacy names like Q4_0 and Q8_0 use the older, simpler uniform scheme.

💡 The sweet spot: Q4_K_M Q4_K_M is the most-recommended quant for a reason. At 3.80 GB for a 7B model, it's small enough to fit large models on consumer GPUs, and the +0.0535 perplexity increase means you won't notice any quality difference in normal use. It achieves this by using Q6_K (6-bit) for the most important weight matrices (attention value projections and feed-forward second layers) and Q4_K (4-bit) for everything else.

7. GPTQ vs AWQ vs EXL2

GGUF isn't the only quantization game in town. There are three major GPU-focused quantization methods, each with different tradeoffs:

| Method | Format | Target | Speed | Quality | Best For |
|---|---|---|---|---|---|
| GGUF | .gguf | CPU + GPU | Good (great with GPU offload) | Excellent (K-quants) | Local inference, mixed CPU/GPU, Ollama, LM Studio |
| GPTQ | .safetensors | GPU only | Fast (with Marlin kernel) | Good | GPU-only inference, vLLM, TGI servers |
| AWQ | .safetensors | GPU only | Very fast (with Marlin) | Very good | Production serving, best speed-quality ratio on GPU |
| EXL2 | .safetensors | GPU only | Fastest | Excellent | Maximum inference speed, ExLlamaV2 |

GPTQ (GPT Quantization)

Created by Frantar et al. (2022), GPTQ was one of the first practical post-training quantization methods for LLMs. It works by quantizing weights one layer at a time, using a small calibration dataset (typically 128 samples of text) to minimize the quantization error. The key innovation is using the inverse Hessian matrix to determine which weights are most important and should be quantized more carefully.

GPTQ models are GPU-only and require CUDA. They shine when paired with the Marlin kernel — a highly optimized CUDA kernel that can achieve 741 tok/s (compared to 276 tok/s without it, based on JarvisLabs benchmarks). Without Marlin, GPTQ can actually be slower than FP16.

AWQ (Activation-Aware Weight Quantization)

AWQ, developed by Lin et al. (2023) at MIT, takes a different approach. Instead of looking at the weights themselves, it analyzes the activations — the intermediate values during inference — to determine which weights matter most. Weights that produce large activations are kept at higher precision.

AWQ tends to retain slightly better quality than GPTQ at the same bit width, especially for instruction-following and chat tasks. Like GPTQ, it benefits enormously from optimized kernels (Marlin) and is primarily GPU-focused.

EXL2 (ExLlamaV2)

ExLlamaV2, created by turboderp, is designed purely for maximum GPU inference speed. Its EXL2 format supports variable bit-width quantization — you can quantize to any target bits-per-weight (e.g., 3.5, 4.25, 5.0) and the algorithm automatically distributes precision across layers based on their sensitivity.

EXL2 is often the fastest option for single-GPU inference and produces excellent quality, but it has a smaller ecosystem (primarily used through the ExLlamaV2 library or TabbyAPI).

When to Use Which

🏠 Use GGUF When...

  • You're using Ollama, LM Studio, or Jan
  • You want to split between CPU and GPU (partial offload)
  • You're running on Mac (Apple Silicon)
  • You want the simplest setup
  • You need to run on CPU-only systems

🖥️ Use GPTQ/AWQ/EXL2 When...

  • The entire model fits in GPU VRAM
  • You want maximum inference speed
  • You're running a production API server (vLLM, TGI)
  • You have NVIDIA GPU with CUDA support
  • You're willing to use more specialized tools

8. BitsAndBytes & Emerging Methods

BitsAndBytes (bnb)

BitsAndBytes, created by Tim Dettmers, is the go-to library for on-the-fly quantization in the HuggingFace ecosystem. Unlike GPTQ/AWQ/GGUF which pre-quantize the model into a file, BitsAndBytes quantizes at load time — you pass a config flag and the model loads in 4-bit or 8-bit directly.
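In code, that flag is a `BitsAndBytesConfig` passed to `from_pretrained`. A sketch (the model ID is illustrative, and actually running it needs `transformers`, `bitsandbytes`, and a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the scale factors
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb_config
)
```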

It uses NF4 (4-bit NormalFloat), a clever data type where the 16 quantization levels are spaced according to a normal distribution. Since neural network weights follow a roughly normal distribution, this means common values (near zero) get higher precision and rare extreme values get less. BitsAndBytes also supports double quantization — quantizing the quantization constants themselves, saving additional memory.
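To see why non-uniform levels help, here's a stdlib-only experiment; the codebook values are rounded from the QLoRA paper, and the comparison itself is ours:

```python
import random

# Approximate NF4 codebook: 16 levels packed tighter around zero.
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
UNIFORM = [-1.0 + 2 * i / 15 for i in range(16)]  # 16 evenly spaced levels

def mse(weights, codebook):
    """Snap each weight to its nearest level; return the mean squared error."""
    err = 0.0
    for w in weights:
        q = min(codebook, key=lambda c: abs(c - w))
        err += (w - q) ** 2
    return err / len(weights)

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
m = max(abs(x) for x in w)
w = [x / m for x in w]  # absmax-normalize into [-1, 1]

print(mse(w, NF4) < mse(w, UNIFORM))  # True: NF4 fits gaussian weights better
```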

Pros: Zero-effort quantization (just add a config flag), no calibration data needed, works with any HuggingFace model, supports QLoRA for fine-tuning. Cons: Slower than pre-quantized formats (GPTQ/AWQ with Marlin kernels), GPU-only.

EETQ (Easy and Efficient Transformer Quantization)

EETQ provides INT8 weight-only quantization with no calibration required. It's simpler than GPTQ/AWQ but limited to 8-bit. Useful when you want a quick quality-preserving size reduction without the complexity of calibrated quantization.

HQQ (Half-Quadratic Quantization)

HQQ is a newer method that achieves calibration-free quantization by optimizing weights directly. It's faster to quantize than GPTQ/AWQ (no calibration dataset needed) and achieves competitive quality at 4-bit. It's gaining traction as an alternative when you need to quantize models quickly.

Importance Matrix (imatrix) Quantization

A recent innovation in the GGUF ecosystem. An importance matrix is computed by running calibration data through the model to identify which weights contribute most to the output. This matrix is then used during quantization to allocate more precision to important weights. GGUF models quantized with imatrix (look for "imatrix" in the filename on HuggingFace) tend to be noticeably better at lower bit widths (Q3, Q2) compared to standard quantization.

9. Real Benchmarks: Quality vs. Speed vs. VRAM

Here's what actually happens when you quantize a model. These numbers are based on community benchmarks from the LocalLLaMA community, llama.cpp's official tests, and JarvisLabs' vLLM benchmarks:

Quality (Perplexity) — Llama 3 7B

| Quant | BPW | Size | PPL | PPL vs F16 | Quality Verdict |
|---|---|---|---|---|---|
| F16 | 16.0 | 13.0 GB | baseline | +0.000 | Reference |
| Q8_0 | 8.0 | 6.7 GB | ≈ baseline | +0.0004 | Identical in practice |
| Q6_K | 6.6 | 5.2 GB | ≈ baseline | +0.004 | Virtually identical |
| Q5_K_M | 5.7 | 4.5 GB | slight ↑ | +0.014 | No detectable difference |
| Q4_K_M | 4.8 | 3.8 GB | slight ↑ | +0.054 | Negligible — the sweet spot |
| Q3_K_M | 3.9 | 3.1 GB | moderate ↑ | +0.244 | Noticeable in complex reasoning |
| Q2_K | 2.7 | 2.7 GB | large ↑ | +0.870 | Clearly degraded |

Speed — Tokens per Second (Single RTX 3090)

Lower-bit quants are generally faster because they require less memory bandwidth — the primary bottleneck during inference. Every generated token has to stream the entire set of weights through the GPU, so a smaller file means proportionally more tokens per second (the table in Section 10 gives representative RTX 3090 speeds).

For larger models that require CPU offloading (partial GPU), speeds drop dramatically — a 70B Q4_K_M with half the layers on CPU might give 5-15 tok/s. This is why fitting the entire model in VRAM matters so much.
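That bandwidth bottleneck yields a useful back-of-envelope ceiling: each token must read every weight from VRAM once, so tokens/sec can't exceed memory bandwidth divided by model size. A sketch using the 3090's spec-sheet ~936 GB/s:

```python
def tok_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed: every token reads all weights once."""
    return bandwidth_gb_s / model_gb

RTX_3090_BW = 936  # GB/s, per NVIDIA's spec sheet
print(tok_per_sec_ceiling(RTX_3090_BW, 13.0))  # 7B at F16: ~72 tok/s ceiling
print(tok_per_sec_ceiling(RTX_3090_BW, 3.8))   # 7B at Q4_K_M: ~246 tok/s ceiling
```

Real speeds land below these ceilings (compute and KV-cache reads cost too), but the ratio between quants tracks the model-size ratio closely.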

Quality vs. Method (4-bit Comparison)

| Method | Format | Calibration? | Quality (MMLU) | Speed |
|---|---|---|---|---|
| GGUF Q4_K_M | .gguf | No (imatrix optional) | ~95-97% of F16 | Good |
| AWQ 4-bit | .safetensors | Yes | ~95% of F16 | Very fast (w/ Marlin) |
| GPTQ 4-bit | .safetensors | Yes | ~90-93% of F16 | Fast (w/ Marlin) |
| EXL2 4.0bpw | .safetensors | Yes | ~95-97% of F16 | Fastest |
| BnB NF4 | in-memory | No | ~93-95% of F16 | Moderate |

10. What Fits on a 24 GB RTX 3090?

This is the practical section. You've got 24 GB of VRAM. Here's exactly what you can run:

💡 Rule of thumb Reserve ~2-4 GB of VRAM for the KV cache (context window) and inference overhead. So you effectively have ~20-22 GB for model weights.

| Model | Quant | Size | Fits in 24 GB? | Speed | Quality |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q8_0 | 8.5 GB | ✅ Easily | ~60-80 tok/s | ★★★★★ |
| Llama 3.1 8B | Q4_K_M | 4.9 GB | ✅ Easily | ~90-120 tok/s | ★★★★½ |
| Mistral Nemo 12B | Q5_K_M | 8.7 GB | ✅ Easily | ~50-70 tok/s | ★★★★★ |
| Qwen 2.5 14B | Q4_K_M | 8.9 GB | ✅ Yes | ~40-60 tok/s | ★★★★½ |
| Codestral 22B | Q4_K_M | 13.2 GB | ✅ Yes | ~30-45 tok/s | ★★★★½ |
| Llama 3.3 70B | Q2_K | ~25 GB | ⚠️ Tight (needs offload) | ~5-10 tok/s | ★★★ |
| Llama 3.3 70B | Q4_K_M | ~42 GB | ❌ No (needs ~2× 3090) | — | — |
| DeepSeek-R1 (distill) 14B | Q4_K_M | ~8.9 GB | ✅ Yes | ~40-55 tok/s | ★★★★½ |

The Sweet Spots for 24 GB

🎯 Best Quality

  • 7-8B models at Q8_0 or Q6_K — near-lossless quality
  • 12-14B models at Q5_K_M — excellent quality, fits comfortably
  • 22B models at Q4_K_M — great balance

🚀 Maximum Capability

  • 32-34B models at Q3_K_M to Q4_K_S — pushing the limits
  • 70B models at Q2_K with offloading — slow but possible
  • MoE models (e.g. Mixtral 8x7B) — every expert must still fit in memory, but only the active experts run per token, so they generate at the speed of a much smaller model
✅ Recommendation for Michel's RTX 3090 Start with Llama 3.1 8B at Q8_0 or Qwen 2.5 14B at Q4_K_M. Both fit easily in 24 GB with room for large context windows. Use Ollama or LM Studio for the easiest setup — they handle GGUF files natively. As you get comfortable, try the 22B and 32B class models at Q4_K_M to push your GPU.

11. How to Choose Your Quantization

Here's a simple decision tree:

  1. Calculate model size at your target quant. Rule of thumb: size_GB ≈ parameters_B × bits_per_weight / 8. A 14B model at Q4_K_M (~4.8 bpw) = 14 × 4.8 / 8 ≈ 8.4 GB.
  2. Add 2-4 GB for KV cache and overhead. So that 8.4 GB model needs ~11-12 GB total VRAM.
  3. Does it fit? If yes, use the highest quality quant that fits. If not, either pick a smaller model or a lower quant.
  4. Pick the right format:
    • Using Ollama/LM Studio/Jan? → GGUF (Q4_K_M or Q5_K_M)
    • Running vLLM/TGI API server? → AWQ or GPTQ
    • Maximum speed, single GPU? → EXL2 via ExLlamaV2
    • Fine-tuning with QLoRA? → BitsAndBytes NF4
  5. Always prefer a smaller model at higher quant over a larger model at lower quant. A 14B Q5_K_M will generally outperform a 70B Q2_K — the extreme compression destroys too much quality.
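Steps 1-3 above fold into a tiny helper (the names and the 3 GB overhead default are our own):

```python
def vram_needed_gb(params_b: float, bpw: float, overhead_gb: float = 3.0) -> float:
    """Weights (params × bits / 8) plus a KV-cache/overhead allowance."""
    return params_b * bpw / 8 + overhead_gb

def fits(params_b: float, bpw: float, vram_gb: float = 24.0) -> bool:
    return vram_needed_gb(params_b, bpw) <= vram_gb

print(vram_needed_gb(14, 4.8))  # 14B @ Q4_K_M: ~11.4 GB total
print(fits(14, 4.8))            # True on a 24 GB RTX 3090
print(fits(70, 4.8))            # False: ~45 GB needed
```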

12. Getting Started

With Ollama (Easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a quantized model (Ollama handles GGUF automatically)
ollama run llama3.1:8b          # Default quant (Q4_K_M)
ollama run llama3.1:8b-q8_0     # Higher quality quant
ollama run qwen2.5:14b          # 14B model, Q4_K_M
ollama run deepseek-r1:14b      # DeepSeek R1 distill

With llama.cpp (More Control)

# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && make -j GGML_CUDA=1

# Download a GGUF from HuggingFace (e.g., from TheBloke or bartowski)
# Then run:
./llama-server -m model-Q4_K_M.gguf -ngl 99 -c 4096

# -ngl 99 = offload all layers to GPU
# -c 4096 = context length

With LM Studio (GUI)

Download LM Studio, search for any model, and it shows available GGUF quants with size estimates. Click download, click run. It automatically detects your GPU and offloads as many layers as possible.

Quantizing Your Own Model

# Using llama.cpp's quantize tool
./llama-quantize input-model-f16.gguf output-Q4_K_M.gguf Q4_K_M

# With importance matrix (better quality at low bits)
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-f16.gguf output-Q4_K_M.gguf Q4_K_M

References

  1. Frantar, E., et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," arXiv:2210.17323, 2022.
  2. Lin, J., et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," arXiv:2306.00978, 2023.
  3. Dettmers, T., et al., "QLoRA: Efficient Finetuning of Quantized Language Models," arXiv:2305.14314, 2023.
  4. ggml-org, "llama.cpp Quantization Types," GitHub Discussion #2094.
  5. Maarten Grootendorst, "Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)," Substack, 2023.
  6. turboderp, "ExLlamaV2," GitHub.
  7. Tim Dettmers, "BitsAndBytes," GitHub.
  8. JarvisLabs, "The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices," JarvisLabs Docs, 2026.
  9. Ionio AI, "LLMs on CPU: The Power of Quantization with GGUF, AWQ, & GPTQ," ionio.ai.
  10. Hardware Corner, "Quantization for Local LLMs: How It Works and Which Formats Fit Your Setup," hardware-corner.net, 2025.
  11. LocalLLM.in, "The Complete Guide to LLM Quantization," localllm.in, 2025.
  12. matt-c1, "Llama 3 Quantization Comparison," GitHub.
  13. NVIDIA, "RTX 3090 Specifications," nvidia.com.


This article was written collaboratively by Michel (human) and Karibe (AI research agent) as part of ThinkSmart.Life's research initiative. Data reflects February 2026 benchmarks and may evolve as new quantization methods emerge.
