Why Model Format Is Not a Minor Detail
When you download a language model to run locally, you're choosing between two fundamentally different ways of storing and loading the model weights. That choice affects inference speed, memory usage, model availability, and how well the model uses your hardware.
The model format determines:
- How quickly your model loads: A single binary file with optimized offsets vs. loading dozens of weight shards from disk
- Whether your GPU actually helps: Some formats work better with Metal GPU acceleration, others are CPU-optimized
- What hardware you can run on: GGUF runs on NVIDIA, AMD, Apple Silicon, or even bare CPU. MLX is Apple-only.
- Model availability: Some formats get community-converted models within hours of a new release; others wait days or weeks
- Quantization quality: The math behind how weights get compressed affects both speed and output quality
For a software engineer with a Mac Studio M3 Ultra and a need to run large language models locally, understanding these differences isn't academic — it's the difference between a smooth interactive experience and one that feels sluggish and broken.
GGUF Internals: How It Stores a Model
GGUF stands for "GPT-Generated Unified Format" (though it evolved from Georgi Gerganov's earlier GGML project). It was created specifically for the llama.cpp inference engine and has become the de facto standard for local LLM deployment.
The GGUF File Structure
A GGUF file is a single self-contained binary with four main components:
- Header (24 bytes): The magic number GGUF plus a version number, which tells the reader which format specification to use, followed by the tensor count and the metadata key-value count
- Metadata (key-value pairs): Flexible schema describing the model architecture, context length, tokenizer information, and quantization parameters
- Tensor information: For each tensor: name, shape, and file offset indicating where that tensor's data begins in the file
- Weight data (quantized): The actual model weights, grouped and aligned to 32-byte boundaries for optimal memory access
This block-based structure means the file is fully seekable — you can jump to any tensor without reading the whole file. This is crucial for loading large models efficiently, especially when you're running out of memory and need to selectively load layers.
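To make the header layout concrete, here is a minimal sketch that parses the fixed 24-byte prefix. It assumes the v2/v3 little-endian layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata count); the function name and the synthetic header values are illustrative, not taken from llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed v2/v3 layout, little-endian:
# magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
HEADER_STRUCT = struct.Struct("<4sIQQ")


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size prefix of a GGUF file."""
    if len(data) < HEADER_STRUCT.size:
        raise ValueError(f"need at least {HEADER_STRUCT.size} bytes")
    magic, version, n_tensors, n_kv = HEADER_STRUCT.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }


# Synthetic header for demonstration (the counts are made up).
example = HEADER_STRUCT.pack(GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(example)
print(header)
```

For a real file you would pass the first 24 bytes, for example `open(path, "rb").read(24)`, and then walk the metadata and tensor-info sections that follow.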
The Quantization Design Philosophy
GGUF's design centers on block-based quantization. Rather than quantizing all weights together, GGUF groups weights into blocks of either 32 or 256 values, with each block having its own scale factor. This allows the format to adapt its precision to local weight distributions — some parts of a neural network are more sensitive to quantization than others, and this design acknowledges that.
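Every block has a scale factor and, in the asymmetric variants, a minimum (offset) value. In the legacy formats these are stored as 16-bit floats (FP16); K-Quants store compact low-bit sub-block scales under an FP16 super-block scale. The quantized weights themselves are stored as compact integers (2-bit, 3-bit, 4-bit, 5-bit, 6-bit, or 8-bit depending on the quantization level). Because the scale and offset are stored per-block, not per-weight, their cost is amortized over the whole block, which is the key compression mechanism.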
This design choice — per-block quantization — is what gives GGUF its flexibility. Different quantization families (Legacy, K-Quants, I-Quants) apply different strategies to this basic block structure, trading off between quality, speed, and compression ratio.
The Quantization Math: From Float to 4-Bit
At its core, quantization is a simple mathematical transformation: compressing high-precision floating-point weights into lower-bit integers. But the details matter enormously for both speed and output quality.
The Basic Quantization Formula
The quantization and dequantization formulas are:
Quantize: q = round((weight - zero_point) / scale)
Dequantize: weight' = scale × q + zero_point
Where:
- weight is the original 16-bit floating-point weight (FP16)
- q is the quantized integer value (2 to 8 bits wide, depending on the scheme)
- scale is a floating-point scale factor (FP16 in the legacy formats) that maps the weight range to the integer range
- zero_point is an offset that shifts the weight range to align with the integer encoding (used in asymmetric quantization)
Per-Block vs. Per-Weight Scale
GGUF uses per-block scales, meaning each block of weights shares a single scale factor. This is different from per-weight quantization (where every weight has its own scale) or per-layer quantization (where an entire layer shares one scale).
For a typical block size of 32 weights:
- FP16 storage: Each weight takes 2 bytes (16 bits) — total for 32 weights = 64 bytes
- Q4_0 storage: Each weight takes 4 bits = 0.5 bytes. 32 weights = 16 bytes. Plus a 2-byte FP16 scale factor = 18 bytes total (4.5 bits per weight). This is a 3.56:1 compression ratio for this block.
- Q8_0 storage: Each weight takes 8 bits = 1 byte. 32 weights = 32 bytes. Plus a 2-byte FP16 scale = 34 bytes total (8.5 bits per weight). 1.88:1 compression.
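The per-block arithmetic can be sketched as a small helper. The scale width is an assumption here: the example passes a 2-byte FP16 scale, which is believed to match llama.cpp's Q4_0 and Q8_0 block layouts; pass a different `overhead_bytes` to model other layouts. The function names are illustrative.

```python
def block_bytes(n_weights: int, bits_per_weight: int, overhead_bytes: int) -> int:
    """Bytes needed for one quantized block: packed weights plus per-block overhead."""
    payload_bits = n_weights * bits_per_weight
    if payload_bits % 8:
        raise ValueError("packed weights must fill whole bytes")
    return payload_bits // 8 + overhead_bytes


def summarize(n_weights: int, bits_per_weight: int, overhead_bytes: int):
    """Return (block size in bytes, effective bits per weight, ratio vs. FP16)."""
    size = block_bytes(n_weights, bits_per_weight, overhead_bytes)
    fp16_size = n_weights * 2  # FP16 baseline: 2 bytes per weight
    return size, size * 8 / n_weights, fp16_size / size


# Assumption: one 2-byte FP16 scale per 32-weight block.
for name, bits in (("4-bit block", 4), ("8-bit block", 8)):
    size, bpw, ratio = summarize(32, bits, overhead_bytes=2)
    print(f"{name}: {size} bytes, {bpw} bits/weight, {ratio:.2f}:1 vs FP16")
```

The useful number to carry around is the effective bits per weight, because it folds the scale overhead into the figure that actually determines file size.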
Type-0 vs. Type-1 Quantization
GGUF supports two quantization types, distinguished by how they handle the zero point:
Type-0 (Symmetric Quantization)
weight = scale × q
This is the simplest form: no zero point, just a scale factor. Weights are centered around zero and scaled to fit the integer range. Examples: Q4_0, Q5_0, Q8_0. This is symmetric because the quantization range is symmetric around zero.
Type-1 (Asymmetric Quantization)
weight = scale × q + minimum
This includes a minimum offset (similar to zero_point but expressed differently). The range is shifted so that it doesn't have to be symmetric around zero. Examples: Q4_1, Q5_1, and much of the K-Quant family (Q2_K, Q4_K, and Q5_K use Type-1 with a more sophisticated structure, while Q3_K and Q6_K are Type-0).
Type-1 is generally more accurate for weights that aren't symmetric around zero — which is often the case in real neural network weights. But it costs an extra value (the minimum) to store per-block.
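A small experiment shows the difference. This is a simplified sketch of the two formulas above, not llama.cpp's actual quantization routines (which choose scales more carefully); the function names and the skewed example block are made up for illustration.

```python
def quantize_type0(block, bits):
    """Symmetric: weight ~ scale * q, with q in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return scale, q


def dequantize_type0(scale, q):
    return [scale * v for v in q]


def quantize_type1(block, bits):
    """Asymmetric: weight ~ scale * q + minimum, with q in [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [max(0, min(levels, round((w - lo) / scale))) for w in block]
    return scale, lo, q


def dequantize_type1(scale, minimum, q):
    return [scale * v + minimum for v in q]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# A block whose values are all positive, i.e. not centered on zero.
block = [0.10 + 0.01 * i for i in range(32)]

scale0, q0 = quantize_type0(block, bits=4)
err0 = max_abs_error(block, dequantize_type0(scale0, q0))

scale1, min1, q1 = quantize_type1(block, bits=4)
err1 = max_abs_error(block, dequantize_type1(scale1, min1, q1))

print(f"Type-0 max error: {err0:.4f}")
print(f"Type-1 max error: {err1:.4f}")
```

On this skewed block the symmetric scheme spends half of its integer range on negative values that never occur, so the asymmetric scheme ends up with a finer step and a smaller worst-case error.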
GGUF Quantization Families: Legacy, K-Quants, I-Quants
GGUF's power comes from three distinct quantization families, each optimized for different tradeoffs. Understanding which one to use is crucial for getting the best performance on your hardware.
1. Legacy Quants: The Simple Ones
These are the original GGUF quantization schemes — simple, fast, and widely supported.
Q4_0, Q5_0, and Q8_0
These use Type-0 symmetric quantization with a single scale per 32 weights. No zero point, no complexity. Just scale and quantized weights.
- Q4_0: 4 bits per weight, 32 weights per block. One 16-bit (FP16) scale factor per 32 weights.
- Q5_0: 5 bits per weight, 32 weights per block. One 16-bit (FP16) scale factor per 32 weights.
- Q8_0: 8 bits per weight, 32 weights per block, with the same FP16 scale. Near lossless: perplexity impact is only ~0.01 compared to full precision.
These are the fastest quantizations because the dequantization math is trivial: multiply the integer by the scale and you're done. No lookups, no complex bit operations. Perfect for CPU inference or when you need maximum speed.
Q4_1 and Q5_1
These use Type-1 asymmetric quantization. Each block has a scale and a minimum value. Slightly more accurate than Type-0 for the same bitwidth, but the extra offset value adds a tiny overhead.
2. K-Quants: The Smart Ones
K-Quants are GGUF's premium quantization family. They use a hierarchical structure that achieves significantly better quality at the same size compared to legacy quants.
The Super-Block Structure
Instead of a flat block of 32 weights, K-Quants organize weights into a super-block of 256 values. Within this super-block:
- The 256 weights are split into sub-blocks: 8 sub-blocks of 32 weights each in Q4_K and Q5_K (some other K-Quant types use 16 sub-blocks of 16)
- Each sub-block has its own scale factor, plus a minimum in the Type-1 (asymmetric) variants
- The sub-block scale factors themselves are quantized into a compact vector under a single super-block scale
This is double quantization: the scales themselves are compressed. This reduces the overhead of storing scale factors, allowing more bits for the actual weights at the same total size.
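The idea can be sketched in a few lines. This is an illustration of the principle, not the exact Q4_K bit layout: the 6-bit width, the eight example scales, and the function names are assumptions chosen for the demo.

```python
def quantize_scales(sub_scales, bits=6):
    """Quantize non-negative sub-block scales to small integers plus one shared super-scale."""
    levels = 2 ** bits - 1
    top = max(sub_scales)
    super_scale = top / levels if top > 0 else 1.0
    codes = [round(s / super_scale) for s in sub_scales]
    return super_scale, codes


def dequantize_scales(super_scale, codes):
    return [super_scale * c for c in codes]


# Hypothetical scales for the 8 sub-blocks of one super-block.
sub_scales = [0.011, 0.024, 0.018, 0.031, 0.009, 0.027, 0.015, 0.020]

super_scale, codes = quantize_scales(sub_scales)
restored = dequantize_scales(super_scale, codes)

# Storage for the scales alone: eight FP16 values versus
# eight 6-bit codes plus one FP16 super-scale.
plain_bits = len(sub_scales) * 16
packed_bits = len(sub_scales) * 6 + 16

print(f"codes: {codes}")
print(f"scale storage: {plain_bits} bits -> {packed_bits} bits")
```

The scales lose a little precision in the process, but the bits saved across every super-block in the model can be spent on the weights themselves.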
K-Quant Variants: _XS, _S, _M, _L
The suffixes refer to different mixes of quantization types across different layers of the model:
- _XS (Extra Small): More aggressive quantization on less critical layers
- _S (Small): Moderate mix
- _M (Medium): Balanced distribution (most common)
- _L (Large): More precision on important layers, less on others
This layer-aware approach works because some layers in a neural network are more sensitive to quantization than others. K-Quants identify which layers matter most (often through an importance matrix calibration, discussed below) and allocate more bits to them.
K-Quant Quantization Levels
- Q2_K: Very aggressive, roughly 2.6 to 3.4 bits per weight on average depending on the tensor mix
- Q3_K_M: Balanced, ~3.9 bits per weight average
- Q4_K_M: Excellent quality, ~4.8 bits per weight average
- Q5_K_M: High quality, ~5.7 bits per weight average
- Q6_K: Near lossless, ~6.6 bits per weight average
- Q8_K: Used internally by llama.cpp for quantizing intermediate results rather than as a format for distributed models; for essentially lossless weights, use Q8_0 (8.5 bits per weight)
3. I-Quants: The Experimental Ones
I-Quants (Importance Quantization) take a completely different approach, inspired by QuIP# research. They're powerful but come with caveats.
The Lookup Table Approach
In addition to block scales, I-Quants encode small groups of weights as indices into a pre-computed lookup table (codebook) of quantization vectors. During inference, dequantization therefore includes a table lookup, which can be slower than K-Quant dequantization on some backends.
Examples include IQ2_XXS, IQ3_S, and IQ4_XS. These achieve remarkable compression ratios (IQ2_XXS averages just over 2 bits per weight) while maintaining surprisingly good quality.
The Importance Matrix (imatrix)
Both K-Quants and I-Quants can optionally use an importance matrix (imatrix) to calibrate which weights matter most. The imatrix is computed by running a calibration dataset through the model, and it identifies which weights would hurt output quality most if they were quantized aggressively.
The workflow:
- Run a calibration dataset through the full-precision model
- Measure the sensitivity of each weight to quantization error
- Generate a matrix of importance scores
- Use this matrix during quantization to allocate more bits to important weights
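The workflow above can be sketched as follows. This is a deliberately simplified illustration, not llama.cpp's imatrix implementation: it assumes importance can be approximated by the mean squared activation seen by each weight column, and it uses importance only to pick a better scale for one block. All names and data are made up.

```python
def importance_from_activations(activations):
    """Mean squared activation per input column, used here as an importance proxy."""
    n_rows = len(activations)
    n_cols = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / n_rows for j in range(n_cols)]


def weighted_error(block, importance, scale, qmax):
    """Importance-weighted squared error after symmetric quantization with `scale`."""
    total = 0.0
    for w, imp in zip(block, importance):
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        total += imp * (w - scale * q) ** 2
    return total


def best_scale(block, importance, bits=4, n_candidates=64):
    """Search scales around the naive choice and keep the lowest weighted error."""
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(w) for w in block) / qmax  # naive scale, always a candidate
    candidates = [base] + [
        base * (0.7 + 0.6 * i / (n_candidates - 1)) for i in range(n_candidates)
    ]
    return min(candidates, key=lambda s: weighted_error(block, importance, s, qmax))


# Hypothetical calibration activations (2 samples, 4 input columns).
activations = [[1.0, 0.1, 0.1, 2.0], [0.8, 0.2, 0.0, 1.5]]
importance = importance_from_activations(activations)

block = [0.31, -0.12, 0.05, 0.27]
chosen = best_scale(block, importance)
print(f"importance: {importance}")
print(f"chosen scale: {chosen:.5f}")
```

The point of the sketch is the objective: errors on weights that see large activations cost more, so the search is allowed to trade accuracy on unimportant weights for accuracy on important ones.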
MLX Format: Apple's Native Approach
While GGUF was designed for cross-platform inference, MLX was built from the ground up for Apple Silicon's unique architecture. The two approaches reflect fundamentally different design philosophies.
The MLX Directory Structure
MLX models are not a single file — they're a directory containing multiple component files:
- config.json: Model architecture details (number of layers, heads, hidden size, etc.)
- model.safetensors or *.safetensors shards: The quantized weight matrices in SafeTensor format (a zero-copy, memory-mapped format designed for tensor data)
- tokenizer.json: Tokenizer vocabulary and special tokens configuration
- tokenizer_config.json: Chat template and tokenizer metadata
- quantization_config.json: Quantization parameters used during conversion
This structure is less portable than GGUF's single file, but it's optimized for how MLX loads and executes models on Apple hardware.
SafeTensors: The Storage Backend
MLX uses the SafeTensors format for weight storage. SafeTensors is designed for zero-copy loading: the tensor metadata is at the start of the file, allowing the reader to map tensors directly from disk into memory without intermediate copies.
This is particularly important for Apple Silicon because:
- Memory bandwidth is the primary bottleneck for LLM inference
- Zero-copy loading means the GPU can read weights directly from unified memory
- No CPU overhead copying data between different memory spaces
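To see why the layout lends itself to zero-copy loading, here is a minimal sketch of the container: an 8-byte little-endian header length, a JSON header with per-tensor dtype, shape, and byte offsets, then the raw tensor bytes. The helper names are illustrative, and the blob is built in memory; a real loader would memory-map the file instead.

```python
import json
import struct


def build_blob(tensors: dict) -> bytes:
    """Build a minimal SafeTensors-style blob from {name: (dtype, shape, raw_bytes)}."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob: bytes):
    """Return the parsed JSON header and the byte offset where tensor data starts."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    return header, 8 + header_len


def tensor_view(blob: bytes, header: dict, data_start: int, name: str) -> memoryview:
    """Slice one tensor's bytes without copying them."""
    begin, end = header[name]["data_offsets"]
    return memoryview(blob)[data_start + begin:data_start + end]


raw_a = struct.pack("<4e", 0.5, -1.0, 2.0, 0.25)  # four FP16 values
raw_b = bytes(range(8))                           # eight uint8 values
blob = build_blob({"a": ("F16", [2, 2], raw_a), "b": ("U8", [8], raw_b)})

header, data_start = read_header(blob)
view_b = tensor_view(blob, header, data_start, "b")
print(header["b"], bytes(view_b))
```

Because every tensor is located by offsets stored up front, a reader never has to scan or deserialize the weight data to find a particular tensor.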
Apple Silicon Optimization Philosophy
MLX treats unified memory as the primary design constraint. Unlike NVIDIA-style systems where GPU and CPU have separate memory pools connected by PCIe, Apple Silicon's unified memory is a single pool shared between CPU and GPU cores.
MLX's architecture:
- No memory copies: Models exist in one memory space; CPU and GPU access the same data
- Lazy evaluation: Computation graphs are built and optimized before execution (similar to JAX)
- Kernel fusion: Multiple operations are fused into single GPU kernels, reducing memory passes
- Hardware-tuned kernels: The compute kernels are written and optimized specifically for Apple's GPU architecture, not ported from CUDA or x86 code paths
mlx-lm: The Inference Engine
MLX's LLM inference is provided by the mlx-lm package. The core conversion command is:
mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct -q --upload-repo mlx-community/...
This converts a HuggingFace model (typically in SafeTensors format) into MLX format, optionally quantizing it at the same time. The output is a directory that can be served via mlx_lm.server or used directly with the Python API.
You rarely need to convert anything yourself: the mlx-community organization on HuggingFace hosts over 3,000 pre-converted, pre-quantized models. This includes most popular open models: Llama 3, Mistral, Qwen, Phi, Gemma, and more. Conversion and quantization can also be done locally, in seconds to minutes depending on model size, or you can just pull from the community repo.
MLX Quantization: Affine Group Quantization
MLX uses a different quantization approach than GGUF, optimized for Apple Silicon's memory architecture and compute characteristics.
The Affine Quantization Formula
MLX's quantization uses affine group quantization, where a group of weights shares a scale and bias:
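Quantize: q = round((w - bias) / scale)
Dequantize: w' = scale × q + bias
Where:
- w is the original weight value
- scale is a scalar for the group, chosen so the group's value range maps onto the available integer levels
- bias (comparable to a zero point) is the group's offset, added back at dequantization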
Group Size: 64 Weights per Group
MLX's default configuration is:
- Group size: 64 weights per quantization group (this can be adjusted in mlx.core.quantize)
- Bits: 4 bits per weight by default for standard 4-bit quantization
- Mode: "affine" (other modes: "mxfp4", "mxfp8", "nvfp4")
This means every 64 weights in a row share a single scale and bias value. This group size sits between GGUF's 32-weight blocks and its 256-weight super-blocks, which affects both quality and compression ratio.
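Here is a plain-Python sketch of group-wise affine quantization along one row, written to show the grouping rather than MLX's actual kernels. It assumes the min/max form, where each group's bias is its minimum and is added back at dequantization, and it assumes a 16-bit scale and a 16-bit bias per group when estimating overhead. The function names are illustrative.

```python
import math


def quantize_row(row, group_size=64, bits=4):
    """Quantize a row group by group; returns integer codes plus per-group scales and biases."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be divisible by group_size")
    levels = 2 ** bits - 1
    codes, scales, biases = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes.extend(min(levels, max(0, round((w - lo) / scale))) for w in group)
        scales.append(scale)
        biases.append(lo)
    return codes, scales, biases


def dequantize_row(codes, scales, biases, group_size=64):
    """Reconstruct approximate weights from codes and per-group parameters."""
    return [
        scales[i // group_size] * q + biases[i // group_size]
        for i, q in enumerate(codes)
    ]


def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Weight bits plus the per-group scale and bias amortized over the group."""
    return bits + (scale_bits + bias_bits) / group_size


row = [math.sin(0.1 * i) for i in range(128)]  # two groups of 64
codes, scales, biases = quantize_row(row)
restored = dequantize_row(codes, scales, biases)

max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(scales)}, max error: {max_err:.4f}")
print(f"effective bits/weight: {effective_bits_per_weight(4, 64)}")
```

Smaller groups track local weight ranges more closely but pay the scale and bias overhead more often; under the stated assumptions, halving the group size to 32 raises the effective cost from 4.5 to 5 bits per weight.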
Supported Quantization Modes
MLX supports multiple quantization modes:
- affine: Standard affine quantization — the default, good balance of speed and quality
- mxfp4: The Open Compute Project's Microscaling (MX) FP4 format, in which small blocks of 4-bit floating-point values share a scale
- mxfp8: The Microscaling FP8 format, better quality at higher bitwidth
- nvfp4: NVIDIA's FP4 format, similar to MXFP4 but with NVIDIA-specific tuning
The Quantization Call
Programmatic quantization in MLX:
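import mlx.core as mx
# Example weight matrix; the last dimension must be divisible by group_size
w = mx.random.normal((128, 256))
# Quantize: returns the packed weights plus per-group scales and biases
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4, mode="affine")
# Dequantize back to an approximation of the original weights
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4, mode="affine")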
The requirement that the last dimension be divisible by group_size means that some models may need padding. This is typically handled automatically during the mlx_lm.convert process.
Metal-Native Kernels
Most importantly, MLX's quantized kernels are written specifically for Apple GPU architecture. Unlike llama.cpp which was ported from CPU-first code to Metal, MLX's kernels are designed from day one to leverage Apple's memory hierarchy and compute units.
This results in:
- Lower kernel launch overhead
- Better utilization of Apple's memory bandwidth
- Matrix-multiplication kernels tuned for Apple's GPU architecture
- No CPU-side overhead for quantization/dequantization — all on GPU
Ecosystem: Model Availability and Community
Beyond the technical differences, the two formats have vastly different ecosystems, which directly impacts what models you can run and how quickly.
GGUF Ecosystem
Model availability: 40+ architectures supported, with new model releases getting GGUF conversions within 24 hours. Almost every open model has GGUF versions on HuggingFace.
Community size: The llama.cpp project has a large, mature community. The r/LocalLLaMA subreddit has hundreds of thousands of members sharing quantized models, benchmarks, and troubleshooting tips.
Cross-platform: GGUF runs on any hardware: NVIDIA GPU, AMD GPU, Apple Silicon, or pure CPU. This means the same model file works everywhere.
Tooling: GGUF is the backbone of multiple inference servers: llama.cpp, Ollama, LM Studio, llamafile, and more. This creates a rich ecosystem of tools and integrations.
MLX Ecosystem
Model availability: The mlx-community organization on HuggingFace has 3,000+ pre-converted models. This is a growing but smaller selection compared to GGUF.
Community: Active, Apple-backed community with regular updates. New model conversions typically appear within days of a model's release, not hours like GGUF.
Apple Silicon only: MLX models only run on Apple Silicon. This is a trade-off for the performance optimization — you get better speed, but only on one platform.
Conversion tooling: Converting a model locally is straightforward with mlx_lm.convert. If a model isn't in the mlx-community repo, you can convert and quantize it yourself in seconds.
Model Size and Format Comparison
| Model | Quantization (GGUF / MLX) | GGUF Size | MLX Size | Notes |
|---|---|---|---|---|
| Qwen2.5-72B | Q4_K_M / 4-bit | ~40 GB | ~35 GB | MLX is slightly smaller due to different compression |
| Llama3-70B | Q4_K_M / 4-bit | ~41 GB | ~36 GB | Similar compression ratio |
| Mixtral-8x7B | Q4_K_M / 4-bit | ~24 GB | ~21 GB | Both formats efficient on sparse models |
| Phi-3.5-mini | Q4_K_M / 4-bit | ~3 GB | ~2.5 GB | Small models have minimal format overhead |
| Qwen3-32B | Q4_K_M / 4-bit | ~19 GB | ~17 GB | Similar compression for mid-sized models |
MLX models tend to be somewhat smaller at the same nominal bit width (roughly 10 to 17% in the table above), largely because Q4_K_M keeps some tensors at higher precision than 4 bits. The real ecosystem difference is availability and hardware support, not file size.
Performance: When Each Format Wins
This is where the rubber meets the road. The choice between GGUF and MLX isn't just about file formats — it's about which one delivers better performance for your specific workload.
Benchmark Reality: M1 Max Mac Studio
Benchmarks on the same hardware (M1 Max Mac Studio) with the same model (Qwen3.5-35B-A3B) reveal a stark pattern:
| Engine | Quantization | Generation Speed | Prefill (1K tokens) | Total (1K in, 200 out) |
|---|---|---|---|---|
| MLX 4-bit | affine, 4-bit | 57 tok/s | 15–20s | ~19s |
| GGUF Q4_K_M | K-Quants | 29 tok/s | 3–5s | ~11s |
| oMLX | SSD KV cache | ~55 tok/s | ~1.7s @ 8K | ~5s |
The headline numbers tell a misleading story. MLX wins generation speed (57 tok/s vs 29 tok/s), but GGUF wins end-to-end on typical workloads because of its superior prefill latency.
Understanding Prefill vs. Generation
LLM inference has two distinct phases:
1. Prefill (Input Processing)
This is where the input prompt is processed in parallel. All tokens are computed simultaneously, which is compute-intensive but fast. The output is a KV cache (key-value pairs for attention) plus the first output token.
Prefill latency is the time from sending the prompt to receiving the first output token. This dominates user-perceived latency for anything longer than a few hundred tokens.
2. Generation (Decode Phase)
This is where output tokens are generated one at a time. Each new token requires loading the model weights from memory, computing the next token, and repeating. This is memory-bandwidth bound, not compute-bound.
Generation speed is measured in tokens per second (tok/s). This is what most benchmarks report, but it's the wrong metric for most real-world use cases.
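A back-of-the-envelope model makes the trade-off concrete. It plugs in the figures from the benchmark table above, taking the midpoints of the reported prefill ranges (17.5 s for MLX, 4 s for GGUF at a 1K-token prompt) as an assumption; the function names are illustrative.

```python
def end_to_end_seconds(prefill_s: float, output_tokens: int, tok_per_s: float) -> float:
    """Total latency: time to process the prompt plus time to decode the output."""
    return prefill_s + output_tokens / tok_per_s


def break_even_output_tokens(prefill_a, tps_a, prefill_b, tps_b) -> float:
    """Output length at which engine A (slow prefill, fast decode) ties with engine B."""
    return (prefill_a - prefill_b) / (1 / tps_b - 1 / tps_a)


# Assumed midpoints of the table's prefill ranges, for a 1K-token prompt.
MLX_PREFILL_S, MLX_TPS = 17.5, 57
GGUF_PREFILL_S, GGUF_TPS = 4.0, 29

mlx_total = end_to_end_seconds(MLX_PREFILL_S, 200, MLX_TPS)
gguf_total = end_to_end_seconds(GGUF_PREFILL_S, 200, GGUF_TPS)
crossover = break_even_output_tokens(MLX_PREFILL_S, MLX_TPS, GGUF_PREFILL_S, GGUF_TPS)

print(f"200-token reply: MLX {mlx_total:.1f}s, GGUF {gguf_total:.1f}s")
print(f"MLX only pulls ahead after about {crossover:.0f} output tokens")
```

Under these assumptions the faster decoder needs a reply of several hundred tokens just to pay back its slower prefill, which is why tok/s alone says little about how an interactive session feels.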
When MLX Wins
MLX is the better choice when:
- Short-context workloads: Prompts under 1K tokens where prefill latency is negligible
- Maximum generation throughput: Tasks where you're generating thousands of tokens from a short prompt (e.g., long-form content generation)
- Brief interaction patterns: Single queries with long outputs, not conversational loops
- M3/M4/M5 hardware: Newer Apple Silicon benefits more from MLX's Metal-native kernels
When GGUF Wins
GGUF is the better choice when:
- Long-context workloads: Prompts over 2K tokens where prefill latency dominates
- Agentic workloads: Tool use, JSON history, or system prompts of 1K+ tokens
- Conversational use: Interactive chat where time-to-first-token matters more than total throughput
- Cross-platform needs: If you need to run the same model on different hardware
The oMLX Bridge
There's a third option: oMLX (tiered KV cache with SSD persistence). It achieves MLX-level generation speeds (55 tok/s) while reducing prefill to near GGUF levels (1.7s at 8K context). This is experimental but worth considering for long-context, high-throughput workloads.
Decision Framework: Which to Use When
Here's a practical decision framework to help you choose the right format for your use case.
General-Purpose Use
For most workloads, especially if you're just starting out:
- Format: GGUF Q4_K_M
- Why: Best balance of speed, quality, and availability. Works on any hardware, has the widest model selection, and the prefill performance feels snappier for typical use cases.
- Tooling: Ollama, LM Studio, or llama.cpp directly
Agentic Workloads (Long Context)
For tool use, RAG systems, or any workload with system prompts over 1K tokens:
- Format: GGUF Q4_K_M or oMLX
- Why: The prefill latency matters more than generation speed here. In the benchmark above, a 1K-token prompt took 3-5 seconds to process with GGUF vs 15-20 seconds with MLX, which makes MLX feel unusable for this kind of workload even though its tok/s number is higher.
- Tooling: llama.cpp server or oMLX for best performance
Maximum Throughput (Short Context)
For batch processing or tasks where you need to generate thousands of tokens per query:
- Format: MLX 4-bit
- Why: The 57 tok/s generation speed will save significant total time when output length dominates, as in long-form content generation pipelines.
- Tooling: Direct mlx-lm server, not through a wrapper like Ollama
Model Availability Priority
If you need the newest models immediately or need obscure architectures:
- Format: GGUF
- Why: New model releases get GGUF conversions within hours. MLX conversions typically take days. The GGUF ecosystem also supports 40+ architectures vs MLX's more limited (though growing) selection.
What to Avoid
Practical Recommendation Summary
| Use Case | Recommended Format | Why |
|---|---|---|
| General-purpose, chat, exploration | GGUF Q4_K_M | Best balance, widest model availability, snappy prefill |
| Agentic, RAG, long system prompts | GGUF Q4_K_M | Prefill latency critical for usability |
| Long-form content generation | MLX 4-bit | Generation speed dominates total time |
| Brief queries with massive outputs | MLX 4-bit | High tok/s matters when generating thousands of tokens |
| Long-context, high-throughput | oMLX | MLX speeds with GGUF-like prefill latency |
| Cross-platform deployment | GGUF | Runs on NVIDIA, AMD, Apple, or CPU |
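The table can be condensed into a rough rule of thumb. The thresholds below are assumptions read off the guidance above (prompts of 1K+ tokens favor GGUF; outputs in the thousands of tokens favor MLX), the function name is hypothetical, and oMLX is left out because it is described as experimental.

```python
def recommend_format(prompt_tokens: int, output_tokens: int, apple_silicon_only: bool = True) -> str:
    """Rough heuristic mirroring the recommendation table; thresholds are assumptions."""
    if not apple_silicon_only:
        return "GGUF"  # the only option here that runs across platforms
    if prompt_tokens >= 1000:
        return "GGUF Q4_K_M"  # prefill latency dominates
    if output_tokens >= 1000:
        return "MLX 4-bit"  # generation speed dominates
    return "GGUF Q4_K_M"  # general-purpose default


print(recommend_format(prompt_tokens=4000, output_tokens=300))
print(recommend_format(prompt_tokens=200, output_tokens=3000))
print(recommend_format(prompt_tokens=200, output_tokens=3000, apple_silicon_only=False))
```

Treat it as a starting point and measure with your own prompts, since the crossover depends on your hardware and model.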
References
- GGUF Format Internals (Medium, Dec 2025): Comprehensive technical deep dive into GGUF structure, quantization families, and the math behind block-based quantization.
- r/LocalLLaMA, "GGUF Quant Methods" (Reddit): Community insights on K-Quants vs I-Quants, performance on Apple Silicon, and practical recommendations for Mac users.
- MLX Official Documentation (v0.31.1): Apple's official docs on MLX quantization, affine group quantization, and model conversion workflow.
- mlx-community on HuggingFace: 3,000+ pre-converted MLX models with detailed model cards and quantization information.
- famstack.dev, "57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon" (Mar 2026): Detailed real-world benchmark on M1 Max Mac Studio covering MLX prefill problem, K-Quants vs I-Quants performance, and oMLX tiered KV cache results.
- r/LocalLLaMA, "K-Quant Suffixes Explained" (Reddit): Explanation of _XS, _S, _M, _L suffixes and importance matrix usage for optimal quantization.