⚖️ GGUF vs MLX: A Deep Dive Into LLM Model Formats

The quantization math, format internals, performance tradeoffs, and community ecosystem — everything you need to decide which format to use on Apple Silicon.

March 17, 2026 · 18 min read


Why Model Format Is Not a Minor Detail

When you download a language model to run locally, you're choosing between two fundamentally different ways of storing and loading the model weights. That choice affects inference speed, memory usage, model availability, and how well the model uses your hardware.

The model format determines:

How quickly your model loads: A single binary file with optimized offsets vs. loading dozens of weight shards from disk
Whether your GPU actually helps: Some formats work better with Metal GPU acceleration, others are CPU-optimized
What hardware you can run on: GGUF runs on NVIDIA, AMD, Apple Silicon, or even bare CPU. MLX is Apple-only.
Model availability: Some formats get community-converted models within hours of a new release; others wait days or weeks
Quantization quality: The math behind how weights get compressed affects both speed and output quality

For a software engineer with a Mac Studio M3 Ultra and a need to run large language models locally, understanding these differences isn't academic — it's the difference between a smooth interactive experience and one that feels sluggish and broken.

ℹ️ What This Article Covers This deep dive examines two of the most important LLM formats in 2026: GGUF (the cross-platform format powering llama.cpp, Ollama, and LM Studio) and MLX (Apple's native format for unified memory architectures). We'll explore the quantization math behind each, the format internals, real-world performance tradeoffs, and a decision framework to help you choose the right tool for your workload.

GGUF Internals: How It Stores a Model

GGUF stands for "GPT-Generated Unified Format" (though it evolved from Georgi Gerganov's earlier GGML project). It was created specifically for the llama.cpp inference engine and has become the de facto standard for local LLM deployment.

The GGUF File Structure

A GGUF file is a single self-contained binary with four main components:

Header (24 bytes): Magic number GGUF + version number, which tells the reader which format specification to use, followed by the tensor count and the metadata key-value count
Metadata (key-value pairs): Flexible schema describing the model architecture, context length, tokenizer information, and quantization parameters
Tensor information: For each tensor: name, shape, and file offset indicating where that tensor's data begins in the file
Weight data (quantized): The actual model weights, grouped and aligned to 32-byte boundaries for optimal memory access

This block-based structure means the file is fully seekable — you can jump to any tensor without reading the whole file. This is crucial for loading large models efficiently, especially when you're running out of memory and need to selectively load layers.
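To make the layout concrete, here is a minimal sketch of reading the fixed-size header and aligning an offset. It assumes the little-endian GGUF v2/v3 layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata count); the helper names are hypothetical, and the sample bytes are synthetic rather than taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, uint32 version, uint64 tensor count, uint64 KV count


def read_gguf_header(data: bytes) -> dict:
    """Parse the 24-byte header at the start of a GGUF file (assumed v2/v3 layout)."""
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        raise ValueError("need at least 24 bytes")
    magic, version, n_tensors, n_kv = struct.unpack(HEADER_FORMAT, data[:size])
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


def align_offset(offset: int, alignment: int = 32) -> int:
    """Round an offset up to the next multiple of the alignment."""
    return (offset + alignment - 1) // alignment * alignment


# Synthetic header for illustration only
fake_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = read_gguf_header(fake_header)
```

With a real file you would pass the first 24 bytes read from disk. The metadata and tensor-info sections that follow are variable-length and need a fuller parser.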

💡 Single-File Simplicity Unlike MLX's directory structure, GGUF is one file. This makes it trivial to share, version control, and distribute. The entire model — including the tokenizer and architecture config — is in one package.

The Quantization Design Philosophy

GGUF's design centers on block-based quantization. Rather than quantizing all weights together, GGUF groups weights into blocks of either 32 or 256 values, with each block having its own scale factor. This allows the format to adapt its precision to local weight distributions — some parts of a neural network are more sensitive to quantization than others, and this design acknowledges that.

Every block has a scale factor and optionally a zero point. In the legacy formats these are stored as FP16 values; K-Quants additionally quantize their sub-block scales down to a few bits each. The actual quantized weights are stored as compact integers (2-bit, 3-bit, 4-bit, 5-bit, 6-bit, or 8-bit depending on the quantization level). The scale and zero point are stored per-block, not per-weight, which is the key compression mechanism.

This design choice — per-block quantization — is what gives GGUF its flexibility. Different quantization families (Legacy, K-Quants, I-Quants) apply different strategies to this basic block structure, trading off between quality, speed, and compression ratio.

The Quantization Math: From Float to 4-Bit

At its core, quantization is a simple mathematical transformation: compressing high-precision floating-point weights into lower-bit integers. But the details matter enormously for both speed and output quality.

The Basic Quantization Formula

The quantization and dequantization formulas are:

Quantize:     q = round((weight - zero_point) / scale)
Dequantize:   weight' = scale × q + zero_point

Where:

weight is the original 16-bit floating-point weight (FP16)
q is the quantized integer value (2 to 8 bits depending on the scheme)
scale is a floating-point scale factor (FP16 in GGUF's legacy formats) that maps the weight range to the integer range
zero_point is a bias that shifts the weight range to align with the integer encoding (used in asymmetric quantization)
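The two formulas can be sketched directly in Python. This is a minimal illustration for a single block, assuming the zero point is simply the block minimum and the scale spans the block's min-to-max range; the function names are hypothetical and no bit packing is done.

```python
def quantize_block(weights, bits=4):
    """q = round((weight - zero_point) / scale) for one block of weights."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1              # 15 steps for 4-bit codes 0..15
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = lo                       # assumption: offset is the block minimum
    q = [round((w - zero_point) / scale) for w in weights]
    return q, scale, zero_point


def dequantize_block(q, scale, zero_point):
    """weight' = scale * q + zero_point."""
    return [scale * v + zero_point for v in q]


weights = [-0.8, -0.1, 0.0, 0.25, 0.6, 1.2]
q, scale, zero_point = quantize_block(weights)
restored = dequantize_block(q, scale, zero_point)
```

Because each weight is rounded to the nearest step, the reconstruction error per weight is at most half a step, that is scale / 2.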

Per-Block vs. Per-Weight Scale

GGUF uses per-block scales, meaning each block of weights shares a single scale factor. This is different from per-weight quantization (where every weight has its own scale) or per-layer quantization (where an entire layer shares one scale).

For a typical block size of 32 weights:

FP16 storage: Each weight takes 2 bytes (16 bits) — total for 32 weights = 64 bytes
Q4_0 storage: Each weight takes 4 bits = 0.5 bytes. 32 weights = 16 bytes. Plus a 2-byte FP16 scale factor = 18 bytes total (4.5 bits per weight). This is a 3.56:1 compression ratio for this block.
Q8_0 storage: Each weight takes 8 bits = 1 byte. 32 weights = 32 bytes. Plus a 2-byte FP16 scale = 34 bytes total (8.5 bits per weight). 1.88:1 compression.
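The per-block arithmetic is easy to check in code. The sketch below assumes a 2-byte FP16 scale per block, which is what llama.cpp's Q4_0 and Q8_0 block layouts use; pass a different scale_bytes to explore other layouts. The helper names are hypothetical.

```python
def block_bytes(n_weights: int, bits: int, scale_bytes: int = 2, min_bytes: int = 0) -> int:
    """Bytes for one block: packed integer codes plus the per-block scale and optional minimum."""
    packed = n_weights * bits // 8
    return packed + scale_bytes + min_bytes


def bits_per_weight(n_weights: int, bits: int, scale_bytes: int = 2, min_bytes: int = 0) -> float:
    """Effective storage cost per weight, including block metadata."""
    return 8 * block_bytes(n_weights, bits, scale_bytes, min_bytes) / n_weights


def compression_vs_fp16(n_weights: int, bits: int, scale_bytes: int = 2, min_bytes: int = 0) -> float:
    """Ratio of FP16 storage (2 bytes per weight) to quantized block storage."""
    return (2 * n_weights) / block_bytes(n_weights, bits, scale_bytes, min_bytes)


q4_0 = block_bytes(32, 4)                 # 16 packed bytes + 2-byte scale
q8_0 = block_bytes(32, 8)                 # 32 packed bytes + 2-byte scale
q4_1 = block_bytes(32, 4, min_bytes=2)    # Type-1 adds a 2-byte minimum
```

The same helper shows why block metadata matters: at 4 bits, a 2-byte scale per 32 weights adds half a bit per weight on top of the nominal width.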

⚖️ The Tradeoff Per-block quantization is a compromise. It's more accurate than per-layer quantization because it adapts to local weight distributions. It's more compact than per-weight quantization because it only stores one scale per 32 or 256 weights. This is why GGUF's quantization families all use blocks — it's the sweet spot between quality and compression.

Type-0 vs. Type-1 Quantization

GGUF supports two quantization types, distinguished by how they handle the zero point:

Type-0 (Symmetric Quantization)

weight = scale × q

This is the simplest form: no zero point, just a scale factor. Weights are centered around zero and scaled to fit the integer range. Examples: Q4_0, Q5_0, Q8_0. This is symmetric because the quantization range is symmetric around zero.

Type-1 (Asymmetric Quantization)

weight = scale × q + minimum

This includes a minimum offset (similar to zero_point but expressed differently). The range is shifted so that it doesn't have to be symmetric around zero. Examples: Q4_1 and Q5_1. Within the K-Quant family, Q2_K, Q4_K, and Q5_K use Type-1 with a more sophisticated structure, while Q3_K and Q6_K use Type-0.

Type-1 is generally more accurate for weights that aren't symmetric around zero — which is often the case in real neural network weights. But it costs an extra value (the minimum) to store per-block.
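A small experiment shows why the extra minimum can pay off. The sketch below is a simplified model of the two types (it is not llama.cpp's actual rounding or packing code, and the function names are hypothetical): it quantizes one block whose values all sit well above zero and compares the reconstruction error.

```python
import math


def roundtrip_type0(block, bits=4):
    """Symmetric: weight ~ scale * q, with q in a signed range centered on zero."""
    qmax = (1 << (bits - 1)) - 1                      # 7 for 4-bit
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return [scale * v for v in q]


def roundtrip_type1(block, bits=4):
    """Asymmetric: weight ~ scale * q + minimum, with q in an unsigned range."""
    levels = (1 << bits) - 1                          # 15 for 4-bit
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in block]
    return [scale * v + lo for v in q]


def rmse(original, restored):
    """Root-mean-square reconstruction error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original))


# A block whose values are all positive, so a zero-centered range wastes codes
skewed = [0.50 + 0.01 * i for i in range(32)]
err_type0 = rmse(skewed, roundtrip_type0(skewed))
err_type1 = rmse(skewed, roundtrip_type1(skewed))
```

For this skewed block the Type-1 step size is several times smaller than the Type-0 step, because all 16 codes cover only the range the weights actually occupy. On a block centered on zero the two land much closer together.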

GGUF Quantization Families: Legacy, K-Quants, I-Quants

GGUF's power comes from three distinct quantization families, each optimized for different tradeoffs. Understanding which one to use is crucial for getting the best performance on your hardware.

1. Legacy Quants: The Simple Ones

These are the original GGUF quantization schemes — simple, fast, and widely supported.

Q4_0, Q5_0, and Q8_0

These use Type-0 symmetric quantization with a single scale per 32 weights. No zero point, no complexity. Just scale and quantized weights.

Q4_0: 4 bits per weight, 32 weights per block. One FP16 (16-bit) scale factor per 32 weights.
Q5_0: 5 bits per weight, 32 weights per block. One FP16 (16-bit) scale factor per 32 weights.
Q8_0: 8 bits per weight, 32 weights per block. Near lossless — perplexity impact is only ~0.01 compared to full precision.

These are the fastest quantizations because the dequantization math is trivial: multiply the integer by the scale and you're done. No lookups, no complex bit operations. Perfect for CPU inference or when you need maximum speed.

Q4_1 and Q5_1

These use Type-1 asymmetric quantization. Each block has a scale and a minimum value. Slightly more accurate than Type-0 for the same bitwidth, but the extra offset value adds a tiny overhead.

⚠️ Legacy Quant Caveat While Q8_0 is near lossless, the lower legacy quants (Q4_0, Q5_0) can lose noticeable quality compared to more sophisticated schemes. They're fast, but if you want good quality at low bitwidths, K-Quants are better.

2. K-Quants: The Smart Ones

K-Quants are GGUF's premium quantization family. They use a hierarchical structure that achieves significantly better quality at the same size compared to legacy quants.

The Super-Block Structure

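Instead of a flat block of 32 weights, K-Quants organize weights into a super-block of 256 values. Within this super-block (shown here for Q4_K and Q5_K; Q2_K, Q3_K, and Q6_K use 16 sub-blocks of 16 weights):

The 256 weights are split into 8 sub-blocks of 32 weights each
Each sub-block has its own scale factor and minimum (Type-1 asymmetric)
The 8 scale factors and minimums are themselves quantized to 6 bits each, against a pair of FP16 values stored once per super-block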

This is double quantization: the scales themselves are compressed. This reduces the overhead of storing scale factors, allowing more bits for the actual weights at the same total size.
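The idea can be sketched in a few lines. This is a simplified model of a Q4_K-style super-block, assuming 8 sub-blocks of 32 weights, 4-bit weight codes, and 6-bit sub-block scales and offsets measured against two super-block floats; it does no bit packing or scale search, and the function names are hypothetical.

```python
import random

SUB_BLOCKS, SUB_SIZE = 8, 32              # 8 x 32 = one 256-weight super-block


def quantize_super_block(weights):
    """Return (d, dmin, 6-bit scales, 6-bit offsets, 4-bit codes) for 256 weights."""
    assert len(weights) == SUB_BLOCKS * SUB_SIZE
    subs = [weights[i * SUB_SIZE:(i + 1) * SUB_SIZE] for i in range(SUB_BLOCKS)]
    # Step 1: float scale and non-negative offset per sub-block (codes 0..15)
    offsets = [max(0.0, -min(s)) for s in subs]
    scales = [(max(s) + off) / 15 for s, off in zip(subs, offsets)]
    # Step 2: quantize those 8 scales and 8 offsets to 6-bit integers (0..63)
    d = max(scales) / 63 or 1.0
    dmin = max(offsets) / 63 or 1.0
    sc_q = [round(s / d) for s in scales]
    off_q = [round(o / dmin) for o in offsets]
    # Step 3: quantize the weights against the reconstructed scale and offset
    codes = []
    for sub, sq, oq in zip(subs, sc_q, off_q):
        eff_scale, eff_off = d * sq, dmin * oq
        codes.append([
            max(0, min(15, round((w + eff_off) / eff_scale))) if eff_scale > 0 else 0
            for w in sub
        ])
    return d, dmin, sc_q, off_q, codes


def dequantize_super_block(d, dmin, sc_q, off_q, codes):
    """weight ~ d * scale_code * q - dmin * offset_code."""
    out = []
    for sq, oq, sub in zip(sc_q, off_q, codes):
        out.extend(d * sq * q - dmin * oq for q in sub)
    return out


def super_block_bytes():
    """128 bytes of 4-bit codes + 12 bytes of 6-bit scales/offsets + two FP16 values."""
    return 256 * 4 // 8 + SUB_BLOCKS * (6 + 6) // 8 + 2 + 2


rng = random.Random(0)
sample = [rng.gauss(0.0, 0.05) for _ in range(256)]
restored = dequantize_super_block(*quantize_super_block(sample))
```

Under these assumptions a super-block costs 144 bytes for 256 weights, or 4.5 bits per weight. Storing an FP16 scale and minimum for every 32-weight sub-block instead would cost 5 bits per weight.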

K-Quant Variants: _XS, _S, _M, _L

The suffixes refer to different mixes of quantization types across different layers of the model:

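_XS (Extra Small): More aggressive quantization on less critical layers (in current llama.cpp releases this suffix appears on I-Quants such as IQ4_XS rather than on K-Quants)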
_S (Small): Moderate mix
_M (Medium): Balanced distribution (most common)
_L (Large): More precision on important layers, less on others

This layer-aware approach works because some layers in a neural network are more sensitive to quantization than others. K-Quants identify which layers matter most (often through an importance matrix calibration, discussed below) and allocate more bits to them.

💡 The Sweet Spot: Q4_K_M For most workloads on Apple Silicon, Q4_K_M is the optimal choice. It gives excellent quality at 4 bits per weight, with the hierarchical structure meaning the scale factor overhead is minimal. This is the recommended quantization for general-purpose use.

K-Quant Quantization Levels

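Q2_K: Very aggressive, ~2.6 bits per weight
Q3_K_M: Balanced, ~3.4 bits per weight in its base Q3_K tensors
Q4_K_M: Excellent quality, ~4.5 bits per weight in its base Q4_K tensors
Q5_K_M: High quality, ~5.5 bits per weight in its base Q5_K tensors
Q6_K: Near lossless, ~6.6 bits per weight
Q8_K: An 8-bit type that llama.cpp uses internally for intermediate results rather than as a download format; use Q8_0 for 8-bit weights

The _M mixes keep some tensors at a higher-precision type, so whole-file averages land somewhat above these base figures.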

3. I-Quants: The Experimental Ones

I-Quants (Importance Quantization) take a completely different approach, inspired by QuIP# research. They're powerful but come with caveats.

The Lookup Table Approach

Instead of storing a scale factor per block, I-Quants use a pre-computed lookup table of optimal quantization vectors. During inference, the dequantization process becomes a lookup operation — faster in some cases, but more memory-intensive.

Examples include IQ2_XXS, IQ3_S, and IQ4_XS. These achieve remarkable compression ratios (IQ2_XXS quantizes to about 2.06 bits per weight) while maintaining surprisingly good quality.
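To illustrate the lookup idea, here is a toy sketch in which each group of four weights is stored as a single index into a small codebook plus one shared scale. The 8-entry codebook is hypothetical and far smaller than the grids real I-Quants use; it only shows why dequantization turns into a table read.

```python
# Hypothetical 8-entry codebook of 4-value sign patterns (real I-Quant grids are much larger)
CODEBOOK = [
    (-1, -1, -1, -1), (-1, -1, 1, 1), (-1, 1, -1, 1), (-1, 1, 1, -1),
    (1, -1, -1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, 1, 1, 1),
]
GROUP = 4


def nearest_entry(group, scale):
    """Index of the codebook vector closest to the group once scaled."""
    def distance(idx):
        return sum((w - scale * c) ** 2 for w, c in zip(group, CODEBOOK[idx]))
    return min(range(len(CODEBOOK)), key=distance)


def quantize_lut(weights):
    """Store one shared scale plus one codebook index per group of four weights."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    indices = [nearest_entry(weights[i:i + GROUP], scale)
               for i in range(0, len(weights), GROUP)]
    return scale, indices


def dequantize_lut(scale, indices):
    """Dequantization is a table lookup followed by a multiply."""
    out = []
    for idx in indices:
        out.extend(scale * c for c in CODEBOOK[idx])
    return out


weights = [0.2, 0.3, -0.25, -0.2, -0.3, 0.2, -0.2, 0.25]
scale, indices = quantize_lut(weights)
restored = dequantize_lut(scale, indices)
```

Each index here needs 3 bits and covers four weights, which is how codebook schemes get below one stored code per weight. The cost is the extra memory access into the table on every group.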

⚠️ I-Quants on Apple Silicon: Avoid I-Quants require extra memory access for the lookup tables. On Apple Silicon and low-compute hardware, this makes them 50% slower than K-Quants despite the similar compression ratios. The lookup table dequantization is CPU-bound and doesn't leverage Apple GPU efficiently. For Mac users, stick with K-Quants.

The Importance Matrix (imatrix)

Both K-Quants and I-Quants can optionally use an importance matrix (imatrix) to calibrate which weights matter most. The imatrix is computed by running a calibration dataset through the model, and it identifies which weights would hurt output quality most if quantized aggressively.

The workflow:

Run a calibration dataset through the full-precision model
Measure the sensitivity of each weight to quantization error
Generate a matrix of importance scores
Use this matrix during quantization to allocate more bits to important weights
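One way to picture the last step is as a weighted error objective. The sketch below is a simplified illustration, not llama.cpp's actual imatrix algorithm: it assumes each weight already has an importance score and picks the block scale that minimizes the importance-weighted squared error. The function names and the candidate range are hypothetical.

```python
def weighted_error(block, importance, scale, qmax=7):
    """Importance-weighted squared error for symmetric quantization at a given scale."""
    total = 0.0
    for w, imp in zip(block, importance):
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        total += imp * (w - scale * q) ** 2
    return total


def best_scale(block, importance, qmax=7, steps=32):
    """Try scales around the naive max-based choice and keep the best one."""
    base = max(abs(w) for w in block) / qmax
    # Candidates span 0.7x to 1.3x of the naive scale, plus the naive scale itself
    candidates = [base] + [base * (0.7 + 0.6 * i / (steps - 1)) for i in range(steps)]
    return min(candidates, key=lambda s: weighted_error(block, importance, s, qmax))


block = [0.9, -0.05, 0.4, -0.7, 0.12, 0.33, -0.21, 0.02]
importance = [0.1, 5.0, 0.2, 0.1, 4.0, 0.3, 3.0, 6.0]   # made-up scores for illustration
naive = max(abs(w) for w in block) / 7
chosen = best_scale(block, importance)
```

When the important weights are the small ones, as in this made-up block, a scale below the naive one can lower the weighted error, at the price of clipping a large but unimportant weight.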

💡 imatrix = Free Quality Boost The importance matrix works with all quantization types, not just I-Quants. It's a one-time calibration step that improves quality for free — the quantized model automatically allocates more bits to important layers based on the imatrix data.

MLX Format: Apple's Native Approach

While GGUF was designed for cross-platform inference, MLX was built from the ground up for Apple Silicon's unique architecture. The two approaches reflect fundamentally different design philosophies.

The MLX Directory Structure

MLX models are not a single file — they're a directory containing multiple component files:

config.json: Model architecture details (number of layers, heads, hidden size, etc.)
model.safetensors or *.safetensors shards: The quantized weight matrices in SafeTensor format (a zero-copy, memory-mapped format designed for tensor data)
tokenizer.json: Tokenizer vocabulary and special tokens configuration
tokenizer_config.json: Chat template and tokenizer metadata
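quantization entry in config.json: Quantization parameters used during conversion (group size and bits); mlx-lm conversions typically record these inside config.json rather than in a separate quantization_config.json file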

This structure is less portable than GGUF's single file, but it's optimized for how MLX loads and executes models on Apple hardware.

SafeTensors: The Storage Backend

MLX uses the SafeTensors format for weight storage. SafeTensors is designed for zero-copy loading: the tensor metadata is at the start of the file, allowing the reader to map tensors directly from disk into memory without intermediate copies.

This is particularly important for Apple Silicon because:

Memory bandwidth is the primary bottleneck for LLM inference
Zero-copy loading means the GPU can read weights directly from unified memory
No CPU overhead copying data between different memory spaces
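The "metadata at the start" layout is simple enough to parse by hand. The sketch below assumes the published SafeTensors layout: an 8-byte little-endian header length, then a JSON header whose data_offsets are relative to the start of the data section. The parser name is hypothetical, and the file bytes are synthetic.

```python
import json
import struct


def parse_safetensors(data: bytes) -> dict:
    """Map tensor names to dtype, shape, and absolute byte range within the file."""
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8:8 + header_len].decode("utf-8"))
    data_start = 8 + header_len
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":        # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        tensors[name] = {
            "dtype": info["dtype"],
            "shape": info["shape"],
            "byte_range": (data_start + begin, data_start + end),
        }
    return tensors


# Synthetic file: one 2x4 FP16 tensor (16 bytes of zeros), for illustration only
header_json = json.dumps(
    {"layer.weight": {"dtype": "F16", "shape": [2, 4], "data_offsets": [0, 16]}}
).encode("utf-8")
blob = struct.pack("<Q", len(header_json)) + header_json + bytes(16)
tensors = parse_safetensors(blob)
```

Because every tensor's byte range is known after reading only the header, a loader can memory-map the file and hand out views of the data section without copying it.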

Apple Silicon Optimization Philosophy

MLX treats unified memory as the primary design constraint. Unlike NVIDIA-style systems where GPU and CPU have separate memory pools connected by PCIe, Apple Silicon's unified memory is a single pool shared between CPU and GPU cores.

MLX's architecture:

No memory copies: Models exist in one memory space; CPU and GPU access the same data
Lazy evaluation: Computation graphs are built and optimized before execution (similar to JAX)
Kernel fusion: Multiple operations are fused into single GPU kernels, reducing memory passes
Hardware-tuned kernels: The compute kernels are written and optimized specifically for Apple's GPU architecture, not ported from CUDA or x86 code paths

mlx-lm: The Inference Engine

MLX's LLM inference is provided by the mlx-lm package. The core conversion command is:

mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct -q --upload-repo mlx-community/...

This converts a HuggingFace model (typically in SafeTensors format) into MLX format, optionally quantizing it at the same time. The output is a directory that can be served via mlx_lm.server or used directly with the Python API.

ℹ️ The mlx-community Ecosystem Apple and the community maintain an mlx-community organization on HuggingFace with over 3,000 pre-converted, pre-quantized models. This includes most popular open models: Llama 3, Mistral, Qwen, Phi, Gemma, and more. Conversion and quantization can be done locally, typically in minutes for larger models, or you can just pull from the community repo.

MLX Quantization: Affine Group Quantization

MLX uses a different quantization approach than GGUF, optimized for Apple Silicon's memory architecture and compute characteristics.

The Affine Quantization Formula

MLX's quantization uses affine group quantization, where a group of weights shares a scale and bias:

Quantize:     q = round((w - bias) / scale)
Dequantize:   w' = scale × q + bias

Where:

w is the original weight value
scale is a scalar for the group
bias (or zero_point) is a shift value

Group Size: 64 Weights per Group

MLX's default configuration is:

Group size: 64 weights per quantization group (this can be adjusted in mlx.core.quantize)
Bits: 4 bits per weight by default for standard 4-bit quantization
Mode: "affine" (other modes: "mxfp4", "mxfp8", "nvfp4")

This means every 64 weights in a row share a single scale and bias value. This group size sits between GGUF's 32-weight blocks and its 256-weight super-blocks, which affects both quality and compression ratio.
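The grouping and its metadata cost can be sketched in plain Python. This follows the min/max affine scheme described in the MLX documentation (scale from the group's range, bias equal to the group minimum), but the function names are hypothetical and the 16-bit scale and bias width is an assumption.

```python
import math


def affine_group_quantize(row, group_size=64, bits=4):
    """Quantize one row in groups; each group gets its own scale and bias."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be divisible by group_size")
    levels = (1 << bits) - 1
    codes, scales, biases = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0
        codes.extend(round((w - lo) / scale) for w in group)
        scales.append(scale)
        biases.append(lo)
    return codes, scales, biases


def effective_bits_per_weight(bits=4, group_size=64, meta_bits=16):
    """Nominal bits plus the per-group scale and bias spread over the group."""
    return bits + 2 * meta_bits / group_size


row = [math.sin(i / 10) for i in range(128)]
codes, scales, biases = affine_group_quantize(row)
```

Under the 16-bit assumption, 4-bit weights in groups of 64 cost 4.5 bits per weight, while groups of 32 would cost 5.0. That is the metadata trade-off behind the choice of group size.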

⚖️ Affine vs. GGUF's Block Quantization GGUF's 32-weight blocks provide finer-grained adaptation to local weight distributions. MLX's 64-weight groups are larger, trading some accuracy for reduced metadata overhead and better alignment with Apple's memory access patterns.

Supported Quantization Modes

MLX supports multiple quantization modes:

affine: Standard affine quantization — the default, good balance of speed and quality
mxfp4: Microscaling (MX) FP4, an Open Compute Project format in which small blocks of 4-bit floats share one scale
mxfp8: Microscaling FP8, better quality at higher bitwidth
nvfp4: NVIDIA's FP4 format, similar to MXFP4 but with NVIDIA-specific tuning

The Quantization Call

Programmatic quantization in MLX:

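import mlx.core as mx

# Example weight matrix; the last dimension must be divisible by group_size
w = mx.random.normal((512, 512))

# Quantize: affine mode returns packed weights plus per-group scales and biases
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4, mode="affine")

# Dequantize back to an approximation of the original weights
dequantized = mx.dequantize(w_q, scales, biases, group_size=64, bits=4, mode="affine")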

The requirement that the last dimension be divisible by group_size means that some models may need padding. This is typically handled automatically during the mlx_lm.convert process.

Metal-Native Kernels

Most importantly, MLX's quantized kernels are written specifically for Apple GPU architecture. Unlike llama.cpp which was ported from CPU-first code to Metal, MLX's kernels are designed from day one to leverage Apple's memory hierarchy and compute units.

This results in:

Lower kernel launch overhead
Better utilization of Apple's memory bandwidth
Matrix-multiply kernels tuned for Apple's GPU architecture
No CPU-side overhead for quantization/dequantization — all on GPU

💡 MLX 4-Bit: The Quality Sweet Spot MLX's 4-bit affine quantization produces quality comparable to GGUF's Q4_K_M, but with slightly better generation throughput on Apple Silicon due to the native Metal kernels.

Ecosystem: Model Availability and Community

Beyond the technical differences, the two formats have vastly different ecosystems, which directly impacts what models you can run and how quickly.

GGUF Ecosystem

Model availability: 40+ architectures supported, with new model releases getting GGUF conversions within 24 hours. Almost every open model has GGUF versions on HuggingFace.

Community size: The llama.cpp project has a large, mature community. The r/LocalLLaMA subreddit has hundreds of thousands of members sharing quantized models, benchmarks, and troubleshooting tips.

Cross-platform: GGUF runs on any hardware: NVIDIA GPU, AMD GPU, Apple Silicon, or pure CPU. This means the same model file works everywhere.

Tooling: GGUF is the backbone of multiple inference servers: llama.cpp, Ollama, LM Studio, llamafile, and more. This creates a rich ecosystem of tools and integrations.

📦 GGUF Model Availability Search HuggingFace for "GGUF" and you'll find every popular model: Llama 3, Llama 3.1, Llama 3.2, Llama 3.3, Llama 4, Llama 4.1, Mistral, Mixtral, Qwen2, Qwen3, Qwen3.5, Gemma, Gemma2, Phi, Phi-3, Phi-4, Falcon, and more. Each model typically has multiple quantization levels from Q2_K to Q8_0.

MLX Ecosystem

Model availability: The mlx-community organization on HuggingFace has 3,000+ pre-converted models. This is a growing but smaller selection compared to GGUF.

Community: Active, Apple-backed community with regular updates. New model conversions typically appear within days of a model's release, not hours like GGUF.

Apple Silicon only: MLX models only run on Apple Silicon. This is a trade-off for the performance optimization — you get better speed, but only on one platform.

Conversion tooling: Converting a model locally is straightforward with mlx_lm.convert. If a model isn't in the mlx-community repo, you can convert and quantize it yourself, usually in a matter of minutes.

Model Size and Format Comparison

Model	Quantization	GGUF Size	MLX Size	Notes
Qwen2.5-72B	Q4_K_M / 4-bit	~40 GB (GGUF)	~35 GB (MLX)	MLX is slightly smaller due to different compression
Llama3-70B	Q4_K_M / 4-bit	~41 GB (GGUF)	~36 GB (MLX)	Similar compression ratio
Mixtral-8x7B	Q4_K_M / 4-bit	~24 GB (GGUF)	~21 GB (MLX)	Both formats efficient on sparse mixture-of-experts models
Phi-3.5-mini	Q4_K_M / 4-bit	~3 GB (GGUF)	~2.5 GB (MLX)	Small models have minimal format overhead
Qwen3-32B	Q4_K_M / 4-bit	~19 GB (GGUF)	~17 GB (MLX)	Similar compression for mid-sized models

MLX models tend to be somewhat smaller for the same nominal bit width, but the difference is modest (roughly 10-15% in the table above). The real ecosystem difference is availability and hardware support, not file size.

Performance: When Each Format Wins

This is where the rubber meets the road. The choice between GGUF and MLX isn't just about file formats — it's about which one delivers better performance for your specific workload.

Benchmark Reality: M1 Max Mac Studio

Benchmarks on the same hardware (M1 Max Mac Studio) with the same model (Qwen3.5-35B-A3B) reveal a stark pattern:

Engine	Quantization	Generation Speed	Prefill (1K tokens)	Total (1K in, 200 out)
MLX 4-bit	affine, 4-bit	57 tok/s	15–20s	~19s
GGUF Q4_K_M	K-Quants	29 tok/s	3–5s	~11s
oMLX	SSD KV cache	~55 tok/s	~1.7s @ 8K	~5s

The headline numbers tell a misleading story. MLX wins generation speed (57 tok/s vs 29 tok/s), but GGUF wins end-to-end on typical workloads because of its superior prefill latency.

Understanding Prefill vs. Generation

LLM inference has two distinct phases:

1. Prefill (Input Processing)

This is where the input prompt is processed in parallel. All tokens are computed simultaneously, which is compute-intensive but fast. The output is a KV cache (key-value pairs for attention) plus the first output token.

Prefill latency is the time from sending the prompt to receiving the first output token. This dominates user-perceived latency for anything longer than a few hundred tokens.

2. Generation (Decode Phase)

This is where output tokens are generated one at a time. Each new token requires loading the model weights from memory, computing the next token, and repeating. This is memory-bandwidth bound, not compute-bound.

Generation speed is measured in tokens per second (tok/s). This is what most benchmarks report, but it's the wrong metric for most real-world use cases.
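A two-term latency model makes the trade-off concrete. The sketch below plugs in the benchmark figures from the table above, using the lower end of each prefill range as an assumption; the function names are hypothetical.

```python
def total_latency(prefill_s: float, tok_per_s: float, output_tokens: int) -> float:
    """End-to-end seconds: prefill time plus decode time for the requested output."""
    return prefill_s + output_tokens / tok_per_s


def break_even_tokens(prefill_a: float, rate_a: float, prefill_b: float, rate_b: float) -> float:
    """Output length at which engine A (slower prefill, faster decode) catches engine B."""
    return (prefill_a - prefill_b) / (1 / rate_b - 1 / rate_a)


# Figures from the benchmark table; prefill uses the low end of each range (assumption)
MLX_PREFILL_S, MLX_TOK_S = 15.0, 57.0
GGUF_PREFILL_S, GGUF_TOK_S = 3.0, 29.0

mlx_200 = total_latency(MLX_PREFILL_S, MLX_TOK_S, 200)
gguf_200 = total_latency(GGUF_PREFILL_S, GGUF_TOK_S, 200)
crossover = break_even_tokens(MLX_PREFILL_S, MLX_TOK_S, GGUF_PREFILL_S, GGUF_TOK_S)
```

With these inputs, a 200-token reply takes about 18.5 s on MLX and about 9.9 s on GGUF, and MLX only pulls ahead once the output passes roughly 700 tokens. Slower prefill on longer prompts pushes that crossover further out.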

⚠️ The Prefill Problem with MLX MLX's prefill latency grows badly with context length. For prompts over 1K tokens, MLX can take 15-20 seconds before showing any output, while GGUF delivers the first token in 3-5 seconds. For agentic workloads with long system prompts, this makes MLX feel broken despite its higher tok/s number.

When MLX Wins

MLX is the better choice when:

Short-context workloads: Prompts under 1K tokens where prefill latency is negligible
Maximum generation throughput: Tasks where you're generating thousands of tokens (e.g., long-form content generation, summarization)
Brief interaction patterns: Single queries with long outputs, not conversational loops
M3/M4/M5 hardware: Newer Apple Silicon benefits more from MLX's Metal-native kernels

When GGUF Wins

GGUF is the better choice when:

Long-context workloads: Prompts over 2K tokens where prefill latency dominates
Agentic workloads: Tool use, JSON history, or system prompts of 1K+ tokens
Conversational use: Interactive chat where time-to-first-token matters more than total throughput
Cross-platform needs: If you need to run the same model on different hardware

The oMLX Bridge

There's a third option: oMLX (tiered KV cache with SSD persistence). It achieves MLX-level generation speeds (55 tok/s) while reducing prefill to near GGUF levels (1.7s at 8K context). This is experimental but worth considering for long-context, high-throughput workloads.

Decision Framework: Which to Use When

Here's a practical decision framework to help you choose the right format for your use case.

General-Purpose Use

For most workloads, especially if you're just starting out:

Format: GGUF Q4_K_M
Why: Best balance of speed, quality, and availability. Works on any hardware, has the widest model selection, and the prefill performance feels snappier for typical use cases.
Tooling: Ollama, LM Studio, or llama.cpp directly

Agentic Workloads (Long Context)

For tool use, RAG systems, or any workload with system prompts over 1K tokens:

Format: GGUF Q4_K_M or oMLX
Why: The prefill latency matters more than generation speed here. In the benchmark above, a 1K-token prompt already takes 3-5 seconds to prefill on GGUF vs 15-20 seconds on MLX, which makes MLX feel unusable even though its tok/s number is higher.
Tooling: llama.cpp server or oMLX for best performance

Maximum Throughput (Short Context)

For batch processing or tasks where you need to generate thousands of tokens per query:

Format: MLX 4-bit
Why: The 57 tok/s generation speed will save significant total time, especially for tasks like long-form content generation or summarization pipelines.
Tooling: Direct mlx-lm server, not through a wrapper like Ollama

Model Availability Priority

If you need the newest models immediately or need obscure architectures:

Format: GGUF
Why: New model releases get GGUF conversions within hours. MLX conversions typically take days. The GGUF ecosystem also supports 40+ architectures vs MLX's more limited (though growing) selection.

What to Avoid

⚠️ Don't Use I-Quants on Apple Silicon I-Quants use lookup tables for dequantization, which makes them significantly slower on Apple Silicon — 50% slower than K-Quants despite similar quality. The extra memory access is CPU-bound and doesn't leverage Apple GPU efficiently. Use K-Quants instead.

Practical Recommendation Summary

Use Case	Recommended Format	Why
General-purpose, chat, exploration	GGUF Q4_K_M	Best balance, widest model availability, snappy prefill
Agentic, RAG, long system prompts	GGUF Q4_K_M	Prefill latency critical for usability
Long-form content generation	MLX 4-bit	Generation speed dominates total time
Brief queries with massive outputs	MLX 4-bit	High tok/s matters when generating thousands of tokens
Long-context, high-throughput	oMLX	MLX speeds with GGUF-like prefill latency
Cross-platform deployment	GGUF	Runs on NVIDIA, AMD, Apple, or CPU

💡 The Bottom Line For most Apple Silicon users running local LLMs, GGUF Q4_K_M is the right choice. Use MLX 4-bit only when you specifically need maximum generation throughput for short-context tasks where prefill speed is less critical. Avoid I-Quants on Apple Silicon — K-Quants are faster and nearly as accurate.

References

GGUF Format Internals (Medium, Dec 2025) — Comprehensive technical deep dive into GGUF structure, quantization families, and the math behind block-based quantization. ↗ Medium
r/LocalLLaMA — GGUF Quant Methods — Community insights on K-Quants vs I-Quants, performance on Apple Silicon, and practical recommendations for Mac users. ↗ Reddit
MLX Official Documentation (v0.31.1) — Apple's official docs on MLX quantization, affine group quantization, and model conversion workflow. ↗ MLX Docs
mlx-community on HuggingFace — 3,000+ pre-converted MLX models with detailed model cards and quantization information. ↗ HuggingFace
famstack.dev — "57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon" (Mar 2026) — Detailed real-world benchmark on M1 Max Mac Studio covering MLX prefill problem, K-Quants vs I-Quants performance, and oMLX tiered KV cache results. ↗ famstack.dev
r/LocalLLaMA — K-Quant Suffixes Explained — Explanation of _XS, _S, _M, _L suffixes and importance matrix usage for optimal quantization. ↗ Reddit