🍎 Apple Silicon MLX & LLM Inference: The Complete Guide

How MLX works, why Ollama's throughput lags on Mac, and a practical optimization guide for running LLMs on Apple Silicon — with real benchmark numbers.

March 17, 2026 · 14 min read

You've got a Mac Studio M3 Ultra with 256GB of unified memory. You fire up Ollama, pull a 70B model, and watch it serve tokens at 18 tok/s. Meanwhile you read a blog post claiming 57 tok/s on the same hardware. What's going on?

The answer involves three intersecting layers: Apple's MLX framework, the model format you're using (GGUF vs MLX), and the inference server sitting in front of it all. Understanding how these interact is the difference between a sluggish local LLM and one that actually delivers.

This guide covers all three layers — how MLX works at the hardware level, where Ollama's performance ceiling comes from, what the benchmark numbers actually mean for real workloads, and a practical optimization checklist for Mac Studio users.

What Is MLX and Why Apple Built It

MLX is an open-source array computation framework developed by Apple, released in late 2023, and designed from the ground up for Apple Silicon hardware. It is not a general-purpose ML framework like PyTorch or TensorFlow. It's a hardware-native framework that treats Apple Silicon's unique architecture as a first-class constraint rather than an afterthought.^[3]

To understand why Apple built MLX, you need to understand what's wrong with the alternatives:

PyTorch MPS: Adapts CUDA-style operations to Metal. It's a translation layer — designed for NVIDIA-style GPU memory separation, ported to Apple's unified memory model. The abstraction leaks. Memory constraints hit large models early, and long-context behavior degrades badly.^[2]
llama.cpp (via Ollama): Written in C++, cross-platform, uses GGUF format. Metal support was added as a backend, and it works well, but it's optimized for the general case across many platforms rather than tuned specifically for Apple Silicon's unified memory and GPU architecture.
TensorFlow: Apple's tensorflow-metal plugin provides GPU acceleration, but it sees little use for LLM inference. Effectively dead on Apple Silicon for LLM work.

MLX fills the gap. Its key design principles:

Unified memory as a first-class citizen: CPU and GPU share the same memory pool. MLX operations can run on either compute unit without any memory copy. There is no model.to('cuda') equivalent — the model is already accessible to both.
Lazy evaluation with graph optimization: MLX builds a computation graph and optimizes it before execution, similar to JAX. This enables kernel fusion and avoids redundant memory passes.
NumPy-compatible API: Low barrier for anyone coming from Python scientific computing. The neural net (mlx.nn) and optimizer (mlx.optimizers) packages follow familiar conventions.
Hardware-specific kernels: The compute kernels are written and tuned for Apple's GPU architecture, not adapted from x86 or NVIDIA paths.

For LLM inference specifically, the package is mlx-lm. Install it with pip install mlx-lm and you can run most HuggingFace models directly:

# Chat with a local MLX model
mlx_lm.chat --model mlx-community/Qwen2.5-72B-Instruct-4bit

# Convert and quantize a HuggingFace model to MLX format
mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct -q --upload-repo mlx-community/...

The mlx-community organization on HuggingFace maintains pre-converted, pre-quantized versions of most popular models. This is where most users start.

ℹ️ M5 Neural Accelerators The latest M5 chip introduced dedicated Neural Accelerators inside its GPU: hardware units for matrix multiplication that are critical for the prefill phase (processing input tokens). MLX natively targets these units when available, delivering measurably faster time-to-first-token on M5 hardware compared to M3/M4.^[3]

How Unified Memory Changes Everything

Unified memory is the single most important architectural difference between Apple Silicon and every other platform you might run LLMs on. Understanding it physically — not just as a marketing phrase — matters for optimization.

The Traditional GPU Memory Model

On a typical x86 + NVIDIA system, GPU and CPU have separate memory pools connected by PCIe. A 72B model at 4-bit quantization requires roughly 36GB of weights. To run it on an NVIDIA GPU with 48GB VRAM, the weights must be:

Loaded from disk into system RAM
Copied over PCIe bus to VRAM (PCIe 5.0 x16 peaks at ~64 GB/s, typically 20–40 GB/s effective)
Accessed by the GPU during inference

For models that exceed VRAM, weights spill to system RAM and are pulled over PCIe on demand. This is catastrophically slow for LLM inference: heavily offloaded NVIDIA setups often drop to low single-digit tok/s, and in the worst cases below 1 tok/s.

The Apple Silicon Model

On Apple Silicon, there is one memory pool. The M3 Ultra's 256GB is accessible to both the CPU cores and the GPU cores at the same bandwidth, up to 800 GB/s on the Ultra. The GPU is not a separate device with its own memory; it's a separate set of compute units in the same package (the Ultra joins two dies), sharing the same memory controllers.

The practical implications:

No PCIe bottleneck: Moving data between CPU and GPU computations costs nothing — it's the same memory.
Model capacity = total RAM: A 256GB Mac Studio can load a 200GB+ model in full precision. No NVIDIA GPU setup comes close at that price point.
Memory bandwidth dominates inference throughput: LLM token generation is memory-bandwidth-bound, not compute-bound. At 800 GB/s, the M3 Ultra can move model weights to GPU compute units extremely fast. This is where Apple Silicon's tok/s advantage comes from.
Prefill is still compute-bound: Processing the input prompt (prefill) involves large matrix multiplications, which are compute-intensive. This is where Apple Silicon's lack of dedicated tensor cores (compared to NVIDIA's A100/H100) shows up as TTFT latency.
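
The bandwidth argument above can be turned into a back-of-the-envelope ceiling. The following is a minimal sketch, assuming every generated token requires reading all active weights from memory once, and ignoring KV cache reads, compute time, and the gap between peak and achievable bandwidth. The function name is illustrative; the 36GB and 800 GB/s inputs are the figures used elsewhere in this guide.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode tokens/s when each token reads all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


if __name__ == "__main__":
    # Dense 72B model at 4-bit (~36 GB of weights) with 800 GB/s of bandwidth.
    print(f"800 GB/s: {decode_ceiling_tps(800, 36):.1f} tok/s ceiling")
    # The same weights with 400 GB/s of bandwidth: the ceiling halves.
    print(f"400 GB/s: {decode_ceiling_tps(400, 36):.1f} tok/s ceiling")
```

Real throughput lands below this ceiling. Mixture-of-experts models read only their active experts per token, which is one likely reason a model with a small active parameter count can generate much faster than a dense model of similar total size.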

⚠️ Memory Pressure Matters macOS dynamically allocates unified memory between the OS, apps, and ML workloads. If your system is under memory pressure (swap active), inference performance degrades sharply. On a 256GB Mac Studio running large models, close memory-hungry apps and check vm_stat for swap activity before benchmarking.

Hardware	Memory Bandwidth	Max Model Size	CPU↔GPU Transfer
M3 Ultra (256GB)	800 GB/s	~230GB (4-bit 72B = 36GB)	Zero-copy
M3 Max (128GB)	400 GB/s	~115GB	Zero-copy
M4 Max (128GB)	546 GB/s	~115GB	Zero-copy
NVIDIA RTX 4090 (24GB)	1,008 GB/s	24GB (no overflow)	PCIe 4.0 (32 GB/s)
NVIDIA A100 (80GB)	2,000 GB/s	80GB (no overflow)	PCIe 4.0/NVLink

The RTX 4090 wins on raw bandwidth and absolute throughput for models that fit in 24GB. The Mac Studio wins for model capacity and the ability to run large models without performance cliffs. The M3 Ultra at 800 GB/s is genuinely competitive for token generation on 70B+ models.

The Model Format Battle: MLX vs GGUF

When people talk about "using MLX" vs "using Ollama," they're often conflating two different things: the inference engine and the model format. They're related but separable.

GGUF (llama.cpp's format)

GGUF is the cross-platform model format used by llama.cpp, Ollama, LM Studio's llama.cpp engine, and most other inference servers. Key properties:

Self-contained: weights, metadata, and tokenizer in one file
Multiple quantization levels: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
Runs on any platform: NVIDIA, AMD, Apple, CPU-only
Wide ecosystem: almost every new open model gets a GGUF release within 24 hours

MLX Format (.safetensors + .npz)

MLX models are not a single-file format. They're a directory containing weights (SafeTensors in current mlx-lm conversions; some older conversions used compressed NumPy .npz archives), config, and tokenizer files in a layout that MLX's runtime loads directly.

Apple Silicon only: no portability to other hardware
Quantization via mlx_lm.convert: supports 4-bit and 8-bit
Lazy loading: MLX loads weights on-demand, which can affect cold-start latency
Hardware-optimized kernels: the compute path is tuned specifically for Apple GPU architecture

When Each Format Wins

Scenario	Winner	Why
Max generation throughput (tok/s)	MLX 4-bit	Hardware-tuned kernels, better memory bandwidth utilization
Short prompts, quick responses	GGUF Q4_K_M	Lower TTFT, faster prefill for typical prompt sizes
Long-context agentic workloads (>4K tokens)	GGUF or oMLX	MLX prefill latency grows badly with context; GGUF, or oMLX with its tiered KV cache, handles this better
Model availability	GGUF	Wider selection, faster community releases
Portability across machines	GGUF	Single file, runs everywhere
Quality at same size	Roughly equal	Both Q4/4-bit quantizations produce similar output quality

💡 Tip: Start with GGUF Q4_K_M For most workloads, GGUF Q4_K_M through Ollama or LM Studio gives the best balance of throughput, TTFT, and availability. Switch to MLX only when you specifically need maximum generation throughput for short-context, generation-heavy tasks.

Ollama on Apple Silicon: Strengths and Hidden Costs

Ollama is the default recommendation for running local LLMs — and for good reason. Its developer ergonomics are exceptional: a single CLI, automatic model management, an OpenAI-compatible REST API, and zero configuration. For most use cases, it just works.

But Ollama has a specific performance profile on Apple Silicon that you need to understand before you benchmark it against alternatives.

How Ollama Works on Mac

Ollama is written in Go. On Apple Silicon, it shells out to llama.cpp with Metal GPU acceleration enabled. The pipeline is:

Ollama API request (Go)
    → llama.cpp runner process
        → Metal GPU backend
            → Apple Silicon GPU (M3 Ultra cores)

This is important: Ollama is a wrapper around llama.cpp, not a native inference engine. Every API request goes through Go's HTTP server and is handed to a C++ subprocess, which manages the Metal compute context.

The 38% Go Wrapper Tax

Real-world benchmarks on an M1 Max Mac Studio (Qwen3.5-35B-A3B, Q4_K_M) showed a significant gap between Ollama and LM Studio running the same GGUF model with the same llama.cpp engine:^[5]

LM Studio (llama.cpp direct): 29 tok/s
Ollama (llama.cpp wrapper): 18 tok/s

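That's a 38% throughput gap between Ollama and LM Studio's llama.cpp engine. The source attributes it to the Go wrapper's overhead and process isolation model, since both run llama.cpp on the same GGUF file, though differences in the bundled llama.cpp version and default settings may also contribute.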

This overhead comes from several sources:

Go HTTP server latency per token streaming event
IPC between the Go process and the llama.cpp runner subprocess
JSON serialization/deserialization at the API boundary
Model management overhead (Ollama handles model loading/unloading logic in Go)
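
For reference, the 38% figure follows directly from the two measurements quoted above. A small sketch of the arithmetic (the function name is illustrative):

```python
def throughput_penalty(baseline_tps: float, measured_tps: float) -> float:
    """Fraction of throughput lost relative to the baseline (0.0 means no loss)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (baseline_tps - measured_tps) / baseline_tps


# LM Studio (29 tok/s) as the baseline, Ollama (18 tok/s) as the measured value.
penalty = throughput_penalty(baseline_tps=29, measured_tps=18)
print(f"Ollama penalty: {penalty:.0%}")

# The same gap seen from the other side: how much faster the baseline is.
print(f"LM Studio speedup: {29 / 18:.2f}x")
```

Note the asymmetry: an 18 to 29 tok/s change is a 38% loss in one direction but roughly a 1.6x gain in the other, so be explicit about which baseline a quoted percentage uses.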

Where Ollama Still Wins

Despite the throughput penalty, Ollama has real advantages:

OpenAI-compatible API: Works with LangChain, OpenClaw, Continue.dev, and every other tool that speaks the OpenAI API format — zero config.
Model management: ollama pull, ollama list, automatic version tracking. Far easier than managing GGUF files manually.
Multi-model serving: Ollama can hold multiple models in memory and switch between them on demand.
Modelfiles: System prompt configuration, parameter defaults, and model metadata in a simple declarative format.
Keep-alive: Models stay loaded between requests, eliminating cold-start latency for repeated calls.

ℹ️ Ollama MLX Backend: Coming Soon Ollama has officially started work on an MLX backend (PR #9118). When enabled with OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve, it will use Apple's MLX inference engine instead of llama.cpp. This could eliminate the GGUF-to-MLX format gap and give Ollama users native MLX throughput while keeping all of Ollama's ergonomic advantages. As of early 2026, the PR is in active development.^[6]

Benchmark Reality Check

The most dangerous number in local LLM benchmarking is the headline tok/s figure. Here's why it lies, and what to measure instead.

Generation Speed vs. Time-to-First-Token

LLM inference has two distinct phases:

Prefill: Process the input prompt. All input tokens are computed in parallel, which makes this the compute-intensive phase. Time: roughly linear in prompt length for short prompts, growing faster at long contexts as attention cost rises. Output: the KV cache + first output token.
Decode (generation): Generate output tokens one at a time, each conditioned on previous tokens. This is memory-bandwidth-bound — the bottleneck is how fast you can load the model weights from memory for each new token.

When benchmarks report "57 tok/s," they're measuring decode speed on short prompts. The prefill latency is often not reported — but it completely dominates real-world experience for anything longer than a few hundred tokens.

The MLX Prefill Problem

Empirical benchmarks on M1 Max Mac Studio with Qwen3.5-35B-A3B reveal a stark pattern:^[5]

Engine	Generation Speed	Prefill (1K tokens)	Total (1K in, 200 out)
MLX 4-bit (LM Studio)	57 tok/s	~15–20s	~19s
GGUF Q4_K_M (LM Studio)	29 tok/s	~3–5s	~11s
Ollama (GGUF)	18 tok/s	~3–5s	~16s
oMLX (tiered KV cache)	~55 tok/s	~1.7s at 8K	~5s

MLX wins generation. GGUF wins end-to-end on longer contexts. oMLX changes the equation entirely.
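
The "Total" column can be reproduced with a simple two-phase model: prefill time plus output tokens divided by decode speed. The sketch below assumes the prefill ranges and generation speeds from the table and a 200-token output; the function names are illustrative, and the model ignores network and sampling overhead.

```python
def end_to_end_seconds(prefill_s: float, output_tokens: int, decode_tps: float) -> float:
    """Wall-clock time for one request: prefill, then token-by-token decode."""
    return prefill_s + output_tokens / decode_tps


def effective_tps(prefill_s: float, output_tokens: int, decode_tps: float) -> float:
    """Output tokens divided by total wall-clock time: the rate a user experiences."""
    return output_tokens / end_to_end_seconds(prefill_s, output_tokens, decode_tps)


# (engine, decode tok/s, (prefill low, prefill high) in seconds for a 1K-token prompt)
ROWS = [
    ("MLX 4-bit (LM Studio)", 57, (15.0, 20.0)),
    ("GGUF Q4_K_M (LM Studio)", 29, (3.0, 5.0)),
    ("Ollama (GGUF)", 18, (3.0, 5.0)),
]

for name, tps, (low, high) in ROWS:
    fastest = end_to_end_seconds(low, 200, tps)
    slowest = end_to_end_seconds(high, 200, tps)
    print(
        f"{name}: {fastest:.1f}-{slowest:.1f} s total, "
        f"{effective_tps(high, 200, tps):.1f}-{effective_tps(low, 200, tps):.1f} tok/s effective"
    )
```

The effective rate is the number to compare across engines: a high decode speed is diluted quickly once prefill takes longer than the decode phase itself.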

What Actually Matters for Your Workload

For interactive chat (short prompts, conversational context): Generation speed matters most. MLX or GGUF Q4_K_M through LM Studio are both good choices. Ollama is fine if ergonomics matter more than the last 40% of throughput.

For agentic workloads (tool use, JSON history, long system prompts): Prefill latency dominates. In the benchmark above, even a 1K-token prompt takes 15–20 seconds to prefill on MLX versus 3–5 seconds on GGUF, and agentic prompts of 4K tokens or more only widen the gap. That makes MLX feel broken even though its tok/s number is higher. Use GGUF Q4_K_M or oMLX for this use case.

For batch processing (document classification, summarization pipelines): Total throughput matters. MLX may win here when prompts are short relative to outputs, because its high generation speed then outweighs the prefill cost. For jobs with long inputs and short outputs, such as classifying whole documents, prefill dominates and the GGUF path may finish sooner. If you choose MLX, use the MLX server directly (mlx_lm.server) rather than going through a wrapper.

⚠️ Benchmark Your Actual Workload The specific model (Qwen3.5-35B-A3B has known issues with MLX due to hybrid attention and bf16 weights on M1), hardware generation, and context window size all shift which engine wins. Don't trust generic benchmarks — measure your prompt distribution on your hardware.
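
One way to measure your own workload is to read the timing fields Ollama returns with a non-streaming /api/generate response. The sketch below assumes the fields prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, and load_duration, with durations in nanoseconds, as Ollama's API documentation describes them; verify against your installed version. The helper names, the default host, and the model tag are illustrative.

```python
import json
import urllib.request

NS_PER_S = 1e9  # assumption: Ollama reports durations in nanoseconds


def _rate(count: float, seconds: float) -> float:
    """Tokens per second, or 0.0 when no time was recorded (e.g. a cached prompt)."""
    return count / seconds if seconds > 0 else 0.0


def summarize_ollama_timings(resp: dict) -> dict:
    """Split one response's timing fields into prefill and decode metrics."""
    prefill_s = resp.get("prompt_eval_duration", 0) / NS_PER_S
    decode_s = resp.get("eval_duration", 0) / NS_PER_S
    return {
        "load_s": resp.get("load_duration", 0) / NS_PER_S,
        "prefill_s": prefill_s,
        "prefill_tps": _rate(resp.get("prompt_eval_count", 0), prefill_s),
        "decode_tps": _rate(resp.get("eval_count", 0), decode_s),
    }


def run_once(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a local Ollama server and summarize it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return summarize_ollama_timings(json.load(r))


if __name__ == "__main__":
    # Requires a running Ollama server with this model pulled.
    print(run_once("qwen2.5:72b", "Summarize unified memory in two sentences."))
```

Run it with prompts drawn from your real traffic, at several lengths, and compare prefill_s and decode_tps rather than a single headline number.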

When to Use MLX vs Ollama vs llama.cpp

This is a decision framework, not a ranking. Each option wins in different scenarios.

Use MLX Directly (`mlx_lm.server`)

You need maximum generation throughput for generation-heavy tasks (short prompts, long outputs)
You're running on M3/M4/M5 hardware where MLX kernels are most optimized
You're willing to manage model files manually (no ollama pull)
You don't need multi-model serving
Context windows are consistently short (<2K tokens)

Use Ollama

You want OpenAI API compatibility with zero configuration
You're integrating with existing tools (LangChain, Continue.dev, OpenClaw, etc.)
You need multi-model serving and easy model management
Developer ergonomics matter more than raw throughput
You accept the 38% throughput penalty for convenience

Use llama.cpp Directly (or LM Studio)

You want the best GGUF performance without Ollama's Go overhead
You need fine-grained control over GPU layer offloading (-ngl)
You want the widest model compatibility (every GGUF variant supported)
LM Studio (which wraps llama.cpp) adds a good UI without much overhead

Use oMLX

You have long-context workloads that break MLX's prefill performance
You want MLX generation speeds with GGUF-competitive TTFT
You're comfortable with an experimental inference server
The tiered KV cache (SSD-backed) is acceptable for your use case

Optimization Guide for Mac Studio

Practical steps to maximize LLM performance on a Mac Studio M3 Ultra (256GB).

1. Choose the Right Quantization Level

Quantization reduces model size at the cost of some quality. On Apple Silicon, the memory bandwidth benefit of smaller models often outweighs quality loss up to a point:

Quantization	Size (70B model)	Quality Impact	Recommendation
F16 (16-bit)	~140GB	None	Use if RAM allows — baseline quality
Q8_0	~74GB	Negligible	Good quality/size tradeoff, MLX 8-bit
Q4_K_M (GGUF)	~40GB	Small	Best practical choice for most use cases
MLX 4-bit	~35GB	Small	Best generation speed on Apple Silicon
Q3_K_M	~30GB	Moderate	Use only if you need to fit a larger model
Q2_K	~25GB	Significant	Avoid — quality degradation is too high

With 256GB on your Mac Studio, you can comfortably fit a 70B model at Q8_0 (74GB) or even a 200B+ model at 4-bit. Don't over-quantize just to save memory you have.
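
The sizes in the table follow from parameter count times bits per weight. A rough sketch, assuming a nominal bits-per-weight figure for each format (real files add per-block scales and metadata, so they run somewhat larger); the function name is illustrative, and the 8.5 bits for Q8_0 assumes one 16-bit scale per 32-weight block:

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), excluding KV cache."""
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("inputs must be positive")
    return params_billions * bits_per_weight / 8


for label, bits in [("F16", 16), ("Q8_0 (assumed 8.5 bpw)", 8.5), ("nominal 4-bit", 4)]:
    print(f"70B at {label}: ~{weight_size_gb(70, bits):.0f} GB")
```

Leave headroom beyond the weights for the KV cache, the OS, and other apps when deciding what fits.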

2. Maximize GPU Layer Offloading (GGUF/Ollama)

For llama.cpp and Ollama, the -ngl (number of GPU layers) parameter determines how much of the model runs on the Metal GPU vs CPU. On Mac Studio, set this to max:

# llama.cpp: offload all layers to GPU
./llama-cli -m model.gguf -ngl 99 -p "your prompt"

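# Ollama: num_gpu is a model option; pass it per request...
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:72b",
  "prompt": "hello",
  "options": {"num_gpu": 999}
}'

# ...or set it in a Modelfile:
FROM qwen2.5:72b
PARAMETER num_gpu 999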

On a Mac Studio with 256GB unified memory, there's no reason to have any layers on CPU — everything fits in memory with bandwidth to spare.

3. Context Window and KV Cache

The KV cache stores attention keys/values for all tokens in the current context. Its memory footprint is:

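KV cache size = 2 × layers × kv_heads × head_dim × context_length × bytes_per_element

The factor of 2 covers keys and values. With grouped-query attention, kv_heads is the number of key/value heads, which is usually much smaller than the number of attention heads.

For a 72B model with a 32K context window, the KV cache alone can consume 8–16GB. The cache precision is independent of weight quantization: a 4-bit model typically still keeps a 16-bit cache unless KV cache quantization is enabled. Larger context = more memory pressure = slower inference.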
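
The formula above is easy to check numerically. The sketch below assumes a model geometry of 80 layers, 64 attention heads, 8 key/value heads, and a head dimension of 128, which is my understanding of Qwen2.5-72B's published config; treat these values as assumptions and read the real ones from your model's config.json. It also assumes a 16-bit cache (2 bytes per element).

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer and cached token."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


# Assumed geometry: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_len=32_768)
print(f"GQA, 32K context: {gqa / GIB:.1f} GiB")

# The same model if every one of the 64 attention heads kept its own K/V.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, context_len=32_768)
print(f"No GQA, 32K context: {mha / GIB:.1f} GiB")
```

The cache grows linearly with context length, so halving the configured context halves this footprint.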

Practical guidelines:

Set context length to match your actual use case, not the model's maximum
For interactive chat: 4K–8K is usually sufficient
For document processing: match the document size, not the model maximum
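Check what is loaded with Ollama's /api/ps endpoint, which lists resident models and their memory footprint. It does not report KV cache hits. Keeping the start of the prompt identical across requests lets the server reuse the cached prefix, which dramatically improves repeated-context performance.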

4. Keep Models Loaded

Cold-start model loading on a 70B model can take 10–30 seconds. Ollama keeps a model in memory for 5 minutes after the last request by default and then unloads it, while mlx_lm.server holds its loaded model for as long as the process runs. Make sure you're not triggering unloads:

# Ollama: set keep-alive to permanent
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Or per-request:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:72b",
  "keep_alive": -1,
  "prompt": "hello"
}'

5. Monitor GPU Utilization

# Real-time GPU and memory stats on macOS
sudo powermetrics --samplers gpu_power -i 1000

# Check unified memory pressure and swap activity
vm_stat | grep -E "Pages|Swap"
sysctl vm.swapusage

# Ollama model status
curl http://localhost:11434/api/ps | python3 -m json.tool

Watch for GPU utilization dropping below 80% during inference. A memory-bandwidth-bound decode still keeps the GPU busy, so a low reading more likely means the GPU is waiting on something else: swap activity, CPU-side work such as tokenization and sampling, or wrapper overhead between tokens.

6. Batch Size and Parallelism

For single-user interactive use, batch size 1 is optimal — it minimizes latency. For serving multiple concurrent users or batch processing pipelines, increase the batch size:

# MLX server: start an OpenAI-compatible HTTP server
mlx_lm.server --model mlx-community/Qwen2.5-72B-Instruct-4bit --port 8080

# llama.cpp: parallel sequences
./llama-server -m model.gguf -ngl 99 --parallel 4 --ctx-size 32768

vllm-mlx demonstrates that continuous batching at 16 concurrent requests achieves 4.3x aggregate throughput on the same hardware — a significant win for multi-user serving scenarios.^[2]
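
An aggregate speedup does not mean each client gets faster. A small sketch of the trade-off, assuming the aggregate gain is shared evenly across concurrent requests; the 50 tok/s single-stream figure is a hypothetical placeholder, not a measurement from the cited paper:

```python
def per_request_tps(single_stream_tps: float, aggregate_speedup: float,
                    concurrent: int) -> float:
    """Average decode rate each client sees when the aggregate gain is shared evenly."""
    if concurrent < 1:
        raise ValueError("concurrent must be at least 1")
    return single_stream_tps * aggregate_speedup / concurrent


single = 50.0  # hypothetical single-stream decode speed in tok/s
each = per_request_tps(single, aggregate_speedup=4.3, concurrent=16)
print(f"aggregate: {single * 4.3:.0f} tok/s, per request: {each:.1f} tok/s")
```

Batching trades per-user latency for total throughput, which is the right trade for multi-user serving and batch pipelines but not for a single interactive session.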

Future: Ollama MLX Backend and vllm-mlx

Two developments will significantly change the landscape for Apple Silicon LLM inference in 2026.

Ollama MLX Backend (PR #9118)

Ollama's most requested feature on Apple Silicon is an MLX backend. The PR is live at github.com/ollama/ollama/pull/9118. When complete, this will replace the llama.cpp runner with an MLX inference engine while keeping all of Ollama's API and model management layer intact.

The practical impact: Ollama users would get MLX's generation throughput without giving up ollama pull, the OpenAI API, or multi-model serving. The 38% Go wrapper overhead would likely decrease as the integration matures. To try the early version:

OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve

Expect instability — it's early stage. But the direction is clear.

vllm-mlx: Continuous Batching for Apple Silicon

vllm-mlx is a framework built natively on MLX that brings production-grade serving capabilities to Apple Silicon:^[2]

Continuous batching: 4.3x aggregate throughput at 16 concurrent requests vs single-request serving
Content-based prefix caching: Eliminates redundant vision encoder passes for multimodal models — 28x speedup on repeated images
Text throughput: Up to 525 tok/s on M4 Max
21–87% higher throughput than llama.cpp across Qwen3-0.6B to Nemotron-30B

For developers running multi-user inference servers on Mac Studio hardware — this is the path to production-grade Apple Silicon serving. It's open source and worth watching closely.

oMLX: Solving the Prefill Problem

oMLX (github.com/jundot/omlx) is an MLX inference server with a tiered KV cache that persists to SSD. On the specific benchmark (M1 Max, Qwen3.5-35B-A3B, 8K context):

Prefill: 49s → 1.7s (29x speedup)
Effective throughput: 6 tok/s → 30 tok/s
Generation speed: matches LM Studio's MLX engine (~55 tok/s)

The SSD-backed KV cache means that repeated similar prompts (common in agentic loops with a large system prompt) hit the cache rather than recomputing. On an M3 Ultra with fast NVMe storage, this should be particularly effective, though the benchmark above was run on an M1 Max.
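
To make the idea of prefix reuse concrete, here is a minimal conceptual sketch of a block-hashed prefix cache. It is not oMLX's implementation: the class, the block size, and the in-memory dictionary (standing in for an SSD tier) are all hypothetical, and the stored "KV blocks" are placeholder objects.

```python
import hashlib
from typing import Dict, List, Sequence


def block_keys(tokens: Sequence[int], block: int) -> List[str]:
    """Chain-hash full blocks so each key identifies the whole prefix up to that block."""
    keys: List[str] = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for start in range(0, full, block):
        chunk = ",".join(str(t) for t in tokens[start:start + block])
        running.update(chunk.encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


class PrefixCache:
    """Hypothetical cache mapping prefix-block hashes to stored KV blocks."""

    def __init__(self, block: int = 4) -> None:
        self.block = block
        self._store: Dict[str, object] = {}  # stand-in for an SSD-backed tier

    def insert(self, tokens: Sequence[int], kv_blocks: Sequence[object]) -> None:
        for key, kv in zip(block_keys(tokens, self.block), kv_blocks):
            self._store[key] = kv

    def cached_prefix_len(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV blocks can be reused without prefill."""
        hit = 0
        for key in block_keys(tokens, self.block):
            if key not in self._store:
                break
            hit += self.block
        return hit


cache = PrefixCache(block=4)
system_prompt = list(range(8))                     # shared 8-token system prompt
cache.insert(system_prompt + [100, 101, 102, 103], ["kv0", "kv1", "kv2"])
reused = cache.cached_prefix_len(system_prompt + [200, 201, 202, 203])
print(f"tokens served from cache: {reused}")
```

Because each key depends on every token before it, a change early in the prompt invalidates everything after it. That is why keeping the system prompt and tool definitions byte-identical at the start of each request matters for cache hit rates.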

References

[1] Shrivastava et al., Production-Grade Local LLM Inference on Apple Silicon (arXiv:2511.05502). Systematic benchmark of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS on M2 Ultra Mac Studio. Covers TTFT, throughput, latency percentiles, and long-context behavior. arxiv.org
[2] vllm-mlx: Native LLM and MLLM Inference at Scale on Apple Silicon (arXiv:2601.19139). vllm-mlx framework with continuous batching and multimodal prefix caching. 525 tok/s on M4 Max, 4.3x aggregate throughput at 16 concurrent requests. arxiv.org
[3] Apple ML Research, Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU. Official Apple research post on MLX architecture, M5 Neural Accelerators, and MLX LM usage. machinelearning.apple.com
[4] Markus Schall, Local AI with MLX on the Mac (Nov 2025). Practical comparison of MLX, Ollama, llama.cpp, and LM Studio on Apple Silicon. Covers model formats and deployment workflows. markus-schall.de
[5] famstack.dev, "57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon" (Mar 2026). Detailed real-world benchmark on M1 Max Mac Studio. Covers MLX prefill problem, Ollama wrapper overhead (38%), and oMLX tiered KV cache results. famstack.dev
[6] r/ollama, "Ollama is getting MLX support!" (Mar 2025). Community thread covering Ollama's official MLX backend work, PR #9118, and how to try the early version. reddit.com
