You've got a Mac Studio M3 Ultra with 256GB of unified memory. You fire up Ollama, pull a 70B model, and watch it serve tokens at 18 tok/s. Meanwhile you read a blog post claiming 57 tok/s on the same hardware. What's going on?
The answer involves three intersecting layers: Apple's MLX framework, the model format you're using (GGUF vs MLX), and the inference server sitting in front of it all. Understanding how these interact is the difference between a sluggish local LLM and one that actually delivers.
This guide covers all three layers โ how MLX works at the hardware level, where Ollama's performance ceiling comes from, what the benchmark numbers actually mean for real workloads, and a practical optimization checklist for Mac Studio users.
What Is MLX and Why Apple Built It
MLX is an open-source array computation framework developed by Apple, released in late 2023, and designed from the ground up for Apple Silicon hardware. It is not a general-purpose ML framework like PyTorch or TensorFlow. It's a hardware-native framework that treats Apple Silicon's unique architecture as a first-class constraint rather than an afterthought.[3]
To understand why Apple built MLX, you need to understand what's wrong with the alternatives:
- PyTorch MPS: Adapts CUDA-style operations to Metal. It's a translation layer โ designed for NVIDIA-style GPU memory separation, ported to Apple's unified memory model. The abstraction leaks. Memory constraints hit large models early, and long-context behavior degrades badly.[2]
- llama.cpp (via Ollama): Written in C++, cross-platform, uses GGUF format. Metal support was added as a backend, and it works well โ but it's optimized for the general case, not for Apple Silicon's specific memory bandwidth and Neural Engine characteristics.
- TensorFlow: No real Metal support for inference. Effectively dead on Apple Silicon for LLM work.
MLX fills the gap. Its key design principles:
- Unified memory as a first-class citizen: CPU and GPU share the same memory pool. MLX operations can run on either compute unit without any memory copy. There is no
model.to('cuda')equivalent โ the model is already accessible to both. - Lazy evaluation with graph optimization: MLX builds a computation graph and optimizes it before execution, similar to JAX. This enables kernel fusion and avoids redundant memory passes.
- NumPy-compatible API: Low barrier for anyone coming from Python scientific computing. The neural net (
mlx.nn) and optimizer (mlx.optimizers) packages follow familiar conventions. - Hardware-specific kernels: The compute kernels are written and tuned for Apple's GPU architecture, not adapted from x86 or NVIDIA paths.
For LLM inference specifically, the package is mlx-lm. Install it with pip install mlx-lm and you can run most HuggingFace models directly:
# Chat with a local MLX model
mlx_lm.chat --model mlx-community/Qwen2.5-72B-Instruct-4bit
# Convert and quantize a HuggingFace model to MLX format
mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct -q --upload-repo mlx-community/...
The mlx-community organization on HuggingFace maintains pre-converted, pre-quantized versions of most popular models. This is where most users start.
How Unified Memory Changes Everything
Unified memory is the single most important architectural difference between Apple Silicon and every other platform you might run LLMs on. Understanding it physically โ not just as a marketing phrase โ matters for optimization.
The Traditional GPU Memory Model
On a typical x86 + NVIDIA system, GPU and CPU have separate memory pools connected by PCIe. A 72B model at 4-bit quantization requires roughly 36GB of weights. To run it on an NVIDIA GPU with 48GB VRAM, the weights must be:
- Loaded from disk into system RAM
- Copied over PCIe bus to VRAM (PCIe 5.0 x16 peaks at ~64 GB/s, typically 20โ40 GB/s effective)
- Accessed by the GPU during inference
For models that exceed VRAM, weights spill to system RAM and are pulled over PCIe on demand. This is catastrophically slow for LLM inference โ you'll see sub-1 tok/s on "offloaded" NVIDIA setups.
The Apple Silicon Model
On Apple Silicon, there is one memory pool. The M3 Ultra's 256GB is accessible to both the CPU cores and the GPU cores at the same bandwidth โ up to 800 GB/s on the Ultra. The GPU is not a separate device; it's a separate set of compute units on the same die, sharing the same memory controller.
The practical implications:
- No PCIe bottleneck: Moving data between CPU and GPU computations costs nothing โ it's the same memory.
- Model capacity = total RAM: A 256GB Mac Studio can load a 200GB+ model in full precision. No NVIDIA GPU setup comes close at that price point.
- Memory bandwidth dominates inference throughput: LLM token generation is memory-bandwidth-bound, not compute-bound. At 800 GB/s, the M3 Ultra can move model weights to GPU compute units extremely fast. This is where Apple Silicon's tok/s advantage comes from.
- Prefill is still compute-bound: Processing the input prompt (prefill) involves large matrix multiplications, which are compute-intensive. This is where Apple Silicon's lack of dedicated tensor cores (compared to NVIDIA's A100/H100) shows up as TTFT latency.
vm_stat for swap activity before benchmarking.
| Hardware | Memory Bandwidth | Max Model Size | CPUโGPU Transfer |
|---|---|---|---|
| M3 Ultra (256GB) | 800 GB/s | ~230GB (4-bit 72B = 36GB) | Zero-copy |
| M3 Max (128GB) | 400 GB/s | ~115GB | Zero-copy |
| M4 Max (128GB) | 546 GB/s | ~115GB | Zero-copy |
| NVIDIA RTX 4090 (24GB) | 1,008 GB/s | 24GB (no overflow) | PCIe 4.0 (32 GB/s) |
| NVIDIA A100 (80GB) | 2,000 GB/s | 80GB (no overflow) | PCIe 4.0/NVLink |
The RTX 4090 wins on raw bandwidth and absolute throughput for models that fit in 24GB. The Mac Studio wins for model capacity and the ability to run large models without performance cliffs. The M3 Ultra at 800 GB/s is genuinely competitive for token generation on 70B+ models.
The Model Format Battle: MLX vs GGUF
When people talk about "using MLX" vs "using Ollama," they're often conflating two different things: the inference engine and the model format. They're related but separable.
GGUF (llama.cpp's format)
GGUF is the cross-platform model format used by llama.cpp, Ollama, LM Studio's llama.cpp engine, and most other inference servers. Key properties:
- Self-contained: weights, metadata, and tokenizer in one file
- Multiple quantization levels: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
- Runs on any platform: NVIDIA, AMD, Apple, CPU-only
- Wide ecosystem: almost every new open model gets a GGUF release within 24 hours
MLX Format (.safetensors + .npz)
MLX models use a combination of SafeTensors and compressed NumPy (.npz) files. They're not a single-file format โ they're a directory containing weights, config, and tokenizer files in a layout that MLX's runtime loads directly.
- Apple Silicon only: no portability to other hardware
- Quantization via
mlx_lm.convert: supports 4-bit and 8-bit - Lazy loading: MLX loads weights on-demand, which can affect cold-start latency
- Hardware-optimized kernels: the compute path is tuned specifically for Apple GPU architecture
When Each Format Wins
| Scenario | Winner | Why |
|---|---|---|
| Max generation throughput (tok/s) | MLX 4-bit | Hardware-tuned kernels, better memory bandwidth utilization |
| Short prompts, quick responses | GGUF Q4_K_M | Lower TTFT, faster prefill for typical prompt sizes |
| Long-context agentic workloads (>4K tokens) | GGUF + oMLX | MLX prefill latency grows badly with context; GGUF or oMLX cache handles this better |
| Model availability | GGUF | Wider selection, faster community releases |
| Portability across machines | GGUF | Single file, runs everywhere |
| Quality at same size | Roughly equal | Both Q4/4-bit quantizations produce similar output quality |
Ollama on Apple Silicon: Strengths and Hidden Costs
Ollama is the default recommendation for running local LLMs โ and for good reason. Its developer ergonomics are exceptional: a single CLI, automatic model management, an OpenAI-compatible REST API, and zero configuration. For most use cases, it just works.
But Ollama has a specific performance profile on Apple Silicon that you need to understand before you benchmark it against alternatives.
How Ollama Works on Mac
Ollama is written in Go. On Apple Silicon, it shells out to llama.cpp with Metal GPU acceleration enabled. The pipeline is:
Ollama API request (Go)
โ llama.cpp runner process
โ Metal GPU backend
โ Apple Silicon GPU (M3 Ultra cores)
This is important: Ollama is a wrapper around llama.cpp, not a native inference engine. Every API request goes through Go's HTTP server, gets handed to a C++ subprocess, which manages the Metal compute context.
The 38% Go Wrapper Tax
Real-world benchmarks on an M1 Max Mac Studio (Qwen3.5-35B-A3B, Q4_K_M) showed a significant gap between Ollama and LM Studio running the same GGUF model with the same llama.cpp engine:[5]
- LM Studio (llama.cpp direct): 29 tok/s
- Ollama (llama.cpp wrapper): 18 tok/s
That's a 38% throughput penalty for Ollama over bare llama.cpp โ purely from the Go wrapper's overhead and process isolation model. The inference engine is identical; only the wrapper differs.
This overhead comes from several sources:
- Go HTTP server latency per token streaming event
- IPC between the Go process and the llama.cpp runner subprocess
- JSON serialization/deserialization at the API boundary
- Model management overhead (Ollama handles model loading/unloading logic in Go)
Where Ollama Still Wins
Despite the throughput penalty, Ollama has real advantages:
- OpenAI-compatible API: Works with LangChain, OpenClaw, Continue.dev, and every other tool that speaks the OpenAI API format โ zero config.
- Model management:
ollama pull,ollama list, automatic version tracking. Far easier than managing GGUF files manually. - Multi-model serving: Ollama can hold multiple models in memory and switch between them on demand.
- Modelfiles: System prompt configuration, parameter defaults, and model metadata in a simple declarative format.
- Keep-alive: Models stay loaded between requests, eliminating cold-start latency for repeated calls.
OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve, it will use Apple's MLX inference engine instead of llama.cpp. This could eliminate the GGUF-to-MLX format gap and give Ollama users native MLX throughput while keeping all of Ollama's ergonomic advantages. As of early 2026, the PR is in active development.[6]
Benchmark Reality Check
The most dangerous number in local LLM benchmarking is the headline tok/s figure. Here's why it lies, and what to measure instead.
Generation Speed vs. Time-to-First-Token
LLM inference has two distinct phases:
- Prefill: Process the input prompt. All input tokens are computed in parallel โ this is the compute-intensive phase. Time: linear in prompt length. Output: the KV cache + first output token.
- Decode (generation): Generate output tokens one at a time, each conditioned on previous tokens. This is memory-bandwidth-bound โ the bottleneck is how fast you can load the model weights from memory for each new token.
When benchmarks report "57 tok/s," they're measuring decode speed on short prompts. The prefill latency is often not reported โ but it completely dominates real-world experience for anything longer than a few hundred tokens.
The MLX Prefill Problem
Empirical benchmarks on M1 Max Mac Studio with Qwen3.5-35B-A3B reveal a stark pattern:[5]
| Engine | Generation Speed | Prefill (1K tokens) | Total (1K in, 200 out) |
|---|---|---|---|
| MLX 4-bit (LM Studio) | 57 tok/s | ~15โ20s | ~19s |
| GGUF Q4_K_M (LM Studio) | 29 tok/s | ~3โ5s | ~11s |
| Ollama (GGUF) | 18 tok/s | ~3โ5s | ~16s |
| oMLX (tiered KV cache) | ~55 tok/s | ~1.7s at 8K | ~5s |
MLX wins generation. GGUF wins end-to-end on longer contexts. oMLX changes the equation entirely.
What Actually Matters for Your Workload
For interactive chat (short prompts, conversational context): Generation speed matters most. MLX or GGUF Q4_K_M through LM Studio are both good choices. Ollama is fine if ergonomics matter more than the last 40% of throughput.
For agentic workloads (tool use, JSON history, long system prompts): Prefill latency dominates. A 4K-token prompt processed at 15โ20 seconds (MLX) vs 3โ5 seconds (GGUF) makes MLX feel broken even though its tok/s number is higher. Use GGUF Q4_K_M or oMLX for this use case.
For batch processing (document classification, summarization pipelines): Total throughput matters. MLX may win here โ the high generation speed compensates for prefill cost when running many short-output tasks. Use the MLX server directly (mlx_lm.server) rather than going through a wrapper.
When to Use MLX vs Ollama vs llama.cpp
This is a decision framework, not a ranking. Each option wins in different scenarios.
Use MLX Directly (mlx_lm.server)
- You need maximum generation throughput for generation-heavy tasks (short prompts, long outputs)
- You're running on M3/M4/M5 hardware where MLX kernels are most optimized
- You're willing to manage model files manually (no
ollama pull) - You don't need multi-model serving
- Context windows are consistently short (<2K tokens)
Use Ollama
- You want OpenAI API compatibility with zero configuration
- You're integrating with existing tools (LangChain, Continue.dev, OpenClaw, etc.)
- You need multi-model serving and easy model management
- Developer ergonomics matter more than raw throughput
- You accept the 38% throughput penalty for convenience
Use llama.cpp Directly (or LM Studio)
- You want the best GGUF performance without Ollama's Go overhead
- You need fine-grained control over GPU layer offloading (
-ngl) - You want the widest model compatibility (every GGUF variant supported)
- LM Studio (which wraps llama.cpp) adds a good UI without much overhead
Use oMLX
- You have long-context workloads that break MLX's prefill performance
- You want MLX generation speeds with GGUF-competitive TTFT
- You're comfortable with an experimental inference server
- The tiered KV cache (SSD-backed) is acceptable for your use case
Optimization Guide for Mac Studio
Practical steps to maximize LLM performance on a Mac Studio M3 Ultra (256GB).
1. Choose the Right Quantization Level
Quantization reduces model size at the cost of some quality. On Apple Silicon, the memory bandwidth benefit of smaller models often outweighs quality loss up to a point:
| Quantization | Size (70B model) | Quality Impact | Recommendation |
|---|---|---|---|
| F16 (16-bit) | ~140GB | None | Use if RAM allows โ baseline quality |
| Q8_0 | ~74GB | Negligible | Good quality/size tradeoff, MLX 8-bit |
| Q4_K_M (GGUF) | ~40GB | Small | Best practical choice for most use cases |
| MLX 4-bit | ~35GB | Small | Best generation speed on Apple Silicon |
| Q3_K_M | ~30GB | Moderate | Use only if you need to fit a larger model |
| Q2_K | ~25GB | Significant | Avoid โ quality degradation is too high |
With 256GB on your Mac Studio, you can comfortably fit a 70B model at Q8_0 (74GB) or even a 200B+ model at 4-bit. Don't over-quantize just to save memory you have.
2. Maximize GPU Layer Offloading (GGUF/Ollama)
For llama.cpp and Ollama, the -ngl (number of GPU layers) parameter determines how much of the model runs on the Metal GPU vs CPU. On Mac Studio, set this to max:
# llama.cpp: offload all layers to GPU
./llama-cli -m model.gguf -ngl 99 -p "your prompt"
# Ollama: set in Modelfile or environment
OLLAMA_NUM_GPU=999 ollama serve
# Or in Modelfile:
FROM qwen2.5:72b
PARAMETER num_gpu 999
On a Mac Studio with 256GB unified memory, there's no reason to have any layers on CPU โ everything fits in memory with bandwidth to spare.
3. Context Window and KV Cache
The KV cache stores attention keys/values for all tokens in the current context. Its memory footprint is:
KV cache size = 2 ร layers ร heads ร head_dim ร context_length ร precision
For a 72B model at 4-bit with a 32K context window, the KV cache alone can consume 8โ16GB. Larger context = more memory pressure = slower inference.
Practical guidelines:
- Set context length to match your actual use case, not the model's maximum
- For interactive chat: 4Kโ8K is usually sufficient
- For document processing: match the document size, not the model maximum
- Monitor KV cache hits with Ollama's
/api/psendpoint โ cache hits dramatically improve repeated-context performance
4. Keep Models Loaded
Cold-start model loading on a 70B model can take 10โ30 seconds. Both Ollama and MLX server keep models in memory between requests by default โ make sure you're not triggering unloads:
# Ollama: set keep-alive to permanent
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Or per-request:
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5:72b",
"keep_alive": -1,
"prompt": "hello"
}'
5. Monitor GPU Utilization
# Real-time GPU and memory stats on macOS
sudo powermetrics --samplers gpu_power -i 1000
# Check unified memory pressure
vm_stat | grep "Pages"
# Ollama model status
curl http://localhost:11434/api/ps | python3 -m json.tool
Watch for GPU utilization dropping below 80% during inference โ this indicates either a memory bandwidth bottleneck (model too large for available bandwidth) or a CPU-bound bottleneck (prefill on a very long prompt).
6. Batch Size and Parallelism
For single-user interactive use, batch size 1 is optimal โ it minimizes latency. For serving multiple concurrent users or batch processing pipelines, increase the batch size:
# MLX server: enable concurrent requests
mlx_lm.server --model mlx-community/Qwen2.5-72B-Instruct-4bit --port 8080
# llama.cpp: parallel sequences
./llama-server -m model.gguf -ngl 99 --parallel 4 --ctx-size 32768
vllm-mlx demonstrates that continuous batching at 16 concurrent requests achieves 4.3x aggregate throughput on the same hardware โ a significant win for multi-user serving scenarios.[2]
Future: Ollama MLX Backend and vllm-mlx
Two developments will significantly change the landscape for Apple Silicon LLM inference in 2026.
Ollama MLX Backend (PR #9118)
Ollama's most requested feature on Apple Silicon is an MLX backend. The PR is live at github.com/ollama/ollama/pull/9118. When complete, this will replace the llama.cpp runner with an MLX inference engine while keeping all of Ollama's API and model management layer intact.
The practical impact: Ollama users would get MLX's generation throughput without giving up ollama pull, the OpenAI API, or multi-model serving. The 38% Go wrapper overhead would likely decrease as the integration matures. To try the early version:
OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve
Expect instability โ it's early stage. But the direction is clear.
vllm-mlx: Continuous Batching for Apple Silicon
vllm-mlx is a framework built natively on MLX that brings production-grade serving capabilities to Apple Silicon:[2]
- Continuous batching: 4.3x aggregate throughput at 16 concurrent requests vs single-request serving
- Content-based prefix caching: Eliminates redundant vision encoder passes for multimodal models โ 28x speedup on repeated images
- Text throughput: Up to 525 tok/s on M4 Max
- 21โ87% higher throughput than llama.cpp across Qwen3-0.6B to Nemotron-30B
For developers running multi-user inference servers on Mac Studio hardware โ this is the path to production-grade Apple Silicon serving. It's open source and worth watching closely.
oMLX: Solving the Prefill Problem
oMLX (github.com/jundot/omlx) is an MLX inference server with a tiered KV cache that persists to SSD. On the specific benchmark (M1 Max, Qwen3.5-35B-A3B, 8K context):
- Prefill: 49s โ 1.7s (29x speedup)
- Effective throughput: 6 tok/s โ 30 tok/s
- Generation speed: matches LM Studio's MLX engine (~55 tok/s)
The SSD-backed KV cache means that repeated similar prompts (common in agentic loops with a large system prompt) hit the cache rather than recomputing. On M3 Ultra with fast NVMe, this is particularly effective.
References
- Shrivastava et al. โ Production-Grade Local LLM Inference on Apple Silicon (arXiv:2511.05502) โ Systematic benchmark of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS on M2 Ultra Mac Studio. Covers TTFT, throughput, latency percentiles, and long-context behavior. โ arxiv.org
- vllm-mlx โ Native LLM and MLLM Inference at Scale on Apple Silicon (arXiv:2601.19139) โ vllm-mlx framework with continuous batching and multimodal prefix caching. 525 tok/s on M4 Max, 4.3x aggregate throughput at 16 concurrent requests. โ arxiv.org
- Apple ML Research โ Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU โ Official Apple research post on MLX architecture, M5 Neural Accelerators, and MLX LM usage. โ machinelearning.apple.com
- Markus Schall โ Local AI with MLX on the Mac (Nov 2025) โ Practical comparison of MLX, Ollama, llama.cpp, and LM Studio on Apple Silicon. Covers model formats and deployment workflows. โ markus-schall.de
- famstack.dev โ "57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon" (Mar 2026) โ Detailed real-world benchmark on M1 Max Mac Studio. Covers MLX prefill problem, Ollama wrapper overhead (38%), and oMLX tiered KV cache results. โ famstack.dev
- r/ollama โ "Ollama is getting MLX support!" (Mar 2025) โ Community thread covering Ollama's official MLX backend work, PR #9118, and how to try the early version. โ reddit.com