
You've got a Mac Studio M3 Ultra with 256GB of unified memory. You fire up Ollama, pull a 70B model, and watch it serve tokens at 18 tok/s. Meanwhile you read a blog post claiming 57 tok/s on the same hardware. What's going on?

The answer involves three intersecting layers: Apple's MLX framework, the model format you're using (GGUF vs MLX), and the inference server sitting in front of it all. Understanding how these interact is the difference between a sluggish local LLM and one that actually delivers.

This guide covers all three layers: how MLX works at the hardware level, where Ollama's performance ceiling comes from, what the benchmark numbers actually mean for real workloads, and a practical optimization checklist for Mac Studio users.

What Is MLX and Why Apple Built It

MLX is an open-source array computation framework developed by Apple, released in late 2023, and designed from the ground up for Apple Silicon hardware. It is not a general-purpose ML framework like PyTorch or TensorFlow. It's a hardware-native framework that treats Apple Silicon's unique architecture as a first-class constraint rather than an afterthought.[3]

To understand why Apple built MLX, you need to understand what's wrong with the alternatives:

MLX fills the gap. Its key design principles:

For LLM inference specifically, the package is mlx-lm. Install it with pip install mlx-lm and you can run most HuggingFace models directly:

# Chat with a local MLX model
mlx_lm.chat --model mlx-community/Qwen2.5-72B-Instruct-4bit

# Convert and quantize a HuggingFace model to MLX format
mlx_lm.convert --hf-path Qwen/Qwen2.5-72B-Instruct -q --upload-repo mlx-community/...

The mlx-community organization on HuggingFace maintains pre-converted, pre-quantized versions of most popular models. This is where most users start.

โ„น๏ธ M5 Neural Accelerators The latest M5 chip introduced dedicated Neural Accelerators โ€” hardware units for matrix multiplication that are critical for the prefill phase (processing input tokens). MLX natively targets these units when available, delivering measurably faster time-to-first-token on M5 hardware compared to M3/M4.[3]

How Unified Memory Changes Everything

Unified memory is the single most important architectural difference between Apple Silicon and every other platform you might run LLMs on. Understanding it physically, not just as a marketing phrase, matters for optimization.

The Traditional GPU Memory Model

On a typical x86 + NVIDIA system, GPU and CPU have separate memory pools connected by PCIe. A 72B model at 4-bit quantization requires roughly 36GB of weights. To run it on an NVIDIA GPU with 48GB VRAM, the weights must be:

  1. Loaded from disk into system RAM
  2. Copied over PCIe bus to VRAM (PCIe 5.0 x16 peaks at ~64 GB/s, typically 20–40 GB/s effective)
  3. Accessed by the GPU during inference

For models that exceed VRAM, weights spill to system RAM and are pulled over PCIe on demand. This is catastrophically slow for LLM inference: you'll see sub-1 tok/s on "offloaded" NVIDIA setups.

The Apple Silicon Model

On Apple Silicon, there is one memory pool. The M3 Ultra's 256GB is accessible to both the CPU cores and the GPU cores at the same bandwidth, up to 800 GB/s on the Ultra. The GPU is not a separate device; it's a separate set of compute units in the same package, sharing the same memory controller.

The practical implications:

โš ๏ธ Memory Pressure Matters macOS dynamically allocates unified memory between the OS, apps, and ML workloads. If your system is under memory pressure (swap active), inference performance degrades sharply. On a 256GB Mac Studio running large models, close memory-hungry apps and check vm_stat for swap activity before benchmarking.
Hardware Memory Bandwidth Max Model Size CPUโ†”GPU Transfer
M3 Ultra (256GB)800 GB/s~230GB (4-bit 72B = 36GB)Zero-copy
M3 Max (128GB)400 GB/s~115GBZero-copy
M4 Max (128GB)546 GB/s~115GBZero-copy
NVIDIA RTX 4090 (24GB)1,008 GB/s24GB (no overflow)PCIe 4.0 (32 GB/s)
NVIDIA A100 (80GB)2,000 GB/s80GB (no overflow)PCIe 4.0/NVLink

The RTX 4090 wins on raw bandwidth and absolute throughput for models that fit in 24GB. The Mac Studio wins for model capacity and the ability to run large models without performance cliffs. The M3 Ultra at 800 GB/s is genuinely competitive for token generation on 70B+ models.

The Model Format Battle: MLX vs GGUF

When people talk about "using MLX" vs "using Ollama," they're often conflating two different things: the inference engine and the model format. They're related but separable.

GGUF (llama.cpp's format)

GGUF is the cross-platform model format used by llama.cpp, Ollama, LM Studio's llama.cpp engine, and most other inference servers. Key properties:

MLX Format (.safetensors + .npz)

MLX models use a combination of SafeTensors and compressed NumPy (.npz) files. They're not a single-file format; they're a directory containing weights, config, and tokenizer files in a layout that MLX's runtime loads directly.

When Each Format Wins

Scenario | Winner | Why
Max generation throughput (tok/s) | MLX 4-bit | Hardware-tuned kernels, better memory bandwidth utilization
Short prompts, quick responses | GGUF Q4_K_M | Lower TTFT, faster prefill for typical prompt sizes
Long-context agentic workloads (>4K tokens) | GGUF + oMLX | MLX prefill latency grows badly with context; GGUF or oMLX cache handles this better
Model availability | GGUF | Wider selection, faster community releases
Portability across machines | GGUF | Single file, runs everywhere
Quality at same size | Roughly equal | Both Q4/4-bit quantizations produce similar output quality

💡 Tip: Start with GGUF Q4_K_M. For most workloads, GGUF Q4_K_M through Ollama or LM Studio gives the best balance of throughput, TTFT, and availability. Switch to MLX only when you specifically need maximum generation throughput for short-context, generation-heavy tasks.

Ollama on Apple Silicon: Strengths and Hidden Costs

Ollama is the default recommendation for running local LLMs, and for good reason. Its developer ergonomics are exceptional: a single CLI, automatic model management, an OpenAI-compatible REST API, and zero configuration. For most use cases, it just works.

But Ollama has a specific performance profile on Apple Silicon that you need to understand before you benchmark it against alternatives.

How Ollama Works on Mac

Ollama is written in Go. On Apple Silicon, it shells out to llama.cpp with Metal GPU acceleration enabled. The pipeline is:

Ollama API request (Go)
    → llama.cpp runner process
        → Metal GPU backend
            → Apple Silicon GPU (M3 Ultra cores)

This is important: Ollama is a wrapper around llama.cpp, not a native inference engine. Every API request goes through Go's HTTP server, gets handed to a C++ subprocess, which manages the Metal compute context.

The 38% Go Wrapper Tax

Real-world benchmarks on an M1 Max Mac Studio (Qwen3.5-35B-A3B, Q4_K_M) showed a significant gap between Ollama and LM Studio running the same GGUF model with the same llama.cpp engine:[5]

That's a 38% throughput penalty for Ollama relative to LM Studio running the same llama.cpp engine, attributed to the Go wrapper's overhead and process isolation model. The inference engine is identical; only the wrapper differs.
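The 38% figure is consistent with the generation speeds this article reports for the same benchmark (29 tok/s for LM Studio's GGUF engine, 18 tok/s for Ollama). A quick check of the arithmetic, with a function name chosen here purely for illustration:

```python
def throughput_penalty(baseline_tps: float, observed_tps: float) -> float:
    """Fraction of baseline throughput lost, e.g. 0.38 for a 38% penalty."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return 1.0 - observed_tps / baseline_tps


# Generation speeds from this article's benchmark table:
# LM Studio (GGUF Q4_K_M) as the baseline, Ollama (GGUF) as the observed value.
penalty = throughput_penalty(baseline_tps=29.0, observed_tps=18.0)
print(f"{penalty:.0%}")  # prints 38%
```

The same function works for any pair of engines, as long as both numbers were measured on the same model, quantization, and prompt.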

This overhead comes from several sources:

Where Ollama Still Wins

Despite the throughput penalty, Ollama has real advantages:

โ„น๏ธ Ollama MLX Backend: Coming Soon Ollama has officially started work on an MLX backend (PR #9118). When enabled with OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve, it will use Apple's MLX inference engine instead of llama.cpp. This could eliminate the GGUF-to-MLX format gap and give Ollama users native MLX throughput while keeping all of Ollama's ergonomic advantages. As of early 2026, the PR is in active development.[6]

Benchmark Reality Check

The most dangerous number in local LLM benchmarking is the headline tok/s figure. Here's why it lies, and what to measure instead.

Generation Speed vs. Time-to-First-Token

LLM inference has two distinct phases:

  1. Prefill: Process the input prompt. All input tokens are computed in parallel, which makes this the compute-intensive phase. Time: grows at least linearly with prompt length. Output: the KV cache + first output token.
  2. Decode (generation): Generate output tokens one at a time, each conditioned on previous tokens. This is memory-bandwidth-bound: the bottleneck is how fast you can load the model weights from memory for each new token.
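Because decode is bound by memory bandwidth, a rough ceiling on generation speed follows from dividing bandwidth by the bytes that must be read per token. The sketch below assumes a dense model whose full weights are streamed once per token, and it ignores KV cache reads and kernel inefficiency, so real numbers land below it. Mixture-of-experts models read only their active parameters and can exceed the dense estimate. The function name is illustrative.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode tokens/sec if each token streams all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# M3 Ultra (800 GB/s) with a dense 72B model at 4-bit (~36 GB of weights).
print(round(decode_ceiling_tps(800, 36), 1))  # prints 22.2

# The same model on a 400 GB/s part has half the ceiling.
print(round(decode_ceiling_tps(400, 36), 1))  # prints 11.1
```

An estimate like this is useful as a sanity check: a measured decode speed far below the ceiling points at the serving stack rather than the hardware.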

When benchmarks report "57 tok/s," they're measuring decode speed on short prompts. The prefill latency is often not reported, but it completely dominates real-world experience for anything longer than a few hundred tokens.

The MLX Prefill Problem

Empirical benchmarks on M1 Max Mac Studio with Qwen3.5-35B-A3B reveal a stark pattern:[5]

Engine | Generation Speed | Prefill (1K tokens) | Total (1K in, 200 out)
MLX 4-bit (LM Studio) | 57 tok/s | ~15–20s | ~19s
GGUF Q4_K_M (LM Studio) | 29 tok/s | ~3–5s | ~11s
Ollama (GGUF) | 18 tok/s | ~3–5s | ~16s
oMLX (tiered KV cache) | ~55 tok/s | ~1.7s at 8K | ~5s

MLX wins generation. GGUF wins end-to-end on longer contexts. oMLX changes the equation entirely.
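The totals in the table can be approximated as prefill time plus output tokens divided by generation speed. The sketch below uses the midpoint of each reported prefill range, which is an assumption made here for illustration, so its results only roughly track the table's rounded totals.

```python
def end_to_end_seconds(prefill_s: float, output_tokens: int, decode_tps: float) -> float:
    """Total request latency: prefill time plus time to decode the output."""
    return prefill_s + output_tokens / decode_tps


# (generation tok/s, assumed prefill seconds) for a 1K-token prompt.
# Prefill values are midpoints of the ranges in the table, not measurements.
engines = {
    "MLX 4-bit (LM Studio)": (57.0, 17.5),
    "GGUF Q4_K_M (LM Studio)": (29.0, 4.0),
    "Ollama (GGUF)": (18.0, 4.0),
}

for name, (decode_tps, prefill_s) in engines.items():
    total = end_to_end_seconds(prefill_s, output_tokens=200, decode_tps=decode_tps)
    print(f"{name}: {total:.1f}s")
```

Changing output_tokens shows where the crossover sits: the longer the generated output relative to the prompt, the more the faster decoder recovers its prefill deficit.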

What Actually Matters for Your Workload

For interactive chat (short prompts, conversational context): Generation speed matters most. MLX or GGUF Q4_K_M through LM Studio are both good choices. Ollama is fine if ergonomics matter more than the last 40% of throughput.

For agentic workloads (tool use, JSON history, long system prompts): Prefill latency dominates. A prompt that takes 15–20 seconds to prefill (MLX, as measured above at 1K tokens) versus 3–5 seconds (GGUF) makes MLX feel broken even though its tok/s number is higher. Use GGUF Q4_K_M or oMLX for this use case.

For batch processing (document classification, summarization pipelines): Total throughput matters. MLX may win here: the high generation speed compensates for prefill cost when running many short-output tasks. Use the MLX server directly (mlx_lm.server) rather than going through a wrapper.

โš ๏ธ Benchmark Your Actual Workload The specific model (Qwen3.5-35B-A3B has known issues with MLX due to hybrid attention and bf16 weights on M1), hardware generation, and context window size all shift which engine wins. Don't trust generic benchmarks โ€” measure your prompt distribution on your hardware.

When to Use MLX vs Ollama vs llama.cpp

This is a decision framework, not a ranking. Each option wins in different scenarios.

Use MLX Directly (mlx_lm.server)

Use Ollama

Use llama.cpp Directly (or LM Studio)

Use oMLX

Optimization Guide for Mac Studio

Practical steps to maximize LLM performance on a Mac Studio M3 Ultra (256GB).

1. Choose the Right Quantization Level

Quantization reduces model size at the cost of some quality. On Apple Silicon, the memory bandwidth benefit of smaller models often outweighs quality loss up to a point:

Quantization | Size (70B model) | Quality Impact | Recommendation
F16 (16-bit) | ~140GB | None | Use if RAM allows; baseline quality
Q8_0 | ~74GB | Negligible | Good quality/size tradeoff, MLX 8-bit
Q4_K_M (GGUF) | ~40GB | Small | Best practical choice for most use cases
MLX 4-bit | ~35GB | Small | Best generation speed on Apple Silicon
Q3_K_M | ~30GB | Moderate | Use only if you need to fit a larger model
Q2_K | ~25GB | Significant | Avoid; quality degradation is too high

With 256GB on your Mac Studio, you can comfortably fit a 70B model at Q8_0 (74GB) or even a 200B+ model at 4-bit. Don't over-quantize just to save memory you have.
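The sizes in the table follow roughly from parameter count times bits per weight. The sketch below computes the nominal figure only; real files come out somewhat larger (the table lists ~74GB for Q8_0 and ~40GB for Q4_K_M) because quantized formats also store scale data alongside the weights. The function name is illustrative.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight size in GB (10^9 bytes), ignoring quantization scales and metadata."""
    return params_billion * bits_per_weight / 8


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"70B at {label}: ~{weight_footprint_gb(70, bits):.0f} GB")

# The 72B example used earlier in this article: 4-bit weights come to 36 GB.
print(weight_footprint_gb(72, 4))
```

Subtracting this figure, plus the KV cache and some headroom for macOS, from total unified memory gives a quick answer to whether a given model and quantization will fit.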

2. Maximize GPU Layer Offloading (GGUF/Ollama)

For llama.cpp and Ollama, the -ngl (number of GPU layers) parameter determines how much of the model runs on the Metal GPU vs CPU. On Mac Studio, set this to max:

# llama.cpp: offload all layers to GPU
./llama-cli -m model.gguf -ngl 99 -p "your prompt"

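# Ollama: num_gpu is a model parameter, set in a Modelfile
FROM qwen2.5:72b
PARAMETER num_gpu 999

# Or per request through the API options:
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:72b", "prompt": "hello", "options": {"num_gpu": 999}}'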

On a Mac Studio with 256GB unified memory, there's no reason to have any layers on CPU: everything fits in memory with bandwidth to spare.

3. Context Window and KV Cache

The KV cache stores attention keys/values for all tokens in the current context. Its memory footprint is:

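KV cache size (bytes) = 2 × layers × kv_heads × head_dim × context_length × bytes_per_element

The factor 2 covers keys and values, kv_heads is the number of key/value heads (smaller than the attention head count in grouped-query models), and bytes_per_element is 2 for a 16-bit cache.

For a 72B model at 4-bit with a 32K context window, the KV cache alone can consume 8–16GB. Larger context = more memory pressure = slower inference.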
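A worked instance of the KV cache formula, as a sketch. The head count here means key/value heads, which is what the cache stores. The model shape (80 layers, 8 key/value heads, head dimension 128) and the 16-bit cache precision are assumptions chosen to resemble a 72B-class model with grouped-query attention; check your model's config for the real values.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_length: int,
    bytes_per_element: int = 2,  # 2 bytes = 16-bit cache (assumed)
) -> int:
    """Bytes needed to hold keys and values for every layer and every cached token."""
    return 2 * layers * kv_heads * head_dim * context_length * bytes_per_element


# Assumed 72B-class shape with a 32K context window.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_length=32_768)
print(f"{size / 2**30:.1f} GiB")  # prints 10.0 GiB
```

The cache grows linearly with context length, so doubling the window to 64K doubles this figure, and each concurrent sequence needs its own cache.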

Practical guidelines:

4. Keep Models Loaded

Cold-start model loading on a 70B model can take 10–30 seconds. Both Ollama and MLX server keep models in memory between requests, but Ollama unloads an idle model after five minutes by default, so make sure you're not triggering unloads:

# Ollama: set keep-alive to permanent
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Or per-request:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:72b",
  "keep_alive": -1,
  "prompt": "hello"
}'

5. Monitor GPU Utilization

# Real-time GPU and memory stats on macOS
sudo powermetrics --samplers gpu_power -i 1000

# Check unified memory pressure
vm_stat | grep "Pages"

# Ollama model status
curl http://localhost:11434/api/ps | python3 -m json.tool

Watch for GPU utilization dropping below 80% during inference: this indicates either a memory bandwidth bottleneck (model too large for available bandwidth) or a CPU-bound bottleneck (prefill on a very long prompt).
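To check for swap activity programmatically, the vm_stat counters can be parsed into a dictionary. This is a minimal sketch: the sample text below is illustrative rather than captured from a real machine, and parse_vm_stat is a hypothetical helper name. The counters are cumulative since boot, so compare two samples taken before and after a benchmark run instead of reading a single value.

```python
import re


def parse_vm_stat(text: str) -> dict:
    """Map each 'Name:   123.' line of vm_stat output to an integer counter."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$', line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


# Illustrative sample in the shape vm_stat prints; values are made up.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               51234.
Pages active:                            912345.
Pageouts:                                     0.
Swapins:                                      0.
Swapouts:                                  1024.
"""

stats = parse_vm_stat(SAMPLE)
print("swap-outs since boot:", stats.get("Swapouts", 0))

# On macOS, live output could be read with:
#   import subprocess
#   text = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
```

If the Swapouts counter rises between the two samples, the run was under memory pressure and its tok/s numbers should not be trusted.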

6. Batch Size and Parallelism

For single-user interactive use, batch size 1 is optimal because it minimizes latency. For serving multiple concurrent users or batch processing pipelines, increase the batch size:

# MLX server: enable concurrent requests
mlx_lm.server --model mlx-community/Qwen2.5-72B-Instruct-4bit --port 8080

# llama.cpp: parallel sequences
./llama-server -m model.gguf -ngl 99 --parallel 4 --ctx-size 32768

vllm-mlx demonstrates that continuous batching at 16 concurrent requests achieves 4.3x aggregate throughput on the same hardware, a significant win for multi-user serving scenarios.[2]

Future: Ollama MLX Backend and vllm-mlx

Two developments will significantly change the landscape for Apple Silicon LLM inference in 2026.

Ollama MLX Backend (PR #9118)

Ollama's most requested feature on Apple Silicon is an MLX backend. The PR is live at github.com/ollama/ollama/pull/9118. When complete, this will replace the llama.cpp runner with an MLX inference engine while keeping all of Ollama's API and model management layer intact.

The practical impact: Ollama users would get MLX's generation throughput without giving up ollama pull, the OpenAI API, or multi-model serving. The 38% Go wrapper overhead would likely decrease as the integration matures. To try the early version:

OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve

Expect instability; it's early stage. But the direction is clear.

vllm-mlx: Continuous Batching for Apple Silicon

vllm-mlx is a framework built natively on MLX that brings production-grade serving capabilities to Apple Silicon:[2]

For developers running multi-user inference servers on Mac Studio hardware, this is the path to production-grade Apple Silicon serving. It's open source and worth watching closely.

oMLX: Solving the Prefill Problem

oMLX (github.com/jundot/omlx) is an MLX inference server with a tiered KV cache that persists to SSD. On the specific benchmark (M1 Max, Qwen3.5-35B-A3B, 8K context):

The SSD-backed KV cache means that repeated similar prompts (common in agentic loops with a large system prompt) hit the cache rather than recomputing. On M3 Ultra with fast NVMe, this is particularly effective.

References

  1. Shrivastava et al., Production-Grade Local LLM Inference on Apple Silicon (arXiv:2511.05502). Systematic benchmark of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS on M2 Ultra Mac Studio. Covers TTFT, throughput, latency percentiles, and long-context behavior. arxiv.org
  2. vllm-mlx, Native LLM and MLLM Inference at Scale on Apple Silicon (arXiv:2601.19139). vllm-mlx framework with continuous batching and multimodal prefix caching. 525 tok/s on M4 Max, 4.3x aggregate throughput at 16 concurrent requests. arxiv.org
  3. Apple ML Research, Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU. Official Apple research post on MLX architecture, M5 Neural Accelerators, and MLX LM usage. machinelearning.apple.com
  4. Markus Schall, Local AI with MLX on the Mac (Nov 2025). Practical comparison of MLX, Ollama, llama.cpp, and LM Studio on Apple Silicon. Covers model formats and deployment workflows. markus-schall.de
  5. famstack.dev, "57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp on Apple Silicon" (Mar 2026). Detailed real-world benchmark on M1 Max Mac Studio. Covers MLX prefill problem, Ollama wrapper overhead (38%), and oMLX tiered KV cache results. famstack.dev
  6. r/ollama, "Ollama is getting MLX support!" (Mar 2025). Community thread covering Ollama's official MLX backend work, PR #9118, and how to try the early version. reddit.com