The Wrong Tool for the Job
A debate broke out on X this week between @TheAhmadOsman and @mweinbach about NVIDIA's DGX Spark. Max Weinbach had been getting "insane tool calling issues" on the Spark and was told by NVIDIA to fall back to llama.cpp. Ahmad's response was blunt: that's the wrong call. The comparison being drawn between the DGX Spark and the RTX PRO 6000 wasn't valid, he argued, because the two machines have fundamentally different memory architectures, and the choice of inference engine was making things worse, not better.[1]
The exchange surfaced a confusion that's extremely common in the local AI community: the assumption that inference engines are interchangeable wrappers around the same underlying performance. They are not. The engine you choose determines whether you're using your hardware at 5% of its potential or 95%. On a multi-GPU rig, picking the wrong engine doesn't just leave performance on the table — it actively serializes work that should run in parallel.
What Is an Inference Engine?
An inference engine is software that translates a human prompt into the forward pass of a neural network and returns a coherent token stream. That sounds simple but hides an enormous amount of complexity. Every inference engine must implement:
- Model architecture support — each LLM architecture (dense transformer, MoE, etc.) requires specific code to handle attention, weight layout, and token generation
- Hardware backends — CUDA, ROCm, Metal, CPU — each with different optimization paths
- Quantization formats — GGUF, EXL2, AWQ, GPTQ, FP8, BF16 — each requiring its own kernel implementations
- Parallelism strategy — how work is split across multiple GPUs
- Batching and KV cache management — how multiple concurrent requests share compute
Because each of these layers stacks on top of the others, engines optimize for different combinations. An engine that excels at CPU offloading may be completely unequipped for GPU tensor parallelism, and vice versa. This is not a minor implementation detail: it can move your tokens-per-second ceiling by an order of magnitude.[2]
It's Always Memory Bandwidth
Before comparing engines, it helps to understand why memory bandwidth is the primary performance driver for LLM inference. In autoregressive generation — producing one token at a time — the GPU must load the model's weight matrices from VRAM into compute registers on every forward pass. For a 70B model at Q4 quantization (~40GB), the GPU reads tens of gigabytes of data per token generated. The rate at which it can do that reading is memory bandwidth.
GPU compute (TFLOPS) is largely not the bottleneck during token-by-token decoding. The arithmetic operations complete much faster than the memory transfers that feed them, so the GPU is memory-bound, not compute-bound. This is why memory bandwidth is such a strong predictor of decode speed, and why two pieces of hardware with the same VRAM but different bandwidth can produce dramatically different throughput.
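The memory-bound argument can be turned into a rough back-of-the-envelope ceiling. The sketch below is an approximation, not a benchmark: it assumes every generated token reads the full set of weights exactly once and ignores KV cache traffic, compute time, and kernel overhead. The function name `decode_ceiling_tps` is made up for illustration; the bandwidth figures are the ones quoted for the two devices in this article, and 40 GB is the 70B Q4 figure from above.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed, in tokens per second.

    Assumes each token requires one full read of the weights and that
    nothing else (KV cache, compute, overhead) costs any time.
    """
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 40.0  # ~70B parameters at Q4, per the paragraph above

devices = {
    "DGX Spark (273 GB/s)": 273.0,
    "RTX PRO 6000 (1,792 GB/s)": 1792.0,
}

for name, bandwidth in devices.items():
    ceiling = decode_ceiling_tps(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most {ceiling:.1f} tok/s")
# DGX Spark (273 GB/s): at most 6.8 tok/s
# RTX PRO 6000 (1,792 GB/s): at most 44.8 tok/s
```

Real decode speeds land below these ceilings, but the ordering and the rough size of the gap usually survive.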
DGX Spark vs RTX PRO 6000: Not a Fair Fight
Ahmad's core point in the thread: comparing DGX Spark and RTX PRO 6000 as though they're equivalent local AI machines misunderstands what each device actually is.
| Device | Memory Type | Memory Bandwidth | Memory Capacity | Architecture |
|---|---|---|---|---|
| NVIDIA DGX Spark | Unified (CPU+GPU shared) | 273 GB/s | 128GB unified | Grace Blackwell |
| RTX PRO 6000 Blackwell | Dedicated GDDR7 | 1,792 GB/s | 96GB | Blackwell GPU |
The RTX PRO 6000 has roughly 6.5× the memory bandwidth of the DGX Spark.[1] LMSYS benchmarks show the gap playing out in practice: running GPT-OSS 20B in MXFP4 format, the RTX PRO 6000 achieved 10,108 tps prefill / 215 tps decode versus the DGX Spark's 2,053 tps prefill / 49.7 tps decode, roughly 4–5× faster.[3] The measured speedup is somewhat smaller than the bandwidth ratio, but it points the same way: the higher-bandwidth card wins by a wide margin.
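The ratios in this comparison are easy to check directly. The snippet below only restates the figures quoted in the table and in the LMSYS results; the dictionary layout and the `ratio` helper are illustrative.

```python
# Figures as quoted in this article (bandwidth in GB/s, throughput in tokens/s).
bandwidth_gb_s = {"DGX Spark": 273.0, "RTX PRO 6000": 1792.0}
prefill_tps = {"DGX Spark": 2053.0, "RTX PRO 6000": 10108.0}
decode_tps = {"DGX Spark": 49.7, "RTX PRO 6000": 215.0}


def ratio(metric: dict) -> float:
    """How many times larger the RTX PRO 6000 figure is than the Spark's."""
    return metric["RTX PRO 6000"] / metric["DGX Spark"]


for label, metric in [
    ("memory bandwidth", bandwidth_gb_s),
    ("prefill", prefill_tps),
    ("decode", decode_tps),
]:
    print(f"{label}: {ratio(metric):.2f}x")
# memory bandwidth: 6.56x
# prefill: 4.92x
# decode: 4.33x
```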
The DGX Spark uses Grace Blackwell's unified memory architecture: CPU and GPU share the same memory pool, which allows very large models to run without the hard VRAM limits of discrete GPUs. But that shared pool is LPDDR5X, which has far lower bandwidth than dedicated GDDR7. The Spark's value proposition is capacity and convenience (128GB unified, runs enormous models, compact form factor), not raw inference speed. Two RTX PRO 6000 cards cost roughly $12,000 and outperform the Spark; the Spark's $3,000 price point buys simplicity and memory capacity, not throughput.[3]
llama.cpp: What It's Actually For
llama.cpp is the most popular LLM inference engine in the open source ecosystem. Its community is large, it supports the widest range of model architectures, and it's accessible to anyone with a laptop. None of that makes it the right tool for a multi-GPU server.
The core limitation: llama.cpp does not support tensor parallelism and almost certainly never will. This is a confirmed design decision, documented in a GitHub issue where the maintainers explained that tensor parallelism requires a fundamentally different execution model than llama.cpp is built around.[2] The GGUF format and the execution model llama.cpp uses are optimized for a different use case: running models that don't fit in VRAM by offloading layers to CPU RAM.
llama.cpp also does not support true batch inference. It processes requests sequentially. If you send 50 concurrent requests to a llama.cpp server, they queue up and run one at a time. On a multi-GPU server serving multiple users, this means 49 of your 50 requests are always waiting while one runs.
When llama.cpp is the right choice:
- You don't have enough VRAM to fit the model — CPU/RAM offloading is your only option
- You're running on a single consumer GPU with no multi-GPU setup
- You need support for an obscure model architecture that newer engines haven't implemented yet
- You're doing single-user interactive inference where sequential processing is fine
Ahmad's benchmark illustrates the gap starkly: on his 14× RTX 3090 AI server, CPU offloading of DeepSeek v2.5 236B BF16 via llama.cpp produced approximately 1 token per second. Switching to 8× GPUs with vLLM and batch inference produced over 800 tokens per second across 50 concurrent requests, roughly 800× the aggregate throughput on the same server.[2] Note that the second figure is summed over all 50 requests, which works out to about 16 tokens per second per request.
Tensor Parallelism: The Multi-GPU Unlock
Layer parallelism (pipeline parallelism) splits model layers across GPUs: GPU 0 runs layers 0–20, GPU 1 runs layers 21–40, etc. It's simple to implement and requires minimal inter-GPU communication — just the activation tensor passed at layer boundaries. Ollama and llama.cpp use this by default. The downside: every token still passes through GPUs sequentially, so adding GPUs reduces VRAM pressure but doesn't proportionally reduce latency.
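A minimal sketch of the layer-splitting idea follows, using plain Python functions as stand-in "layers". The names `partition_layers` and `run_pipeline` are hypothetical and are not taken from llama.cpp or Ollama. The sketch only illustrates that a single token visits the GPUs one after another.

```python
def partition_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Assign contiguous layer ranges to GPUs, as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    ranges, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def run_pipeline(x, layers, assignment):
    """Push one token through every stage in order.

    Returns the output and the order in which GPUs were visited.
    """
    visited = []
    for gpu, layer_ids in enumerate(assignment):
        visited.append(gpu)  # for this token, only this GPU is busy right now
        for i in layer_ids:
            x = layers[i](x)
    return x, visited


# 80 toy layers that each add 1, split across 4 simulated GPUs.
layers = [lambda v: v + 1 for _ in range(80)]
assignment = partition_layers(len(layers), num_gpus=4)
output, visited = run_pipeline(0, layers, assignment)

print(assignment[0], assignment[1])  # range(0, 20) range(20, 40)
print(output, visited)               # 80 [0, 1, 2, 3]
```

Each GPU holds only a quarter of the layers, so memory pressure drops, but the token still makes 80 sequential layer calls.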
Tensor parallelism does something fundamentally different: it splits each individual matrix multiplication across all GPUs simultaneously. Every layer runs on all GPUs in parallel, each handling a horizontal slice of the weight matrices and attention heads. After each layer, an all-reduce operation synchronizes the results across GPUs. The result: adding GPUs directly reduces the time per token, not just the memory requirement per GPU.
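To make the split-then-all-reduce pattern concrete, here is a toy version in pure Python with two simulated GPUs. It shows one common way to shard a single matrix-vector product (splitting along the input dimension, so each device produces a partial sum); real engines use several sharding layouts and run the reduction over NVLink or PCIe. All function names here are illustrative.

```python
def matvec(weights, x):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def shard_inputs(weights, x, num_gpus):
    """Give each simulated GPU a contiguous slice of the input dimension.

    Assumes the number of columns divides evenly by num_gpus.
    """
    step = len(x) // num_gpus
    shards = []
    for g in range(num_gpus):
        lo, hi = g * step, (g + 1) * step
        shards.append(([row[lo:hi] for row in weights], x[lo:hi]))
    return shards


def all_reduce_sum(partials):
    """Element-wise sum of every GPU's partial output vector."""
    return [sum(values) for values in zip(*partials)]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
x = [1, 0, 2, 1]

# Each "GPU" works on its own slice; in a real system these run concurrently.
partials = [matvec(w_shard, x_shard) for w_shard, x_shard in shard_inputs(W, x, 2)]
y_parallel = all_reduce_sum(partials)

print(partials)    # [[1, 5, 9], [10, 22, 34]]
print(y_parallel)  # [11, 27, 43]
assert y_parallel == matvec(W, x)
```

Each device does half the multiply-adds and reads half the weights, which is why time per token can fall as GPUs are added. The price is the all-reduce after every sharded layer.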
The tradeoff: tensor parallelism requires high-bandwidth inter-GPU communication (~20 Gbps sustained). On systems with NVLink between GPUs, this is fast. On systems where GPUs communicate via PCIe only, cross-card latency becomes a bottleneck — though for layer-level batch inference it's still far better than serial processing.
vLLM: The Multi-GPU Standard for Throughput
vLLM was built from the ground up for high-throughput multi-GPU inference. Its key contributions:
- PagedAttention — a KV cache management system inspired by OS virtual memory paging. Eliminates memory fragmentation from variable-length sequences, allowing much more efficient use of VRAM across concurrent requests.
- Continuous batching — instead of waiting for a batch to complete before starting the next, vLLM interleaves requests dynamically, keeping GPUs saturated at all times.
- Tensor parallelism — native support for splitting work across multiple GPUs, with proper all-reduce communication.
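The paging idea behind the first bullet can be sketched with a toy block allocator. This is not vLLM's implementation: the class name, the block size of 16 tokens, and the method names are all assumptions made for illustration. The point is only that each sequence takes fixed-size blocks on demand instead of reserving one contiguous region sized for its maximum length.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class PagedKVCache:
    """Toy KV cache that hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> block ids
        self.lengths: dict[str, int] = {}             # sequence id -> token count

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * BLOCK_SIZE:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block

blocks_used = sum(len(table) for table in cache.block_tables.values())
print(blocks_used, len(cache.free_blocks))  # 3 5

cache.release("a")
print(len(cache.free_blocks))  # 7
```

Because blocks need not be contiguous, memory freed by a finished request can be reused immediately by any other request, which is what lets continuous batching keep the GPU full.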
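Ahmad's benchmark: 50 concurrent requests, 2,000 tokens each, on 8× RTX 3090 GPUs, completed in 2 minutes 29 seconds. The source reports this as 800+ output tokens per second of aggregate throughput;[2] note that 50 × 2,000 tokens in 149 seconds averages closer to 670 tokens per second, so one of the quoted figures is presumably rounded. Either way, vLLM is the standard choice for serving LLMs under load on multi-GPU hardware.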
vLLM works best when:
- You have multiple GPUs and need tensor parallelism
- You're serving concurrent requests (APIs, multi-user environments)
- Your models fit entirely in GPU VRAM (no CPU offload needed)
- You need batch inference throughput over individual response latency
ExLlamaV2: Tensor Parallelism When VRAM Is Constrained
ExLlamaV2 introduced its EXL2 quantization format — a calibrated quantization that allows per-layer mixed precision, squeezing better quality out of lower average bit depths than GPTQ or GGUF can achieve at the same file size. An EXL2 model at 4 bpw (bits per weight) typically outperforms a GGUF Q4_K_M of the same model on perplexity benchmarks.
ExLlamaV2 added tensor parallelism support in 2024, making it a viable alternative to vLLM when VRAM is the limiting factor. Because EXL2 can quantize more aggressively at a given quality level, it can fit larger models onto a GPU cluster where vLLM would need more VRAM. The tradeoff: ExLlamaV2's batch inference implementation is less mature than vLLM's, and its ecosystem is smaller.
ExLlamaV2 works best when:
- You need tensor parallelism but VRAM is constrained
- You want maximum quality at a given quantization size (EXL2 vs GGUF)
- Single-user or low-concurrency serving is acceptable
SGLang: Speed and Structured Outputs
SGLang (Structured Generation Language) started as a framework for structured LLM programs and has evolved into a high-performance inference engine in its own right. It matches or exceeds vLLM on throughput for many workloads and adds native support for constrained generation — JSON schemas, regex patterns, grammar-based outputs — without the overhead of post-processing hacks.
Ahmad and others in the thread recommend vLLM or SGLang as equally valid choices for GPU inference.[1] SGLang's structured output support makes it particularly well-suited for agentic workloads where tool calling requires structured JSON — though tool calling reliability still depends heavily on the model architecture and how well the inference engine implements that model's specific tool calling format.
Tool Calling Is Still Broken (And Here's Why)
Max Weinbach's original issue, "insane tool calling issues" on DGX Spark, surfaces a real problem that goes beyond the inference engine debate. Tool calling is not a universal standard that works the same everywhere. Every model architecture implements function calling differently, and every inference engine must then support each model's format correctly. The result is a matrix of (model × engine) compatibility where many combinations are partially broken or inconsistent.[1]
From Ahmad's follow-up: "Each inference engine must implement the model architecture and its tool calling. Getting a model to run correctly isn't trivial, many parts are still inconsistent/broken. Set the right baseline: match the GPUs, the inference engine for those GPUs, and the right inference engine for the model."
This is the practical reality for anyone building AI agents on local hardware. A tool that works on llama.cpp may fail on vLLM not because vLLM is worse, but because the tool calling schema implementation differs between them. The safest approach: test tool calling explicitly on your target stack, don't assume portability, and factor engine support for your specific model's architecture into your choice.
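One way to "test tool calling explicitly" is a small smoke check run against each model and engine pair. The sketch below assumes the engine exposes an OpenAI-compatible chat endpoint and that you already have the assistant message as a dictionary, with `tool_calls[i].function.name` and a JSON-string `arguments` field. The helper name `check_tool_call` and the sample tool `get_weather` are hypothetical.

```python
import json


def check_tool_call(message: dict, expected_name: str, required_args: set) -> list:
    """Return a list of problems found in one assistant message (empty if OK)."""
    calls = message.get("tool_calls") or []
    if not calls:
        return ["no tool_calls in message (model may have answered in plain text)"]

    problems = []
    function = calls[0].get("function", {})
    if function.get("name") != expected_name:
        problems.append(f"unexpected tool name: {function.get('name')!r}")

    try:
        args = json.loads(function.get("arguments", ""))
    except (json.JSONDecodeError, TypeError):
        return problems + ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return problems + ["arguments are not a JSON object"]

    missing = required_args - set(args)
    if missing:
        problems.append(f"missing arguments: {sorted(missing)}")
    return problems


good = {"tool_calls": [{"function": {"name": "get_weather",
                                     "arguments": '{"city": "Paris"}'}}]}
malformed = {"tool_calls": [{"function": {"name": "get_weather",
                                          "arguments": "{city: Paris}"}}]}
plain_text = {"content": "The weather in Paris is sunny."}

print(check_tool_call(good, "get_weather", {"city"}))       # []
print(check_tool_call(malformed, "get_weather", {"city"}))  # ['arguments are not valid JSON']
print(check_tool_call(plain_text, "get_weather", {"city"}))
```

Running the same prompts and the same check against each candidate stack gives a like-for-like view of which combinations actually produce usable tool calls.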
Which Engine to Use When: A Decision Framework
| Situation | Recommended Engine | Why |
|---|---|---|
| Model doesn't fit in VRAM, need CPU offload | llama.cpp | Only engine with solid CPU offload + GPU hybrid |
| Single GPU, single user, interactive | Ollama / llama.cpp | Simplicity wins; tensor parallelism not needed |
| Multi-GPU, concurrent requests, model fits in VRAM | vLLM | Best throughput, PagedAttention, continuous batching |
| Multi-GPU, VRAM tight, need to quantize further | ExLlamaV2 | EXL2 quantization + tensor parallelism |
| Agentic workloads with structured JSON outputs | SGLang | Native structured generation, competitive throughput |
| NVIDIA hardware, production, latency-critical | TensorRT-LLM | Compiled kernels, hardware-optimized (when mature) |
| DGX Spark, today | vLLM or SGLang | TensorRT-LLM not yet faster in practice on Spark |
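The table above can also be read as a rough decision procedure. The function below is one possible encoding, not an authoritative rule set: the parameter names and the order in which conditions are checked are assumptions, and the TensorRT-LLM and DGX Spark rows are left out because they depend on hardware details the function does not model.

```python
def recommend_engine(
    *,
    fits_in_vram: bool,
    num_gpus: int,
    concurrent_users: int,
    needs_structured_json: bool = False,
    vram_tight: bool = False,
) -> str:
    """Map a workload description onto the rows of the decision table."""
    if not fits_in_vram:
        return "llama.cpp"           # CPU offload is the only option
    if num_gpus == 1 and concurrent_users <= 1:
        return "Ollama / llama.cpp"  # simplicity wins on a single GPU
    if needs_structured_json:
        return "SGLang"              # native structured generation
    if vram_tight:
        return "ExLlamaV2"           # EXL2 quantization + tensor parallelism
    return "vLLM"                    # default for multi-GPU throughput


print(recommend_engine(fits_in_vram=False, num_gpus=2, concurrent_users=1))
# llama.cpp
print(recommend_engine(fits_in_vram=True, num_gpus=8, concurrent_users=50))
# vLLM
print(recommend_engine(fits_in_vram=True, num_gpus=4, concurrent_users=10,
                       needs_structured_json=True))
# SGLang
```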
The meta-rule: match the engine to the hardware and the workload, not to what's most popular or what NVIDIA's support page recommends for a different use case. llama.cpp's popularity comes from its accessibility, not its performance ceiling. On a multi-GPU rig with models resident in VRAM, it leaves most of that hardware idle.
References
- [1] @TheAhmadOsman on X. DGX Spark vs RTX PRO 6000 inference engine thread. March 19, 2026. The original thread triggering this analysis: memory bandwidth comparison, llama.cpp critique, TensorRT-LLM context.
- [2] Ahmad Osman. "Stop Wasting Your Multi-GPU Setup With llama.cpp". ahmadosman.com, February 7, 2025. Deep dive into inference engine selection with benchmarks from a 14× RTX 3090 server: 1 tok/s CPU offload vs 800+ tok/s vLLM batch inference.
- [3] LMSYS Org. "NVIDIA DGX Spark In-Depth Review". lmsys.org, October 2025. Benchmark data: RTX PRO 6000 at 10,108/215 tps vs DGX Spark at 2,053/49.7 tps on GPT-OSS 20B MXFP4.
- [4] LocalLLaMA. RTX Pro 6000 vs DGX Spark benchmark visualization. Reddit, October 2025. Community analysis confirming 6.5× bandwidth advantage maps to ~6× performance advantage.
- [5] llama.cpp GitHub. Tensor Parallelism discussion. Issue #9086, 2024. Maintainers confirm tensor parallelism is not planned for llama.cpp due to architectural constraints.