The 90% Problem Nobody Talks About
A few weeks ago, the Unsloth team posted a finding that should have sent shockwaves through the local AI community: Claude Code, when used with locally-served models, was making inference 90% slower than it should be. Not because of model size. Not because of quantization. Because of a single, fixable behavior in how Claude Code formats its messages.
The root cause is KV cache invalidation — and while the Claude Code case is a dramatic example, it's one instance of a much broader problem. If you're running agentic workloads on local models, there's a very good chance you're leaving most of your hardware's performance on the table right now.
This piece explains why.
What KV Cache Actually Does
To understand why KV cache invalidation matters so much, you need to understand the two phases of LLM inference.
When you send a prompt to a model, inference starts with the Prefill phase. The entire input is tokenized, and the model processes all input tokens in parallel — computing attention scores, feed-forward activations, and crucially, the Key and Value tensors for every token in the context. This is compute-intensive but fast, because GPUs excel at parallel matrix operations.
Then comes the Decode phase. The model generates output one token at a time. This is fundamentally sequential — each new token depends on all previous tokens, so you can't parallelize it. Each decode step requires a forward pass through the entire model.
Here's the problem: without caching, each decode step would need to recompute the Key and Value tensors for every token in the entire context — including the original prompt. If your system prompt is 5,000 tokens and you've generated 500 tokens so far, step 501 would recompute attention for all 5,500 tokens just to produce one more output token.
KV cache solves this by storing the Key and Value tensors computed during prefill and each decode step. When the next token is generated, only the K/V tensors for the new token need to be computed. The rest are retrieved from cache. This reduces the decode phase from O(N²) total compute to O(N).
O(N) vs O(N²): The Real Cost in Numbers
The complexity difference isn't abstract. Let's make it concrete.
Assume you're running Qwen3.5-35B-A3B at 90 tokens/sec on an RTX 3090. You have a 10,000-token context (a typical agentic session with system prompt, tool schemas, and conversation history). You're on turn 50 of an agentic loop.
| Scenario | Tokens recomputed per turn | Time per turn @90 tok/s |
|---|---|---|
| With working KV cache | ~200 (new tokens only) | ~2.2 seconds |
| Cache invalidated every turn | ~10,000 (full context) | ~111 seconds |
| Slowdown factor | — | ~50x |
Real-world slowdown depends on context length and turn count. At short contexts the gap is small. At long agentic sessions it becomes catastrophic. The 90% figure from Unsloth was measured at typical agentic context lengths — and it's conservative for very long sessions.
The underlying bottleneck in the decode phase is memory bandwidth, not compute. Every forward pass loads model weights from VRAM into GPU compute cores. With KV cache, you also need to load the cached K/V tensors — but that's fast. Without KV cache, you need to recompute them from scratch, which burns both memory bandwidth and compute cycles.
Why Agentic Systems Suffer Most
KV cache benefits are maximized when the prefix of the context is stable across multiple requests. In a traditional single-turn chat application, each request is independent — no prefix to cache. But agentic systems are exactly the opposite:
- Long, stable system prompts — tool schemas, personality instructions, memory context. These can be 2,000–8,000 tokens that never change.
- Multi-turn conversation — each turn extends the context by a small amount. The prefix grows, but it's the same prefix from turn N to turn N+1.
- Tool call results — execution output appended to context. Again, stable prefix, small new content.
- Iterative refinement loops — the agent reads, thinks, acts, reads again. Long sessions with deeply reusable prefixes.
In other words, agentic AI is the use case most suited to benefit from KV cache reuse — and therefore the use case that suffers most when that cache is invalidated.
The Claude Code Bug: A Case Study in Cache Destruction
In early March 2026, the Unsloth team published a critical finding: Claude Code (post-January 2026 versions) prepends a changing attribution header to every message it sends to the model. This header includes a session identifier, turn counter, or timestamp — values that differ between turns.
Because this changing text appears at the very beginning of every message, it invalidates the prefix cache on every single turn. The model cannot reuse any KV computations from previous turns, regardless of how much context is shared.
The impact scales with context length. In a 5,000-token agentic session at turn 20:
- Expected behavior: Recompute ~250 new tokens per turn → fast, responsive
- Actual behavior: Recompute all 5,000 tokens every turn → O(N²) total compute
The result is 90% slower inference — confirmed across multiple local inference setups using llama.cpp and vLLM. For a model that should generate at 90 tok/sec on an RTX 3090, effective throughput drops to under 10 tok/sec at long contexts.
Solutions: From Quick Fix to Architecture
The Claude Code fix is one solution to one instance of the problem. A more systematic approach covers four layers:
1. Stable Prefix Architecture
Design your prompt structure so that stable content comes first and dynamic content comes last. The golden rule: anything that doesn't change between turns should be at the beginning of the context.
- System prompt (tools, instructions, persona) → stable, goes first
- Long-term memory or RAG results → stable per session, goes next
- Conversation history → grows each turn, but previous turns are stable
- Current user input → always last, always new
Violating this order — such as injecting a timestamp or session ID at the top of each turn — destroys cache reuse for everything that follows.
2. Enable Prefix Caching in Your Inference Engine
Most production-grade inference engines support prefix KV cache, but it's not always on by default:
- llama.cpp: Use the
--cache-reuseflag when launchingllama-server - vLLM: Set
enable_prefix_caching=Truein engine config - SGLang: Prefix caching is enabled by default in recent versions
- Ollama: Handles prefix caching internally; ensure you're not resetting context between calls
3. Cross-Query Cache Sharing (Production Scale)
For production deployments with multiple users or high-concurrency agentic systems, tools like LMCache extend KV cache beyond a single query. Rather than discarding cache after each response, LMCache persists it across requests and shares it between inference engine instances.
LMCache published benchmarks showing up to 15x throughput improvement for multi-round Q&A workloads where a common system prompt prefix is shared. At scale, this means dramatically lower GPU hour costs for systems with repetitive prefix patterns (customer service bots, coding agents with standard tool schemas, etc.).
4. Audit Your Tooling
The Claude Code case is a reminder to audit everything in your inference pipeline for prefix stability:
- Does your agent framework inject timestamps, UUIDs, or turn counters at the message start?
- Does your system prompt template include dynamic content (current date, user name) that changes per request?
- Does your memory injection system prepend new memories rather than appending them?
Each of these breaks prefix caching. Find them, move dynamic content to the end, or parameterize it away.
Qwen3.5's Answer: Linear Attention Sidesteps the Problem
While the above fixes address KV cache management, Qwen3.5-35B-A3B takes a more architectural approach that reduces dependence on traditional KV cache entirely.
The model uses a hybrid attention architecture: a 4-layer repeating cycle of 3 Gated DeltaNet layers followed by 1 full softmax attention layer. Gated DeltaNet implements linear attention — complexity O(N) with respect to sequence length, not O(N²). Rather than computing attention scores between all token pairs, it maintains a fixed-size recurrent state that's updated with each new token.
The implication: three out of every four layers have no KV cache bottleneck at all. The cache only matters for the one softmax attention layer in each cycle. This means:
- Long contexts become dramatically cheaper to process
- Cache invalidation has 75% less impact than on a standard transformer
- 262K tokens of context is natively supported without the memory explosion standard attention would require
This isn't a cache management trick — it's an architectural solution that makes the problem structurally smaller. It comes with its own tradeoffs (linear attention can be weaker at very long-range dependencies), but for the coding and agentic use cases Qwen3.5 targets, it's a meaningful engineering choice.
Practical Checklist for Local AI Builders
- System prompt is static — no timestamps, session IDs, or dynamic content at the top
- Dynamic content (user input, tool results) is appended at the end of context
- Inference engine has prefix caching enabled (
--cache-reuse,enable_prefix_caching=True) - If using Claude Code with local models: Unsloth KV cache fix is applied
- Agent framework doesn't prepend changing headers/metadata to each message
- Memory injection appends to context rather than inserting into the prefix
- For high-concurrency production: consider LMCache for cross-query cache sharing
The Bottom Line
KV cache is one of the most important performance optimizations in LLM inference, and it's one of the easiest to accidentally break. The performance difference between working and broken cache at long agentic context lengths isn't 10% or 20% — it's an order of magnitude.
The good news: most of the fixes are simple, once you know where to look. Stable prefixes, the right inference engine flags, and auditing your tooling for dynamic prefix injection will recover most of the lost performance. The Claude Code fix is a concrete example of what's possible: a single patch restoring 90% of lost throughput.
If you're running local models for agentic work and haven't thought about prefix stability, it's worth a few hours of your time. The gains are real.
Sources: Unsloth Claude Code Guide · LMCache Paper (arXiv 2510.09665) · LLM Inference Optimization (TowardsAI) · Qwen3.5-35B-A3B (HuggingFace)