The 90% Problem Nobody Talks About

A few weeks ago, the Unsloth team posted a finding that should have sent shockwaves through the local AI community: Claude Code, when used with locally served models, was making inference 90% slower than it should be. Not because of model size. Not because of quantization. Because of a single, fixable behavior in how Claude Code formats its messages.

The root cause is KV cache invalidation — and while the Claude Code case is a dramatic example, it's one instance of a much broader problem. If you're running agentic workloads on local models, there's a very good chance you're leaving most of your hardware's performance on the table right now.

This piece explains why.

What KV Cache Actually Does

To understand why KV cache invalidation matters so much, you need to understand the two phases of LLM inference.

When you send a prompt to a model, inference starts with the Prefill phase. The entire input is tokenized, and the model processes all input tokens in parallel — computing attention scores, feed-forward activations, and crucially, the Key and Value tensors for every token in the context. This is compute-intensive but fast, because GPUs excel at parallel matrix operations.

Then comes the Decode phase. The model generates output one token at a time. This is fundamentally sequential — each new token depends on all previous tokens, so you can't parallelize it. Each decode step requires a forward pass through the entire model.

Here's the problem: without caching, each decode step would need to recompute the Key and Value tensors for every token in the entire context — including the original prompt. If your system prompt is 5,000 tokens and you've generated 500 tokens so far, step 501 would recompute attention for all 5,500 tokens just to produce one more output token.

KV cache solves this by storing the Key and Value tensors computed during prefill and each decode step. When the next token is generated, only the K/V tensors for the new token need to be computed. The rest are retrieved from cache. Per decode step, attention work drops from O(N²) (reprocessing all N tokens against each other) to O(N) (one new token attending over N cached entries).
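The bookkeeping behind that claim can be sketched in a few lines. This is a counting model only, not an inference engine: the function names are made up for illustration, and it assumes the first output token comes out of prefill.

```python
def tokens_processed_at_step(prompt_tokens: int, generated_so_far: int, use_cache: bool) -> int:
    """Tokens whose K/V tensors must be computed to produce the next token."""
    if use_cache:
        # Everything earlier is already cached; only the newest token is processed.
        return 1
    # No cache: the prompt and every generated token are reprocessed.
    return prompt_tokens + generated_so_far


def total_tokens_processed(prompt_tokens: int, new_tokens: int, use_cache: bool) -> int:
    """Total K/V computations to generate `new_tokens` tokens.

    Assumes the first output token is produced by prefill itself.
    """
    total = prompt_tokens  # prefill processes the full prompt once either way
    for generated_so_far in range(1, new_tokens):
        total += tokens_processed_at_step(prompt_tokens, generated_so_far, use_cache)
    return total


# The example from the text: 5,000-token prompt, 500 tokens generated so far.
print(tokens_processed_at_step(5_000, 500, use_cache=False))  # 5500
print(tokens_processed_at_step(5_000, 500, use_cache=True))   # 1
```

With the cache, the total grows linearly with output length; without it, every extra token adds a full pass over everything that came before.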

O(N) vs O(N²): The Real Cost in Numbers

The complexity difference isn't abstract. Let's make it concrete.

Assume you're running Qwen3.5-35B-A3B at 90 tokens/sec on an RTX 3090. You have a 10,000-token context (a typical agentic session with system prompt, tool schemas, and conversation history). You're on turn 50 of an agentic loop.

| Scenario | Tokens recomputed per turn | Time per turn @ 90 tok/s |
|---|---|---|
| With working KV cache | ~200 (new tokens only) | ~2.2 seconds |
| Cache invalidated every turn | ~10,000 (full context) | ~111 seconds |
| Slowdown factor | | ~50x |

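Real-world slowdown depends on context length and turn count. At short contexts the gap is small. At long agentic sessions it becomes catastrophic. The table also prices recomputed tokens at the decode rate for simplicity; prefill usually runs faster per token than decode, so treat the figures as illustrative rather than measured. The 90% figure from Unsloth was measured at typical agentic context lengths, and it's conservative for very long sessions.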
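The arithmetic behind the table is simple enough to check by hand. The sketch below reproduces it under the same simplifying assumption the table makes, namely that every processed token costs 1/90 of a second; the function name is illustrative.

```python
def seconds_per_turn(tokens_processed: int, tokens_per_second: float) -> float:
    """Time for one turn if every processed token costs the same."""
    return tokens_processed / tokens_per_second


cached = seconds_per_turn(200, 90)          # new tokens only
invalidated = seconds_per_turn(10_000, 90)  # full context every turn
slowdown = invalidated / cached

print(f"cached: {cached:.1f} s")            # cached: 2.2 s
print(f"invalidated: {invalidated:.0f} s")  # invalidated: 111 s
print(f"slowdown: {slowdown:.0f}x")         # slowdown: 50x
```

Because the throughput cancels out, the slowdown is just the ratio of context length to new tokens per turn. It grows as the session gets longer.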

The underlying bottleneck in the decode phase is memory bandwidth, not compute. Every forward pass loads model weights from VRAM into GPU compute cores. With KV cache, you also need to load the cached K/V tensors — but that's fast. Without KV cache, you need to recompute them from scratch, which burns both memory bandwidth and compute cycles.

Why Agentic Systems Suffer Most

KV cache benefits are maximized when the prefix of the context is stable across multiple requests. In a traditional single-turn chat application, each request is independent — no prefix to cache. But agentic systems are exactly the opposite:

In other words, agentic AI is the use case most suited to benefit from KV cache reuse — and therefore the use case that suffers most when that cache is invalidated.

⚠️ The Core Rule: Any change to the context invalidates the prefix cache from that point forward. A single-character difference at the start of a 5,000-token system prompt means the model recomputes all 5,000 tokens, even though only one token changed.
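A rough way to picture the rule is as a longest-common-prefix comparison between what is cached and what arrives next. The sketch below uses plain integers as stand-in token IDs, and both function names are hypothetical. Real engines typically match in blocks or chunks rather than token by token, so treat this as the idea, not the implementation.

```python
def shared_prefix_len(cached_tokens: list, new_tokens: list) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def tokens_to_recompute(cached_tokens: list, new_tokens: list) -> int:
    """Tokens in the new request that cannot be served from the prefix cache."""
    return len(new_tokens) - shared_prefix_len(cached_tokens, new_tokens)


system_prompt = list(range(5_000))       # stand-in token IDs
previous_turn = system_prompt + [9001, 9002]

# Healthy case: the new request only appends to the previous one.
appended = previous_turn + [9003]
print(tokens_to_recompute(previous_turn, appended))   # 1

# Broken case: the very first token changed, so nothing can be reused.
changed_head = [-1] + previous_turn[1:] + [9003]
print(tokens_to_recompute(previous_turn, changed_head))  # 5003
```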

The Claude Code Bug: A Case Study in Cache Destruction

In early March 2026, the Unsloth team published a critical finding: Claude Code (post-January 2026 versions) prepends a changing attribution header to every message it sends to the model. This header includes a session identifier, turn counter, or timestamp — values that differ between turns.

Because this changing text appears at the very beginning of every message, it invalidates the prefix cache on every single turn. The model cannot reuse any KV computations from previous turns, regardless of how much context is shared.

The impact scales with context length. In a 5,000-token agentic session at turn 20:

The result is 90% slower inference — confirmed across multiple local inference setups using llama.cpp and vLLM. For a model that should generate at 90 tok/sec on an RTX 3090, effective throughput drops to under 10 tok/sec at long contexts.

✅ The Fix: Unsloth published a patch that stabilizes the Claude Code prefix, making the attribution header consistent across turns. With the fix applied, the prefix cache is reused and performance returns to baseline. Details at: unsloth.ai/docs/basics/claude-code

Solutions: From Quick Fix to Architecture

The Claude Code fix is one solution to one instance of the problem. A more systematic approach covers four layers:

1. Stable Prefix Architecture

Design your prompt structure so that stable content comes first and dynamic content comes last. The golden rule: anything that doesn't change between turns should be at the beginning of the context.

Violating this order — such as injecting a timestamp or session ID at the top of each turn — destroys cache reuse for everything that follows.
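One way this ordering might look in code is sketched below. The prompt text, tool schema, and `build_context` helper are all hypothetical placeholders. The point is only the order: static parts first, append-only history next, per-turn values last.

```python
import os
from datetime import datetime, timezone

# Hypothetical static content: identical on every turn.
SYSTEM_PROMPT = "You are a coding agent."
TOOL_SCHEMAS = '[{"name": "read_file"}]'


def build_context(history: list, now: datetime) -> str:
    """Assemble a prompt with stable content first and dynamic content last."""
    stable = [SYSTEM_PROMPT, TOOL_SCHEMAS]
    dynamic = [f"Current time: {now.isoformat()}"]  # changes every turn, so it goes at the end
    return "\n".join(stable + history + dynamic)


turn1 = build_context(
    ["user: hi"],
    datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc),
)
turn2 = build_context(
    ["user: hi", "assistant: hello"],
    datetime(2026, 3, 1, 12, 1, tzinfo=timezone.utc),
)

# Everything up to the end of turn 1's history is shared and therefore cacheable.
shared = os.path.commonprefix([turn1, turn2])
print(len(shared), "of", len(turn2), "characters shared")
```

Had the timestamp been placed before `SYSTEM_PROMPT`, the two turns would diverge within the first line and share almost nothing.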

2. Enable Prefix Caching in Your Inference Engine

Most production-grade inference engines support prefix KV cache, but it's not always on by default:

3. Cross-Query Cache Sharing (Production Scale)

For production deployments with multiple users or high-concurrency agentic systems, tools like LMCache extend KV cache beyond a single query. Rather than discarding cache after each response, LMCache persists it across requests and shares it between inference engine instances.

LMCache published benchmarks showing up to 15x throughput improvement for multi-round Q&A workloads where a common system prompt prefix is shared. At scale, this means dramatically lower GPU hour costs for systems with repetitive prefix patterns (customer service bots, coding agents with standard tool schemas, etc.).

4. Audit Your Tooling

The Claude Code case is a reminder to audit everything in your inference pipeline for prefix stability:

Each of these breaks prefix caching. Find them, move dynamic content to the end, or parameterize it away.
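A simple audit along these lines is to log the raw request text your tooling sends on each turn and check where consecutive requests first diverge. The sketch below assumes you can capture those requests as strings. The function names and the 0.9 threshold are illustrative choices, not a standard.

```python
def divergence_index(prev: str, curr: str) -> int:
    """Index of the first character where two requests differ."""
    limit = min(len(prev), len(curr))
    for i in range(limit):
        if prev[i] != curr[i]:
            return i
    return limit


def audit_prefix_stability(requests: list, min_shared: float = 0.9) -> list:
    """Return (turn, divergence_index) for turns that reuse too little of the previous request."""
    flagged = []
    for turn in range(1, len(requests)):
        prev, curr = requests[turn - 1], requests[turn]
        idx = divergence_index(prev, curr)
        if prev and idx / len(prev) < min_shared:
            flagged.append((turn, idx))
    return flagged


healthy = ["SYSTEM\nuser: a", "SYSTEM\nuser: a\nassistant: b"]
broken = ["[turn=1] SYSTEM\nuser: a", "[turn=2] SYSTEM\nuser: a\nassistant: b"]

print(audit_prefix_stability(healthy))  # []
print(audit_prefix_stability(broken))   # [(1, 6)]
```

A flagged turn with a very small divergence index points straight at a changing header or counter near the top of the request.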

Qwen3.5's Answer: Linear Attention Sidesteps the Problem

While the above fixes address KV cache management, Qwen3.5-35B-A3B takes a more architectural approach that reduces dependence on traditional KV cache entirely.

The model uses a hybrid attention architecture: a 4-layer repeating cycle of 3 Gated DeltaNet layers followed by 1 full softmax attention layer. Gated DeltaNet implements linear attention — complexity O(N) with respect to sequence length, not O(N²). Rather than computing attention scores between all token pairs, it maintains a fixed-size recurrent state that's updated with each new token.

The implication: three out of every four layers have no KV cache bottleneck at all. The cache only matters for the one softmax attention layer in each cycle. This means:

This isn't a cache management trick — it's an architectural solution that makes the problem structurally smaller. It comes with its own tradeoffs (linear attention can be weaker at very long-range dependencies), but for the coding and agentic use cases Qwen3.5 targets, it's a meaningful engineering choice.
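To see why a recurrent state removes the growing cache, consider the toy below. It implements plain, unnormalized linear attention (state += key ⊗ value), which is a deliberate simplification: Gated DeltaNet adds gating and a delta-rule update that are not shown here, and the dimensions and inputs are arbitrary.

```python
def linear_attention_step(state, key, value, query):
    """Fold one token into a d x d state, then read out an output vector."""
    d = len(key)
    for i in range(d):
        for j in range(d):
            state[i][j] += key[i] * value[j]  # outer-product update
    return [sum(query[i] * state[i][j] for i in range(d)) for j in range(d)]


d = 2
state = [[0.0] * d for _ in range(d)]  # fixed-size recurrent state
kv_cache = []                          # what a softmax layer would keep instead

for _ in range(1_000):
    key, value, query = [1.0, 0.0], [2.0, 3.0], [1.0, 0.0]
    output = linear_attention_step(state, key, value, query)
    kv_cache.append((key, value))      # grows by one entry per token

state_size = d * d
print(state_size)     # 4, no matter how many tokens were seen
print(len(kv_cache))  # 1000, one entry per token
```

The state costs the same memory at token 1,000 as at token 1, while the softmax-style cache keeps growing with the context.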

Practical Checklist for Local AI Builders

📋 KV Cache Health Checklist
  1. System prompt is static — no timestamps, session IDs, or dynamic content at the top
  2. Dynamic content (user input, tool results) is appended at the end of context
  3. Inference engine has prefix caching enabled (--cache-reuse, enable_prefix_caching=True)
  4. If using Claude Code with local models: Unsloth KV cache fix is applied
  5. Agent framework doesn't prepend changing headers/metadata to each message
  6. Memory injection appends to context rather than inserting into the prefix
  7. For high-concurrency production: consider LMCache for cross-query cache sharing

The Bottom Line

KV cache is one of the most important performance optimizations in LLM inference, and it's one of the easiest to accidentally break. The performance difference between working and broken cache at long agentic context lengths isn't 10% or 20% — it's an order of magnitude.

The good news: most of the fixes are simple, once you know where to look. Stable prefixes, the right inference engine flags, and auditing your tooling for dynamic prefix injection will recover most of the lost performance. The Claude Code fix is a concrete example of what's possible: a single patch that removes a 90% slowdown and returns throughput to baseline.

If you're running local models for agentic work and haven't thought about prefix stability, it's worth a few hours of your time. The gains are real.


Sources: Unsloth Claude Code Guide · LMCache Paper (arXiv 2510.09665) · LLM Inference Optimization (TowardsAI) · Qwen3.5-35B-A3B (HuggingFace)