
Introduction

If you're building AI agents that use tool calls, multi-turn reasoning, or long context windows, you've probably assumed that GPU compute is your primary scaling bottleneck. DeepSeek's new paper, "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference", says you're wrong.

The real bottleneck? KV-Cache loading from external storage. Every time your agent starts a new turn in a multi-round conversation, the entire KV-Cache — the cached key-value pairs from previous tokens — must be loaded from disk or network storage back into GPU memory. As agents grow more capable with longer contexts and more tool-call loops, this I/O cost becomes the dominant factor limiting throughput.

Published on February 25, 2026 by a team of 13 researchers from DeepSeek, DualPath is not a new model architecture. It's an infrastructure-level optimization that exploits idle network bandwidth on decode engines to create a second data path for KV-Cache loading. The result: up to 1.87× higher offline inference throughput and an average of 1.96× higher online serving throughput, all without violating latency service-level objectives (SLOs).[1]

⚡ One-Line Summary: DualPath uses RDMA to load KV-Cache through idle decode-engine NICs in addition to the overloaded prefill-engine NICs, nearly doubling throughput for agentic workloads.

The Real Bottleneck: Data Movement, Not Computation

Modern LLM inference uses a mechanism called KV-Cache to avoid recomputing the keys and values of previously processed tokens. In a multi-turn conversation, the KV-Cache from prior turns is saved to external storage (typically distributed storage systems) and reloaded when the next turn begins. This is essential for efficiency: without it, every turn would require re-processing the entire conversation from scratch.

The problem emerges at scale. In DeepSeek's production environment, KV-Cache hit rates approach 95–99% — meaning the vast majority of tokens in a new turn have already been processed and cached.[3] That's great for compute savings, but terrible for I/O: all those cached KV pairs must be loaded from storage into GPU memory before the prefill engine can process the new tokens.
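To see why a high hit rate translates into heavy I/O, a back-of-the-envelope calculation helps. The sketch below estimates how many bytes of cached KV one request has to read and how long that takes over a single storage NIC. The model dimensions, context length, and NIC bandwidth are illustrative assumptions, not figures from the paper, and the per-token formula assumes a generic grouped-query attention layout rather than any specific DeepSeek architecture.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV-Cache for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_load_seconds(context_tokens: int, hit_rate: float,
                    per_token_bytes: int, nic_gbytes_per_s: float) -> float:
    """Time to read the cached part of the context over one storage NIC."""
    cached_tokens = round(context_tokens * hit_rate)
    return cached_tokens * per_token_bytes / (nic_gbytes_per_s * 1e9)


# Hypothetical model: 60 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_bytes_per_token(num_layers=60, num_kv_heads=8, head_dim=128)

# Hypothetical request: 64k-token context, 95% cached, 50 GB/s storage NIC.
seconds = kv_load_seconds(64_000, 0.95, per_token, nic_gbytes_per_s=50.0)

print(f"{per_token} bytes per token, {seconds:.3f} s to load the cached KV")
```

Under these assumed numbers a single request moves roughly 15 GB before any new token is computed, which is why the load path matters as much as the GPUs.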

As Jeff Dean famously put it: "Computation is cheap; data movement is expensive."[2]

The numbers tell the story. In disaggregated inference architectures where prefill and decode happen on separate engine pools, the storage NICs on prefill engines become bandwidth-saturated while the NICs on decode engines sit idle. This asymmetry creates a fundamental bottleneck: no matter how many GPUs you throw at the problem, throughput is capped by how fast you can feed data to the prefill engines.
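The cap described here can be written as the minimum of two limits: what the GPUs can compute and what the storage NICs can deliver. The following is a minimal sketch with made-up numbers. It assumes the storage bandwidth into the prefill pool stays fixed while GPUs are added, and the function and parameter names are hypothetical.

```python
def max_requests_per_second(num_gpus: int, compute_rps_per_gpu: float,
                            storage_gbytes_per_s: float,
                            kv_gbytes_per_request: float) -> float:
    """Throughput is bounded by the slower of compute and KV-Cache I/O."""
    compute_limit = num_gpus * compute_rps_per_gpu
    io_limit = storage_gbytes_per_s / kv_gbytes_per_request
    return min(compute_limit, io_limit)


# Hypothetical cluster: each GPU can prefill 2 requests/s, the prefill pool
# can read 50 GB/s from storage, and each request needs 5 GB of cached KV.
for gpus in (2, 8, 32):
    rps = max_requests_per_second(gpus, 2.0, 50.0, 5.0)
    print(f"{gpus:>2} GPUs -> {rps:.1f} requests/s")
```

Once the I/O limit is reached, adding GPUs no longer changes the result. Only more bandwidth into the prefill engines does.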

Prefill-Decode Separation: The Architecture That Creates the Problem

To understand DualPath, you first need to understand PD separation (Prefill-Decode disaggregation) — the prevailing architecture for large-scale LLM serving.

How PD Separation Works

Where the Bottleneck Forms

In agentic workloads with high KV-Cache reuse, the flow looks like this:

┌──────────────┐     Storage NIC      ┌──────────────────┐
│   External   │ ───────────────────► │  Prefill Engine  │
│   Storage    │    (SATURATED ⚠️)     │       (PE)       │
│  (KV-Cache)  │                      │  Loads KV → GPU  │
└──────────────┘                      │  Computes new KV │
                                      └────────┬─────────┘
                                               │ Transfer KV
                                               ▼
                                      ┌──────────────────┐
                                      │  Decode Engine   │
                                      │       (DE)       │
                                      │  NIC: IDLE 😴    │
                                      │ Generates tokens │
                                      └──────────────────┘

The prefill engine's storage NIC is doing all the heavy lifting — loading potentially gigabytes of KV-Cache per request. Meanwhile, the decode engine's storage NIC sits mostly idle because DEs don't need to load KV-Cache from storage (they receive it from PEs over the compute network). This imbalance is the core problem DualPath addresses.

The DualPath Solution

DualPath's key insight is deceptively simple: use the idle NICs on decode engines as a second path for KV-Cache loading.

The Dual Data Paths

  1. Path 1 (Traditional): Storage → Prefill Engine (existing path, storage NIC on PE)
  2. Path 2 (New): Storage → Decode Engine → RDMA transfer → Prefill Engine (uses idle storage NIC on DE + compute network)

┌──────────────┐    Path 1 (Storage NIC)     ┌──────────────────┐
│   External   │ ──────────────────────────► │  Prefill Engine  │
│   Storage    │                             │       (PE)       │
│  (KV-Cache)  │    Path 2 (Storage NIC)     │                  │
│              │ ──────────┐                 └──────────────────┘
└──────────────┘           │                          ▲
                           ▼                          │
                  ┌──────────────────┐  RDMA          │
                  │  Decode Engine   │ ───────────────┘
                  │       (DE)       │ (Compute NIC)
                  │ Buffers KV data  │
                  └──────────────────┘

═══ Result: 2× the effective storage bandwidth to PEs ═══

Three Key Components

1. Dual-Path Data Loading

KV-Cache blocks are split across both paths. The decode engine loads a portion of the KV-Cache from storage into its host memory (or HBM), then transfers it to the prefill engine via RDMA over the compute network. This adds the decode engines' storage-NIC bandwidth to the bandwidth already available for feeding KV-Cache to prefill engines, up to doubling it when the two paths are equally fast.
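How much a second path helps depends on how the blocks are divided. The sketch below splits blocks in proportion to each path's bandwidth and computes the load time as the slower of the two concurrent transfers. It is an illustration under assumed bandwidths, not the paper's splitting policy, and all names and numbers are hypothetical.

```python
def split_blocks(total_blocks: int, pe_gbytes_per_s: float,
                 de_gbytes_per_s: float) -> tuple[int, int]:
    """Split blocks between the two paths in proportion to their bandwidth."""
    via_de = round(total_blocks * de_gbytes_per_s
                   / (pe_gbytes_per_s + de_gbytes_per_s))
    return total_blocks - via_de, via_de


def load_time(blocks_pe: int, blocks_de: int, block_gbytes: float,
              pe_gbytes_per_s: float, de_gbytes_per_s: float) -> float:
    """Both paths run concurrently, so the slower one sets the finish time."""
    time_pe = blocks_pe * block_gbytes / pe_gbytes_per_s
    time_de = blocks_de * block_gbytes / de_gbytes_per_s
    return max(time_pe, time_de)


BLOCKS, BLOCK_GB, PE_BW = 100, 0.05, 50.0  # hypothetical workload and NIC

single_path = load_time(BLOCKS, 0, BLOCK_GB, PE_BW, PE_BW)
for de_bw in (50.0, 25.0):  # path 2 as fast as path 1, then half as fast
    on_pe, on_de = split_blocks(BLOCKS, PE_BW, de_bw)
    dual = load_time(on_pe, on_de, BLOCK_GB, PE_BW, de_bw)
    print(f"path 2 at {de_bw} GB/s: {on_pe}/{on_de} blocks, "
          f"{single_path / dual:.2f}x faster than one path")
```

With equal paths the load time halves. When the second path is slower, for instance because the RDMA hop shares the compute network with other traffic, the gain is smaller.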

2. Block-wise Data Layouts

DualPath uses two data layout strategies — Full Block and Layer Block — optimized for RDMA transfers. These layouts enable GPUDirect RDMA, allowing data to move directly between machines without CPU involvement. The paper notes that an RDMA write can actually be cheaper in latency than a local cudaMemcpyAsync.[3]

3. Global Scheduler

A global scheduler dynamically balances load across prefill and decode engines. It monitors per-engine token counts, read queue depths, and unfinished token thresholds to decide which path each KV-Cache block should take. The scheduler also enforces compute quotas to minimize synchronization bubbles when packing reads alongside compute operations.[1]
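The path-selection idea can be illustrated with a simple greedy rule: send each block down whichever path would finish it sooner, given what is already queued. This is a deliberately simplified sketch and not the paper's scheduler. It models only read queue depth and ignores token counts and compute quotas. The `Path` class, path names, and bandwidths are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    gbytes_per_s: float
    queued_gbytes: float = 0.0  # read-queue depth already waiting on this path

    def finish_time_if_added(self, gbytes: float) -> float:
        """Estimated time until a new block of this size would complete."""
        return (self.queued_gbytes + gbytes) / self.gbytes_per_s


def assign_blocks(block_sizes_gb: list[float],
                  paths: list[Path]) -> dict[str, list[int]]:
    """Greedily route each block to the path with the earliest finish time."""
    assignment: dict[str, list[int]] = {p.name: [] for p in paths}
    for index, size in enumerate(block_sizes_gb):
        best = min(paths, key=lambda p: p.finish_time_if_added(size))
        best.queued_gbytes += size
        assignment[best.name].append(index)
    return assignment


# Hypothetical state: the direct path is faster but already has work queued.
paths = [Path("pe_direct", 50.0, queued_gbytes=1.5), Path("via_de", 25.0)]
plan = assign_blocks([1.0] * 6, paths)
for name, blocks in plan.items():
    print(name, blocks)
```

A rule like this sends work to the decode-engine path only when the direct path is backed up, which is the balancing behaviour the scheduler is described as providing.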

🔑 Why It's Elegant: DualPath doesn't require exotic hardware or model changes. It exploits existing idle network resources (decode-engine NICs) and uses standard RDMA capabilities already present in most inference clusters. The "dual" in DualPath refers to using two concurrent data paths where only one existed before.

RDMA: The Technology Making This Possible

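Remote Direct Memory Access (RDMA) is a networking technology that lets one machine read from or write to another machine's memory without routing the data through either machine's CPU or operating-system network stack. This matters for DualPath because the second path adds a hop from the decode engine to the prefill engine, and that hop has to be cheap to be worthwhile.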

In practice, DualPath uses GPUDirect RDMA through the compute NICs (CNICs), allowing KV-Cache data to flow directly from a decode engine's memory to a prefill engine's GPU memory without any CPU copies. The paper uses priority virtual lanes to isolate this traffic from latency-sensitive model parallel communications.

Performance Results

DeepSeek evaluated DualPath on their in-house inference system across three model sizes with production agentic workloads:[1]

| Metric | Baseline | DualPath | Improvement |
| --- | --- | --- | --- |
| Offline Throughput | 1.0× | Up to 1.87× | Up to 87% increase |
| Online Throughput | 1.0× | Avg 1.96× | 96% average increase |
| TTFT (Time to First Token) | Baseline | Stable | No regression |
| Token-to-Token Latency | Baseline | Stable | No regression |
| Job Completion Time | Baseline | ~45.6% reduction | Jobs finish faster |

Model Scale

The evaluation covered models including DeepSeek 27B and DeepSeek 660B, with production-grade agentic workloads. At thousands of GPUs, DualPath showed improved NIC load balance across the cluster — suggesting the approach scales well beyond small deployments.[3]

Prerequisites for Peak Performance

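The headline numbers assume specific conditions: a PD-separated deployment, an RDMA-capable interconnect between engines, and KV-Cache hit rates of roughly 95% or higher.[2]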

⚠️ Important Caveat: Workloads with sparse history reuse, fragmented sessions, or weaker networking will see smaller improvements. Without robust RDMA, added transfers can shift bottlenecks rather than remove them. Organizations should validate under their own workload mixes.[2]

Why AI Agent Builders Should Care

This paper matters for anyone building agentic AI systems. Here's why:

The Agent Loop Problem

Every tool call in an agentic loop adds context. Consider an agent that:

  1. Receives a user query (Turn 1)
  2. Calls a search API, processes results (Turn 2)
  3. Reads a document, extracts key information (Turn 3)
  4. Calls a code execution tool (Turn 4)
  5. Synthesizes and responds (Turn 5)

Each turn requires loading the KV-Cache from all previous turns. By Turn 5, you're loading the accumulated cache from Turns 1–4. The bytes loaded per turn grow with the accumulated context, so the total I/O over a session grows faster than linearly in the number of turns, and longer context windows, more tool integrations, and deeper reasoning chains all push it higher.
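The accumulation is easy to see with a few lines of arithmetic. The sketch below counts, for each turn, how many tokens of cached KV must be reloaded, assuming every earlier token is cached and reused. The per-turn token counts are hypothetical.

```python
def kv_tokens_loaded_per_turn(new_tokens_per_turn: list[int]) -> list[int]:
    """For each turn, the cached tokens to reload are all tokens seen so far."""
    loaded, context = [], 0
    for new_tokens in new_tokens_per_turn:
        loaded.append(context)   # everything from earlier turns is reloaded
        context += new_tokens    # this turn's tokens join the cache
    return loaded


# Hypothetical five-turn agent loop: query, search results, document,
# code execution output, final answer.
turns = [500, 4_000, 12_000, 2_000, 800]
per_turn = kv_tokens_loaded_per_turn(turns)

print("reloaded per turn:", per_turn)
print("new tokens total: ", sum(turns))
print("reloaded total:   ", sum(per_turn))
```

In this example the session adds 19,300 new tokens but reloads 40,000 cached tokens in total, and the gap widens with every additional turn.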

The Scaling Problem

This isn't just a single-request latency issue. At scale, when you're serving thousands of concurrent agent sessions, the storage bandwidth becomes the hard ceiling on how many agents you can run simultaneously. DualPath effectively raises that ceiling by pooling bandwidth from idle decode-engine NICs.

Infrastructure vs. Architecture

DualPath is notable because it's a pure infrastructure optimization. It doesn't change the model, the attention mechanism, or the prompting strategy. This means it's applicable to any large-scale LLM serving system that uses PD separation — not just DeepSeek's models. If you're running vLLM, TensorRT-LLM, or any disaggregated serving framework, DualPath's principles could be adapted to your stack.

Broader Implications

DeepSeek Is Building a Full-Stack AI Company

This paper signals something important: DeepSeek isn't just publishing model papers. They're publishing infrastructure research — the kind of systems-level optimization that makes their famously cheap API pricing possible. When DeepSeek offers inference at a fraction of competitors' prices, papers like DualPath explain how.[2]

The Industry Shift: "How Do You Serve It?"

We're entering an era where the model itself is increasingly commoditized. The competitive advantage is shifting from "what model do you have?" to "how efficiently can you serve it?" DualPath represents this shift — it's not about making the model smarter, it's about making the infrastructure faster and cheaper.

The $0.01/1M Token Question

Infrastructure optimizations like DualPath, combined with DeepSeek's MoE (Mixture of Experts) architecture and aggressive quantization, explain how inference costs are plummeting. When you can nearly double throughput without adding hardware, you're directly cutting the cost per token. This is the kind of engineering that makes sub-penny-per-million-token pricing a reality.
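The link between throughput and cost per token is simple division. The sketch below assumes hardware cost per hour is fixed and the cluster stays fully utilized, which is a simplification. The throughput multipliers are the headline figures reported above.

```python
def relative_cost_per_token(throughput_multiplier: float) -> float:
    """Same hardware cost spread over more tokens: cost scales as 1 / speedup."""
    return 1.0 / throughput_multiplier


for label, speedup in (("offline, up to", 1.87), ("online, average", 1.96)):
    cost = relative_cost_per_token(speedup)
    print(f"{label} {speedup}x -> {cost:.1%} of baseline cost per token "
          f"({1 - cost:.1%} lower)")
```

Under these assumptions, the reported speedups correspond to a cost per token roughly 46% to 49% below the baseline.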

Open Research Benefits Everyone

By publishing DualPath as a paper (not just deploying it internally), DeepSeek enables the broader community to adopt these techniques. Other inference frameworks — vLLM, SGLang, TensorRT-LLM — could implement similar dual-path loading strategies. This raises the bar for the entire industry.

Limitations & Caveats

Key Takeaways

🎯 For AI Engineers Building Agents
  • The bottleneck in multi-turn agentic inference is KV-Cache I/O, not GPU compute
  • Every tool call adds context, making the I/O bottleneck worse with agent complexity
  • DualPath achieves up to 1.87× offline and an average of 1.96× online throughput by using idle decode-engine NICs
  • RDMA moves data between inference engines without CPU copies on the data path
  • This is an infrastructure fix — applicable to any PD-separated serving system, not just DeepSeek
  • Peak gains require RDMA interconnect and ≥95% KV-Cache hit rates
  • The industry is shifting from model innovation to serving infrastructure innovation

If you're scaling agentic AI systems, start measuring your KV-Cache I/O bottleneck. Chances are, your GPUs are waiting on storage more than you think. DualPath won't be the last paper in this space — but it's a clear signal that the future of LLM performance is as much about plumbing as it is about parameters.
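One way to start measuring is to time the KV-Cache load and the prefill compute separately and compare their shares. The helper below is a minimal sketch using only the standard library. The `PhaseTimer` class is hypothetical, and the `time.sleep` calls are stand-ins for whatever load and compute calls your own serving stack makes.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = {}

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def share(self, name: str) -> float:
        """Fraction of all measured time spent in the given phase."""
        total = sum(self.totals.values())
        return self.totals.get(name, 0.0) / total if total else 0.0


timer = PhaseTimer()
with timer.phase("kv_load"):
    time.sleep(0.03)  # stand-in for reading cached KV from storage
with timer.phase("prefill_compute"):
    time.sleep(0.01)  # stand-in for running prefill on the new tokens

print(f"kv_load share of request time: {timer.share('kv_load'):.0%}")
```

If the load phase dominates on real traffic, storage bandwidth rather than GPU count is the resource to look at first.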