Introduction
If you're building AI agents that use tool calls, multi-turn reasoning, or long context windows, you've probably assumed that GPU compute is your primary scaling bottleneck. DeepSeek's new paper, "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference", says you're wrong.
The real bottleneck? KV-Cache loading from external storage. Every time your agent starts a new turn in a multi-round conversation, the entire KV-Cache — the cached key-value pairs from previous tokens — must be loaded from disk or network storage back into GPU memory. As agents grow more capable with longer contexts and more tool-call loops, this I/O cost becomes the dominant factor limiting throughput.
Published on February 25, 2026 by a team of 13 researchers from DeepSeek, DualPath is not a new model architecture. It's an infrastructure-level optimization that exploits idle network bandwidth on decode engines to create a second data path for KV-Cache loading. The result: up to 1.87× higher offline inference throughput and an average of 1.96× higher online serving throughput, all without violating latency service-level objectives (SLOs).[1]
The Real Bottleneck: Data Movement, Not Computation
Modern LLM inference uses a mechanism called KV-Cache to avoid recomputing attention over previously processed tokens. In a multi-turn conversation, the KV-Cache from prior turns is saved to external storage (typically distributed storage systems) and reloaded when the next turn begins. This is essential for efficiency — without it, every turn would require re-processing the entire conversation from scratch.
The problem emerges at scale. In DeepSeek's production environment, KV-Cache hit rates reach 95–99%, meaning the vast majority of tokens in a new turn have already been processed and cached.[3] That's great for compute savings, but terrible for I/O: all those cached KV pairs must be loaded from storage into GPU memory before the prefill engine can process the new tokens.
As Jeff Dean famously put it: "Computation is cheap; data movement is expensive."[2]
The imbalance shows up in the network hardware. In disaggregated inference architectures, where prefill and decode run on separate engine pools, the storage NICs on prefill engines become bandwidth-saturated while the storage NICs on decode engines sit idle. This asymmetry creates a fundamental bottleneck: no matter how many GPUs you add, throughput is capped by how fast you can feed data to the prefill engines.
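To get a feel for the scale, here is a back-of-the-envelope sketch of how large a KV-Cache gets and how long a single NIC needs to load it. The formula is the standard one for uncompressed multi-head or grouped-query attention. The model shape and the 400 Gbit/s NIC are illustrative assumptions, not figures from the paper, and DeepSeek's own models use a compressed attention cache that is smaller per token.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV-Cache: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def load_seconds(num_bytes, nic_gbit_per_s):
    """Lower bound on load time when the NIC is the only limit."""
    nic_bytes_per_s = nic_gbit_per_s * 1e9 / 8
    return num_bytes / nic_bytes_per_s


# Hypothetical model shape and NIC speed, chosen only for illustration.
size = kv_cache_bytes(tokens=100_000, layers=60, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")            # 24.6 GB
print(f"{load_seconds(size, 400):.2f} s")  # 0.49 s
```

Under these assumptions a single 100k-token session keeps one storage NIC busy for about half a second per turn, before any prefill compute starts.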
Prefill-Decode Separation: The Architecture That Creates the Problem
To understand DualPath, you first need to understand PD separation (Prefill-Decode disaggregation) — the prevailing architecture for large-scale LLM serving.
How PD Separation Works
- Prefill Engines (PEs) handle the initial processing of input tokens — computing the KV-Cache for new tokens and running the first forward pass
- Decode Engines (DEs) handle autoregressive token generation — producing output tokens one at a time using the cached KV pairs
- This separation allows each engine type to be optimized independently: PEs for throughput (batch processing), DEs for latency (fast per-token generation)
Where the Bottleneck Forms
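In agentic workloads with high KV-Cache reuse, each request follows the same route: cached KV blocks are read from storage through the prefill engine's storage NIC, the prefill engine processes the new tokens, and the resulting KV-Cache is handed to a decode engine over the compute network.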
The prefill engine's storage NIC is doing all the heavy lifting — loading potentially gigabytes of KV-Cache per request. Meanwhile, the decode engine's storage NIC sits mostly idle because DEs don't need to load KV-Cache from storage (they receive it from PEs over the compute network). This imbalance is the core problem DualPath addresses.
The DualPath Solution
DualPath's key insight is deceptively simple: use the idle NICs on decode engines as a second path for KV-Cache loading.
The Dual Data Paths
- Path 1 (Traditional): Storage → Prefill Engine (existing path, storage NIC on PE)
- Path 2 (New): Storage → Decode Engine → RDMA transfer → Prefill Engine (uses idle storage NIC on DE + compute network)
Three Key Components
1. Dual-Path Data Loading
KV-Cache blocks are split across both paths. The decode engine loads a portion of the KV-Cache from storage into its host memory (or HBM), then transfers it to the prefill engine via RDMA over the compute network. This adds the decode engines' storage-NIC bandwidth to what the prefill engines already have, so the bandwidth available for feeding KV-Cache to prefill engines can roughly double when both pools have comparable storage NIC capacity.
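One simple way to picture the split is to divide blocks in proportion to each path's bandwidth so both paths finish at about the same time. This is a minimal sketch under that assumption, not the paper's policy. The function names and bandwidth figures are hypothetical, and the extra RDMA hop on the decode-engine path is ignored.

```python
def split_blocks(num_blocks, pe_bw, de_bw):
    """Return (blocks_via_pe, blocks_via_de), proportional to path bandwidth."""
    via_pe = round(num_blocks * pe_bw / (pe_bw + de_bw))
    return via_pe, num_blocks - via_pe


def finish_time(via_pe, via_de, block_bytes, pe_bw, de_bw):
    """Loading completes when the slower of the two paths completes."""
    return max(via_pe * block_bytes / pe_bw, via_de * block_bytes / de_bw)


# Hypothetical numbers: 1000 blocks of 1 MB, 50 GB/s available on each path.
blocks, block_bytes, bw = 1000, 1_000_000, 50e9
single = finish_time(blocks, 0, block_bytes, bw, bw)
via_pe, via_de = split_blocks(blocks, bw, bw)
dual = finish_time(via_pe, via_de, block_bytes, bw, bw)
print(via_pe, via_de)  # 500 500
print(single / dual)   # 2.0
```

With equal bandwidth on both paths the load time halves. With a slower second path the gain shrinks accordingly, for example `split_blocks(1000, 50e9, 25e9)` sends 667 blocks through the prefill engine and 333 through the decode engine.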
2. Block-wise Data Layouts
DualPath uses two data layout strategies — Full Block and Layer Block — optimized for RDMA transfers. These layouts enable GPUDirect RDMA, allowing data to move directly between machines without CPU involvement. The paper notes that an RDMA write can actually be cheaper in latency than a local cudaMemcpyAsync.[3]
3. Global Scheduler
A global scheduler dynamically balances load across prefill and decode engines. It monitors per-engine token counts, read queue depths, and unfinished token thresholds to decide which path each KV-Cache block should take. The scheduler also enforces compute quotas to minimize synchronization bubbles when packing reads alongside compute operations.[1]
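The sketch below illustrates the general idea of queue-aware path selection with a deliberately simplified rule: send each block to whichever engine's storage NIC would finish reading it soonest. It uses only read queue depth, whereas the paper's scheduler also considers token counts and compute quotas. The `Engine` class and `pick_reader` function are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    queued_read_bytes: int   # bytes already waiting on this engine's storage NIC
    nic_bytes_per_s: float   # storage NIC bandwidth


def pick_reader(engines, block_bytes):
    """Assign a block to the engine with the earliest estimated finish time."""
    def eta(engine):
        return (engine.queued_read_bytes + block_bytes) / engine.nic_bytes_per_s

    best = min(engines, key=eta)          # ties go to the first engine listed
    best.queued_read_bytes += block_bytes  # account for the newly queued read
    return best.name


# Hypothetical state: the prefill engine already has a backlog, the decode engine is idle.
engines = [Engine("pe0", 4_000, 1.0), Engine("de0", 0, 1.0)]
picks = [pick_reader(engines, 1_000) for _ in range(6)]
print(picks)  # ['de0', 'de0', 'de0', 'de0', 'pe0', 'de0']
```

The idle decode engine absorbs reads until its queue matches the prefill engine's backlog, after which the two alternate.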
RDMA: The Technology Making This Possible
Remote Direct Memory Access (RDMA) is a networking technology that allows one machine to read from or write to another machine's memory without involving either machine's CPU. This is critical for DualPath because:
- Zero CPU overhead: Data moves directly between network card and memory, freeing the CPU for other work
- Ultra-low latency: Bypasses the OS network stack entirely — typical RDMA latency is 1–2 microseconds vs. 10–50 microseconds for TCP/IP
- No interference: DualPath routes KV-Cache transfers over the compute network's RDMA fabric, which is separate from the storage network — so latency-critical model execution communications (attention collectives, all-reduce operations) are not affected[1]
In practice, DualPath uses GPUDirect RDMA through the compute NICs (CNICs), allowing KV-Cache data to flow directly from a decode engine's memory to a prefill engine's GPU memory without any CPU copies. The paper uses priority virtual lanes to isolate this traffic from latency-sensitive model parallel communications.
Performance Results
DeepSeek evaluated DualPath on their in-house inference system across three model sizes with production agentic workloads:[1]
| Metric | Baseline | DualPath | Improvement |
|---|---|---|---|
| Offline Throughput | 1.0× | Up to 1.87× | Up to 87% increase |
| Online Throughput | 1.0× | Avg 1.96× | 96% average increase |
| TTFT (Time to First Token) | Baseline | Stable | No regression |
| Token-to-Token Latency | Baseline | Stable | No regression |
| Job Completion Time | Baseline | ~45.6% reduction | Jobs finish faster |
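The table mixes throughput multipliers with a percentage reduction in time, which are easy to misread side by side. The helpers below (hypothetical names, plain arithmetic) convert between the two, assuming a fixed amount of work.

```python
def speedup_from_time_reduction(reduction):
    """A job that takes (1 - reduction) of the original time runs this many times faster."""
    return 1.0 / (1.0 - reduction)


def pct_increase(multiplier):
    """Express a throughput multiplier as a percentage increase over baseline."""
    return (multiplier - 1.0) * 100.0


print(round(speedup_from_time_reduction(0.456), 2))  # 1.84
print(round(pct_increase(1.96)))                     # 96
```

So a roughly 45.6% reduction in job completion time is equivalent to about a 1.84× speedup, in the same range as the throughput gains in the table.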
Model Scale
The evaluation covered models including DeepSeek 27B and DeepSeek 660B, with production-grade agentic workloads. At thousands of GPUs, DualPath showed improved NIC load balance across the cluster — suggesting the approach scales well beyond small deployments.[3]
Prerequisites for Peak Performance
The headline numbers assume specific conditions:[2]
- High KV-Cache hit rates (≥95%): Agentic multi-turn workloads naturally produce these rates, but single-turn or low-reuse workloads will see smaller gains
- RDMA-capable interconnect: The compute network must support RDMA (InfiniBand or RoCE)
- Sufficient storage bandwidth: The total storage system must be able to feed both paths
- PD-separated architecture: DualPath specifically targets disaggregated prefill-decode setups
Why AI Agent Builders Should Care
This paper matters for anyone building agentic AI systems. Here's why:
The Agent Loop Problem
Every tool call in an agentic loop adds context. Consider an agent that:
- Receives a user query (Turn 1)
- Calls a search API, processes results (Turn 2)
- Reads a document, extracts key information (Turn 3)
- Calls a code execution tool (Turn 4)
- Synthesizes and responds (Turn 5)
Each turn requires loading the KV-Cache from all previous turns. By Turn 5, you're loading the accumulated cache from Turns 1–4. As agents become more capable (longer context windows, more tool integrations, deeper reasoning chains), the I/O cost of each turn grows linearly with the context accumulated so far, and the total across a session grows faster than linearly with the number of turns.
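The five-turn loop above can be sketched numerically. The per-turn token counts are made up for illustration. The point is how the amount of cached context reloaded per turn accumulates, and how quickly the cache hit rate climbs.

```python
def kv_tokens_loaded_per_turn(new_tokens_per_turn):
    """For each turn, the number of prior tokens whose cached KV must be reloaded."""
    loaded, context = [], 0
    for new in new_tokens_per_turn:
        loaded.append(context)  # everything from earlier turns is reloaded
        context += new          # this turn's tokens join the context
    return loaded


def hit_rate(cached_tokens, new_tokens):
    """Fraction of a turn's input tokens that were already cached."""
    return cached_tokens / (cached_tokens + new_tokens)


# Hypothetical new tokens added in Turns 1 to 5 (query, search results, document, code output, answer).
turns = [500, 4_000, 12_000, 2_000, 800]
per_turn = kv_tokens_loaded_per_turn(turns)
print(per_turn)                                  # [0, 500, 4500, 16500, 18500]
print(sum(per_turn))                             # 40000
print(f"{hit_rate(per_turn[-1], turns[-1]):.1%}")  # 95.9%
```

In this example the session adds 19,300 tokens in total but reloads 40,000 tokens' worth of KV-Cache, and by the last turn about 96% of the input is served from cache.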
The Scaling Problem
This isn't just a single-request latency issue. At scale, when you're serving thousands of concurrent agent sessions, the storage bandwidth becomes the hard ceiling on how many agents you can run simultaneously. DualPath effectively raises that ceiling by pooling bandwidth from idle decode-engine NICs.
Infrastructure vs. Architecture
DualPath is notable because it's a pure infrastructure optimization. It doesn't change the model, the attention mechanism, or the prompting strategy. This means it's applicable to any large-scale LLM serving system that uses PD separation — not just DeepSeek's models. If you're running vLLM, TensorRT-LLM, or any disaggregated serving framework, DualPath's principles could be adapted to your stack.
Broader Implications
DeepSeek Is Building a Full-Stack AI Company
This paper signals something important: DeepSeek isn't just publishing model papers. They're publishing infrastructure research — the kind of systems-level optimization that makes their famously cheap API pricing possible. When DeepSeek offers inference at a fraction of competitors' prices, papers like DualPath explain how.[2]
The Industry Shift: "How Do You Serve It?"
We're entering an era where the model itself is increasingly commoditized. The competitive advantage is shifting from "what model do you have?" to "how efficiently can you serve it?" DualPath represents this shift — it's not about making the model smarter, it's about making the infrastructure faster and cheaper.
The $0.01/1M Token Question
Infrastructure optimizations like DualPath, combined with DeepSeek's MoE (Mixture of Experts) architecture and aggressive quantization, help explain why inference costs are falling so quickly. When you can nearly double throughput without adding hardware, you're directly cutting the cost per token. This is the kind of engineering that pushes pricing toward the sub-penny-per-million-token range.
Open Research Benefits Everyone
By publishing DualPath as a paper (not just deploying it internally), DeepSeek enables the broader community to adopt these techniques. Other inference frameworks — vLLM, SGLang, TensorRT-LLM — could implement similar dual-path loading strategies. This raises the bar for the entire industry.
Limitations & Caveats
- RDMA dependency: Not all inference clusters have RDMA-capable interconnects. Cloud GPU instances may not expose RDMA capabilities.
- PD separation required: DualPath targets disaggregated architectures. Monolithic inference setups (common in smaller deployments) won't benefit directly.
- High cache reuse assumed: The best results require ≥95% KV-Cache hit rates. Short, single-turn conversations won't see significant gains.
- Added complexity: The global scheduler and dual-path coordination add operational complexity. Misconfiguration could shift bottlenecks rather than removing them.
- DeepSeek-internal evaluation: Results are from DeepSeek's proprietary inference system. Real-world performance on other stacks may differ.
Key Takeaways
- The bottleneck in multi-turn agentic inference is KV-Cache I/O, not GPU compute
- Every tool call adds context, making the I/O bottleneck worse with agent complexity
- DualPath achieves up to 1.87× offline and an average of 1.96× online throughput by using idle decode-engine NICs
- RDMA enables zero-CPU-overhead data transfer between inference engines
- This is an infrastructure fix — applicable to any PD-separated serving system, not just DeepSeek
- Peak gains require RDMA interconnect and ≥95% KV-Cache hit rates
- The industry is shifting from model innovation to serving infrastructure innovation
If you're scaling agentic AI systems, start measuring your KV-Cache I/O bottleneck. Chances are, your GPUs are waiting on storage more than you think. DualPath won't be the last paper in this space — but it's a clear signal that the future of LLM performance is as much about plumbing as it is about parameters.
References
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — Wu et al., arXiv:2602.21548, February 2026
- DualPath lifts throughput as RDMA eases KV-cache I/O — BitcoinEthereumNews, February 2026
- DualPath: Analysis, Review & Summary — Paperium, February 2026
- @BoWang87 tweet on DualPath paper — X/Twitter, February 2026
- DeepSeek DualPath: Stable TTFT under load — 36Kr, February 2026
- RDMA Core Documentation — NVIDIA Networking
- Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al. (vLLM), 2023
- DeepSeek — Open-source AI research — DeepSeek
- vLLM: Easy, fast, and cheap LLM serving — GitHub
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek, 2024