At a glance:

  1. 2.3×: CUDA Graphs speedup
  2. 4.6×: H100 FP8 vs. A100
  3. <4%: vLLM KV cache waste
  4. 6.3×: speculative decoding gain

1. Introduction: Why Real-Time AI Latency Matters

When a user types a prompt and waits for a response, every millisecond is a UX decision. For coding assistants, voice interfaces, and real-time recommendation engines, latency is not a secondary concern — it is the product. A language model that generates tokens at 10 tokens/second feels sluggish; one that hits 80+ tokens/second with sub-100ms time-to-first-token (TTFT) feels instantaneous.

But latency and throughput are fundamentally in tension. Optimizing for low latency often means processing fewer requests in parallel (small batch sizes, eager scheduling), while optimizing for throughput means batching aggressively, which introduces queuing delay. Real-time AI serving requires you to navigate this tradeoff explicitly — and the 16 techniques in this playbook give you the full toolkit to do it.

There are two key metrics every serving engineer must internalize: time-to-first-token (TTFT), the delay before the first output token reaches the user, and inter-token latency (usually reported as tokens/second), the pace at which subsequent tokens arrive. TTFT is dominated by queuing delay and prefill; inter-token latency by the decode loop.

The techniques in this article are organized by the system layer they address: streaming and parallelism, memory management, hardware-level acceleration, batching strategies, and inference algorithms. Each section includes concrete benchmarks where available. We conclude with a prioritization framework for stacking these techniques in production.

Target audience: ML infrastructure engineers, model serving teams, and developers deploying LLMs at scale. Familiarity with transformer architecture and GPU memory hierarchy assumed.

2. Streaming & Parallelism

Streaming Generation

Streaming generation refers to the practice of transmitting generated tokens to the client as they are produced, rather than buffering the full response. This is architecturally straightforward — the model decodes token by token in an autoregressive loop and each token is flushed to the response stream immediately — but it has a profound effect on perceived latency.

From the user's perspective, a response that begins appearing in 150ms feels far more responsive than one that delivers 500 tokens after a 3-second delay, even if the buffered response completes sooner in total. Streaming generation is the prerequisite for all other latency optimizations to matter subjectively. Implementation requires server-sent events (SSE) or chunked HTTP transfer on the serving layer, with OpenAI-compatible streaming APIs now the de facto standard.

The key engineering constraint: with streaming active, you cannot post-process or rerank the output before delivery. Any filtering or safety checks must run in-line at token emission time, adding complexity to the serving stack.
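The emission path can be sketched as a generator that formats each token as an OpenAI-style SSE chunk the moment it is produced. This is a minimal sketch with an illustrative model name; a real deployment would run its in-line safety checks inside this loop before each yield:

```python
import json

def sse_chunks(token_iter, model="demo-model"):
    """Format tokens from an autoregressive decode loop as OpenAI-style
    server-sent events, flushing each token as soon as it is produced."""
    for tok in token_iter:
        payload = {"choices": [{"delta": {"content": tok}}], "model": model}
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"  # OpenAI-compatible stream terminator
```

Served behind an HTTP framework's streaming response, each yielded chunk reaches the client without waiting for the full completion.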

Token Parallelism

Standard autoregressive decoding is strictly sequential: token N cannot be generated until token N-1 is complete. Token parallelism attacks this constraint through tensor parallelism (splitting model weights across multiple GPUs so a single forward pass uses all available compute simultaneously) and pipeline parallelism (partitioning transformer layers across GPU stages).

For decoding specifically, tensor parallelism with degree TP=2 or TP=4 is the most common configuration. The tradeoff: TP reduces per-token latency by splitting compute, but introduces all-reduce communication overhead between GPUs. On NVLink-connected systems, this overhead is minimal (<5% for TP=2); on PCIe-only systems, it can dominate. Microsoft's benchmarks on H100 show that for Llama 3.1 8B, running one model copy per GPU achieves higher throughput than tensor-parallel distribution across 8 GPUs — underscoring that TP is a latency, not throughput, optimization for smaller models.
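The two standard tensor-parallel matmul decompositions can be verified numerically with NumPy, treating array shards as stand-ins for per-GPU weights. This is a toy sketch; real implementations run the all-gather and all-reduce as NCCL collectives over NVLink or PCIe:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))      # one token's hidden state
W = rng.standard_normal((4096, 4096))   # a projection matrix

# Column-parallel: each "GPU" holds half the output columns; results
# are concatenated (an all-gather).
W0, W1 = np.hsplit(W, 2)
col_out = np.concatenate([x @ W0, x @ W1], axis=1)

# Row-parallel: each "GPU" holds half the input dimension; partial
# products must be summed -- this sum is the all-reduce that costs
# inter-GPU bandwidth every layer.
x0, x1 = np.hsplit(x, 2)
R0, R1 = np.vsplit(W, 2)
row_out = (x0 @ R0) + (x1 @ R1)

assert np.allclose(col_out, x @ W)
assert np.allclose(row_out, x @ W)
```

Transformer TP implementations interleave the two (column-parallel up-projection, row-parallel down-projection) so only one all-reduce is needed per MLP block.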

Async Prefill

The prefill phase (processing the input prompt through all transformer layers to populate the KV cache) and the decode phase (autoregressively generating output tokens) have fundamentally different compute profiles. Prefill is compute-bound (large matrix multiplications over the full prompt); decode is memory-bandwidth-bound (loading model weights for each single-token forward pass).

Async prefill (also called "prefill-decode disaggregation") separates these two workloads onto different compute resources. A dedicated prefill server processes incoming prompts and populates the KV cache; the resulting cache is transferred to a decode server that handles token generation. This prevents long prompts from blocking ongoing decode streams and enables independent scaling of each phase. Systems like Mooncake (used by Kimi) and DistServe have demonstrated 2–3× throughput improvements with this disaggregation pattern at scale.

Context Window Streaming

For very long contexts (64K–128K tokens), even efficient attention implementations can create substantial TTFT. Context window streaming addresses this by processing the prefill incrementally: the input is chunked into segments, and token generation begins once the first N chunks are processed, rather than waiting for the full context to be prefilled.

This technique is particularly valuable for document Q&A, long-form summarization, and code analysis workloads where the user provides a large context and expects an immediate response. The implementation complexity lies in managing partial KV cache states and ensuring the chunked prefill produces numerically identical results to full-context prefill.
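The numerical-equivalence requirement is easy to check on a toy single-head attention: prefilling in chunks against a growing KV cache must reproduce full-context causal attention exactly. A NumPy sketch with made-up dimensions:

```python
import numpy as np

def causal_attention(q, k, v, offset=0):
    """Single-head causal attention; `offset` is the absolute position of
    the first query row, so chunked queries mask correctly against a
    longer key/value cache."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    pos_q = np.arange(q.shape[0])[:, None] + offset
    pos_k = np.arange(k.shape[0])[None, :]
    scores = np.where(pos_k <= pos_q, scores, -np.inf)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d, chunk = 12, 8, 4
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

full = causal_attention(q, k, v)

# Chunked prefill: append each segment's K/V to the cache, then attend
# the segment's queries against everything cached so far.
outs, k_cache, v_cache = [], np.empty((0, d)), np.empty((0, d))
for s in range(0, n, chunk):
    k_cache = np.vstack([k_cache, k[s:s + chunk]])
    v_cache = np.vstack([v_cache, v[s:s + chunk]])
    outs.append(causal_attention(q[s:s + chunk], k_cache, v_cache, offset=s))
chunked = np.vstack(outs)

assert np.allclose(full, chunked)
```

The same identity is what production chunked-prefill implementations must preserve, with the added complications of batching, paged KV storage, and numerics in reduced precision.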

3. Memory Systems

PagedAttention

PagedAttention, introduced by the vLLM project, is arguably the single most impactful algorithmic contribution to LLM serving in recent years. Traditional serving frameworks allocated contiguous GPU memory for each request's KV cache at the maximum possible sequence length — causing 60–80% of allocated KV cache memory to be wasted through internal fragmentation and over-provisioning.

PagedAttention borrows the virtual memory paging model from operating systems. KV cache memory is divided into fixed-size blocks (pages), and each request's KV cache is composed of non-contiguous pages that are managed by a block table — similar to an OS page table. This eliminates both internal fragmentation (wasted space within an allocation) and external fragmentation (inability to use small scattered free blocks).
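A toy version of the block-table mechanics, with no sharing, eviction, or copy-on-write (class and variable names are illustrative, not vLLM's API):

```python
class PagedKVCache:
    """Toy paged KV-cache allocator in the spirit of PagedAttention:
    fixed-size blocks, per-request block tables, non-contiguous physical
    placement."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request -> block ids
        self.lengths = {}                           # request -> token count

    def append_token(self, req):
        """Reserve a slot for one more token, grabbing a new physical block
        only when the current one is full -- so waste is bounded by less
        than one block per request."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # first token, or current block full
            self.block_tables.setdefault(req, []).append(self.free_blocks.pop())
        self.lengths[req] = n + 1

    def physical_slot(self, req, pos):
        """Translate a logical token position to (physical block, offset),
        exactly like a page-table walk."""
        table = self.block_tables[req]
        return table[pos // self.block_size], pos % self.block_size

    def free(self, req):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(req))
        del self.lengths[req]
```

Because blocks return to a shared pool the instant a sequence finishes, no memory is stranded waiting for the longest sequence in a batch.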

The result: vLLM achieves under 4% KV cache memory waste, compared to 60–80% in prior systems. This dramatically increases the number of concurrent requests that can be served on a given GPU, enabling larger effective batch sizes and higher throughput. The vLLM paper reported 2–4× throughput improvement over HuggingFace Transformers and comparable or better performance vs. FasterTransformer across diverse workloads.

Key stat: PagedAttention reduces KV cache memory waste from 60–80% to under 4%, directly enabling larger batches and higher GPU utilization without any model changes.

KV Cache Quantization

The KV cache stores intermediate attention keys and values for all tokens in active sequences. At long context lengths, the KV cache can dominate GPU memory usage — sometimes exceeding model weight memory. For a 70B parameter model serving 100 concurrent requests at 4K context length, the KV cache in FP16 requires tens of gigabytes.

KV cache quantization reduces this footprint by storing keys and values in lower precision — typically INT8 or even INT4 — while performing attention computation in higher precision. The accuracy impact is generally minimal because attention patterns are relatively robust to quantization noise: the softmax operation that follows the QK dot product naturally compresses small numerical differences.

Implementations in vLLM, TensorRT-LLM, and SGLang support INT8 KV cache quantization with <1% accuracy degradation on standard benchmarks. The memory reduction is linear: INT8 halves KV cache memory vs. FP16, INT4 reduces it by 4×. For memory-constrained deployments this directly translates to 2–4× more concurrent requests on the same hardware.
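Per-channel symmetric INT8 quantization of a KV tensor is a few lines of NumPy. This sketches the storage scheme only; production kernels fuse the dequantize into the attention computation:

```python
import numpy as np

def quantize_int8(kv, axis=0):
    """Per-channel symmetric INT8 quantization: store int8 values plus one
    float scale per channel; dequantize on read."""
    amax = np.abs(kv).max(axis=axis, keepdims=True)
    scale = amax / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((128, 64)).astype(np.float32)  # [tokens, head_dim]
q8, scale = quantize_int8(k)
err = np.abs(dequantize(q8, scale) - k).max()

assert q8.nbytes == k.nbytes // 4  # int8 is 4x smaller than this fp32 input
```

Against FP16 storage the saving is 2×, as the text states; the per-channel scales add a negligible overhead of one float per head dimension.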

Memory Offload

When GPU VRAM is insufficient to hold both model weights and the KV cache for large batches, memory offloading moves data between GPU HBM and CPU DRAM (or NVMe SSDs) during inference. The key insight: while the model weights must be accessed on every decode step, the KV cache of sequences that are not scheduled in the current step does not need to stay resident on the GPU.

Systems like FlexGen pioneered offloading strategies for single-GPU inference on large models, achieving throughput that would otherwise require multi-GPU setups. The tradeoff is additional latency from PCIe transfer (roughly 32 GB/s for PCIe 4.0 x16 vs. 3.35 TB/s for H100 HBM3), so offloading is suitable for latency-tolerant, throughput-first workloads like batch processing and offline inference, not interactive serving.

4. Hardware Acceleration

CUDA Graphs

Each PyTorch operation in a transformer forward pass triggers a sequence of CUDA kernel launches from the CPU. For small batch sizes — common during decode when each step generates only one token per sequence — the overhead of launching dozens of individual CUDA kernels can dominate the actual compute time. CUDA Graphs solve this by recording the entire sequence of GPU operations during a "warmup" pass, then replaying the captured graph as a single GPU-side operation on subsequent calls.

The benchmark is striking: Fireworks AI measured that LLaMA-7B inference without CUDA Graphs executes at 30 tokens/sec; with CUDA Graphs enabled, it reaches 69 tokens/sec — a 2.3× speedup explained entirely by CPU kernel launch overhead reduction. CMU research on MLCEngine found CUDA Graph optimizations contribute up to 10% latency reduction in multi-GPU inference by reducing inter-GPU synchronization variability.

The engineering constraint: CUDA Graphs require static input shapes. For LLM inference, this means pre-capturing graphs for a fixed set of batch size / sequence length combinations (e.g., batch sizes 1, 2, 4, 8, 16, 32). Inputs that don't match a pre-captured shape fall back to eager execution. vLLM and TensorRT-LLM both implement CUDA Graph capture with shape bucketing as a core optimization.
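The bucketing logic itself is simple: route each batch to the smallest pre-captured graph that fits, padding up to the bucket size. A sketch with an illustrative bucket list (real frameworks choose their own):

```python
import bisect

# Batch sizes for which CUDA graphs were captured during warmup
# (hypothetical buckets for illustration).
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]

def select_graph_bucket(batch_size):
    """Pick the smallest captured bucket that fits the batch; the batch is
    padded up to the bucket size so the replayed graph's static shapes
    match. Returns None when no bucket fits, in which case the caller
    falls back to eager execution."""
    i = bisect.bisect_left(CAPTURED_BATCH_SIZES, batch_size)
    return CAPTURED_BATCH_SIZES[i] if i < len(CAPTURED_BATCH_SIZES) else None
```

The padding wastes a few slots of compute (e.g. a batch of 5 runs as 8), but that cost is normally far smaller than the kernel-launch overhead the graph replay eliminates.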

FP8 Kernels

NVIDIA H100 GPUs introduced native hardware support for FP8 (8-bit floating point) matrix multiplications via Tensor Cores. FP8 offers two formats: E4M3 (4 exponent bits, 3 mantissa bits, higher precision, good for weights and activations) and E5M2 (5 exponent bits, 2 mantissa bits, higher dynamic range, better for gradients). The H100's Tensor Cores can execute FP8 GEMM at roughly twice the FLOP/s of FP16.

The benchmark numbers are remarkable: TensorRT-LLM on H100 FP8 achieves up to 4.6× maximum throughput and 4.4× faster first-token latency compared to A100 FP16. At peak, the system generates over 10,000 output tokens/second at 64 concurrent requests while maintaining a first-token latency of 100ms. Baseten's production data shows FP8 quantization delivers approximately 33% faster inference — at batch size 128, Mistral 7B in FP8 on an H100 generates more than 16,000 total tokens per second.

FP8 inference requires per-tensor or per-channel scaling factors to maintain accuracy, and the quantization calibration process typically uses a small representative dataset. Accuracy degradation on standard benchmarks is generally under 0.5% for FP8 vs. BF16. TensorRT-LLM, vLLM (via modelopt), and SGLang all support FP8 inference on Hopper-architecture GPUs.
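The effect of E4M3's 3-bit mantissa can be simulated with a per-tensor scale plus mantissa rounding. This is a numerics-only sketch that ignores E4M3's exponent range and subnormals; real FP8 matmuls execute inside Tensor Cores:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def fake_quant_e4m3(x):
    """Simulate a per-tensor FP8 E4M3 quantize->dequantize round trip:
    scale into the E4M3 range, round the mantissa down to 1 implicit +
    3 explicit bits, then scale back."""
    scale = E4M3_MAX / np.abs(x).max()  # per-tensor scaling factor
    xs = x * scale
    m, e = np.frexp(xs)          # xs = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16    # keep 4 significant bits of mantissa
    return np.ldexp(m, e) / scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
wq = fake_quant_e4m3(w)
rel_err = np.abs(wq - w).max() / np.abs(w).max()
```

The worst-case relative rounding error of a 4-significant-bit mantissa is about 1/16, which is why per-tensor or per-channel scaling matters: the scale keeps values in the part of the range where that relative error is tolerable.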

GPU-CPU Overlap

Modern LLM serving systems pipeline GPU computation with CPU-side operations to prevent either resource from sitting idle. The key operations to overlap:

  1. Input preparation: tokenization, CPU-side buffer copies, and batch metadata construction for the next step while the current forward pass runs on GPU.
  2. Output processing: sampling bookkeeping, detokenization, and response streaming for the previous step's tokens.
  3. Host-device transfers: asynchronous copies issued on separate CUDA streams so they hide behind compute.

vLLM's anatomy blog documents how the system overlaps "prepare inputs" (CPU-side buffer copies and metadata construction) with "forward pass" execution by using CUDA streams with explicit synchronization points. Properly implemented, GPU-CPU overlap reduces end-to-end per-step latency by 5–15% in high-concurrency serving scenarios.

5. Batching Strategies

Dynamic Batching

Static batching — where a fixed number of requests are collected before any processing begins — introduces unnecessary queuing delay. A single long-running request can hold up short requests waiting in the same batch. Dynamic batching improves on this by forming batches adaptively: rather than waiting for exactly N requests, the scheduler forms a batch when either N requests are available or a timeout T expires, whichever comes first.

The timeout parameter (often called "max_wait_ms" or "preferred_batch_size_timeout") is the primary tuning knob. Setting it too low reduces batching efficiency (and throughput); too high increases queuing latency for requests that arrive in low-traffic periods. Production systems typically tune this dynamically based on observed request arrival rate, targeting batch sizes of 8–32 for latency-sensitive workloads and 64–256 for throughput-maximizing batch jobs.
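The size-or-timeout rule can be expressed as a deterministic simulation over arrival timestamps (a sketch; a real scheduler runs the same logic against a live clock and request queue):

```python
def form_batches(arrivals, max_batch=8, max_wait_ms=10):
    """Dynamic batching: dispatch when `max_batch` requests are waiting OR
    the oldest waiting request has waited `max_wait_ms`, whichever fires
    first. `arrivals` is a sorted list of arrival times in milliseconds."""
    batches, waiting, deadline = [], [], None
    for t in arrivals:
        if deadline is not None and t >= deadline:
            batches.append(waiting)          # timeout fired before this arrival
            waiting, deadline = [], None
        waiting.append(t)
        if deadline is None:
            deadline = t + max_wait_ms       # clock starts on the first request
        if len(waiting) == max_batch:
            batches.append(waiting)          # size trigger
            waiting, deadline = [], None
    if waiting:
        batches.append(waiting)
    return batches
```

With heavy traffic the size trigger dominates (full batches, high throughput); in quiet periods the timeout bounds the queuing latency any single request can accumulate.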

Continuous Batching

Continuous batching (also called "iteration-level scheduling" or "in-flight batching") is the most important batching innovation for LLM serving. In static batching, a batch of N requests occupies GPU resources from prefill through the last token of the longest sequence — even if most sequences finish much earlier. GPU compute sits idle for finished requests while the slowest sequence completes.

Continuous batching solves this at the iteration level: after every decode step, the scheduler checks for newly completed sequences and immediately slots in new requests from the queue. The "batch" is rebuilt every decode iteration, allowing the GPU to stay fully utilized even when request lengths vary dramatically.

Combined with PagedAttention (which makes the memory management for variable-length sequences practical), continuous batching enables vLLM to achieve state-of-the-art throughput. The original vLLM paper demonstrated up to 23× throughput improvement over HuggingFace Transformers naive serving and up to 3.5× vs. Orca (which implemented static continuous batching) on diverse workload mixes.

Key insight: Continuous batching + PagedAttention is the baseline that every competitive LLM serving system builds on. If you're not running both, you're leaving significant throughput on the table.
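The utilization difference is easy to see in a toy simulation that counts decode iterations for the same workload under both policies (sequence lengths and batch size below are illustrative):

```python
def static_batching_steps(lengths, batch):
    """Static batching: each batch occupies the GPU until its longest
    sequence finishes, so short sequences' slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])
    return steps

def continuous_batching_steps(lengths, batch):
    """Iteration-level scheduling: after every decode step, finished
    sequences are retired and queued requests are slotted in."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch:
            active.append(queue.pop(0))            # admit new requests
        steps += 1                                 # one decode iteration
        active = [n - 1 for n in active if n > 1]  # retire finished seqs
    return steps
```

For `lengths = [100, 10, 10, 10]` with batch size 2, static batching costs 110 iterations while continuous batching costs 100: the short sequences hide entirely behind the long one instead of serializing after it.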

Request Coalescing

Request coalescing (also called prefix caching or radix attention) exploits a common pattern in LLM serving: many requests share a common prefix — the system prompt. In a typical deployment, every API request prepends a multi-hundred-token system prompt before the user's query. Without prefix caching, each request independently prefills this shared prefix, wasting compute and memory.

With request coalescing, the KV cache for the shared prefix is computed once and reused across all subsequent requests with the same prefix. SGLang's RadixAttention implementation organizes KV cache blocks in a radix tree, enabling efficient sharing of common prefixes and even common suffixes (for few-shot example sequences). The throughput improvement for high-prefix-sharing workloads (chatbots with fixed system prompts, RAG pipelines with shared context) is substantial — often 30–60% reduction in prefill compute for typical deployments.
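A minimal version of the idea, using a plain trie over token ids rather than SGLang's radix tree: `lookup` returns how many leading tokens of a new prompt already have cached KV, so prefill can skip them.

```python
class PrefixCache:
    """Toy prefix cache keyed by token ids. Real systems map each trie
    node to KV-cache blocks and handle eviction; this sketch only tracks
    which prefixes have been seen."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def lookup(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, hit = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, hit = node[t], hit + 1
        return hit

system = [1, 2, 3, 4, 5]                 # shared system-prompt tokens
cache = PrefixCache()
cache.insert(system + [10, 11])          # first request populates the cache
reused = cache.lookup(system + [20, 21]) # second request reuses 5 tokens
```

Every reused token is a token whose prefill compute and KV memory the second request never pays for.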

6. Inference Optimization

Speculative Decoding

Speculative decoding is one of the most theoretically elegant techniques in this playbook. Standard autoregressive decoding generates exactly one token per forward pass through the full model — fundamentally serialized by the autoregressive constraint. Speculative decoding breaks this bottleneck by using a fast "draft" model to propose multiple candidate tokens, which are then verified in a single forward pass through the full "target" model.

The verification step leverages a key mathematical property: the target model can check whether it agrees with each draft token in parallel, because evaluating the probability of a token sequence requires a single forward pass (unlike generating the sequence, which requires N passes). If the target model accepts the draft token, both models move forward; if it rejects, the target model's corrected token is used and the draft sequence is truncated. Crucially, the output distribution is provably identical to standard autoregressive sampling.

The speedup depends on the acceptance rate (α), the fraction of draft tokens accepted by the target model, and on the draft length γ. Higher α means more tokens emitted per expensive target-model forward pass, while an overly long draft wastes compute on tokens that are likely to be rejected.

Best results come from domain-specific draft models: a small model fine-tuned on similar data to the target model achieves much higher acceptance rates than a generic small model. For coding tasks, a 7B draft model paired with a 70B target can achieve acceptance rates of 0.7–0.85, translating to effective 2–4× speedups.
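Under the commonly used i.i.d.-acceptance model from the original speculative sampling analysis, the expected number of tokens emitted per target-model forward pass has a closed form, which makes the α/γ tradeoff concrete:

```python
def expected_tokens_per_target_pass(alpha, gamma):
    """Expected tokens emitted per target-model forward pass when each of
    the `gamma` draft tokens is accepted independently with probability
    `alpha`: E = (1 - alpha**(gamma + 1)) / (1 - alpha). A rejection (or
    full acceptance) always contributes one token from the target model,
    so the result is at least 1."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)
```

At α = 0.8 with γ = 4 draft tokens, the target model emits about 3.36 tokens per pass; the wall-clock speedup is lower once the draft model's own cost is subtracted, which is why γ must be tuned jointly with the draft/target cost ratio.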

Early Exit Heads

Early exit (also called adaptive computation or "anytime inference") allows the model to stop at an earlier transformer layer when confidence in the next token is already high. The intuition: not every token requires the full depth of the model. Simple completions ("The capital of France is") can be resolved by layer 12 of a 40-layer model, while complex reasoning steps require the full forward pass.

Early exit heads attach lightweight classification heads (one linear layer + softmax) to intermediate transformer layers. At inference time, if the maximum probability token at layer L exceeds a threshold τ (e.g., 0.9), decoding stops and that token is emitted. The threshold trades accuracy for speed: higher τ means fewer early exits (more conservative), lower τ means more exits but higher risk of degraded output quality.

Research results vary significantly by task type. For knowledge retrieval and factual completion tasks, early exit heads achieve 30–50% layer reduction with minimal quality impact. For complex reasoning tasks (math, code generation, chain-of-thought), the model needs its full depth and early exit provides little benefit. The technique is best applied as a dynamic, per-token decision rather than a static configuration.
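The per-token exit decision reduces to a threshold test on each intermediate head's softmax. A sketch with illustrative logits and threshold; a real implementation must also handle sequences in the same batch exiting at different depths:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def early_exit_layer(layer_logits, tau=0.9):
    """Return the index of the first layer whose exit head is confident
    enough (max softmax probability >= tau); fall back to the final
    layer when no intermediate head clears the threshold."""
    for i, logits in enumerate(layer_logits):
        if softmax(logits).max() >= tau:
            return i
    return len(layer_logits) - 1
```

Raising τ trades speed for safety: fewer tokens exit early, but each early exit is more likely to match what the full-depth model would have emitted.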

Prefetch Pipelines

Prefetch pipelines address the latency between a request arriving and useful compute beginning. In a naive serving system, a request sits in a queue until a batch slot opens, then the full input is transferred to GPU and prefill begins. Prefetch pipelines start useful work earlier:

Prefetch pipelines are infrastructure-level optimizations that reduce the "dead time" between request arrival and meaningful GPU work. In high-QPS scenarios where requests arrive faster than individual batches complete, this queuing latency can be a significant fraction of overall TTFT. Proper prefetch pipeline implementation can reduce TTFT by 10–25% for typical interactive workloads.

7. Putting It Together: Priority Order & Stacking

With 16 techniques available, the practical question is: where do you start, and how do you stack them? Here is a prioritization framework based on impact-to-implementation-effort ratio.

Tier 1: Foundation (implement first, highest leverage)

  1. Continuous Batching + PagedAttention: Deploy vLLM or SGLang. This single step often delivers 5–20× throughput improvement over naive HuggingFace serving. No model changes required.
  2. CUDA Graphs: On by default in vLLM (enabled unless you pass --enforce-eager). 2.3× decode speedup for free. Requires shape bucketing but frameworks handle this automatically.
  3. FP8 Quantization (on H100+): Use TensorRT-LLM or vLLM's FP8 support. 33–46% throughput gain with minimal accuracy impact. Requires calibration dataset but the process is well-automated.
  4. Streaming Generation: Enable SSE or streaming API. Zero throughput cost, massive perceived latency improvement for end users.

Tier 2: Memory Optimization (unlock scale)

  1. KV Cache Quantization: Halves KV cache memory vs. FP16, roughly doubling concurrent request capacity. Enable via vLLM's --kv-cache-dtype fp8 flag (vLLM's 8-bit KV cache uses FP8; TensorRT-LLM and SGLang also offer INT8).
  2. Request Coalescing / Prefix Caching: Enable via SGLang's RadixAttention or vLLM's --enable-prefix-caching. High-value for deployments with fixed system prompts.
  3. Dynamic Batching: Tune max_wait_ms based on traffic patterns. Start with 5–20ms for interactive workloads.

Tier 3: Algorithmic Acceleration (high reward, higher effort)

  1. Speculative Decoding: Deploy once you have a high-quality draft model. Requires a matched small/large model pair and careful γ tuning. Highest latency gains (2–6×) for the right workloads.
  2. Async Prefill (Disaggregation): Complex infrastructure change. Worth it at 10K+ QPS where prefill and decode bottlenecks are clearly measured separately.
  3. Token Parallelism: Add TP=2 or TP=4 for latency reduction on large models (>70B). Not worth the communication overhead for smaller models on single GPUs.

Tier 4: Fine-Tuning (squeeze the last %)

  1. GPU-CPU Overlap: Already implemented in mature frameworks. Verify your framework version includes it.
  2. Context Window Streaming: For 32K+ context workloads with responsive UX requirements.
  3. Early Exit Heads: Requires model architecture modification. Best for factual retrieval workloads.
  4. Memory Offload: For batch/offline workloads where you need to run larger models than VRAM allows.
  5. Prefetch Pipelines: Framework-level optimization; verify your serving framework implements it before building custom.
  6. Request Coalescing (advanced): Multi-tenant prefix sharing across users — complex cache invalidation, but powerful for shared infrastructure.

Stacking Compatibility Matrix

Most of these techniques are orthogonal and can be stacked freely. Key compatibility notes: CUDA Graphs require static shapes, so they rely on the shape bucketing that mature frameworks already apply under continuous batching; speculative decoding complicates iteration-level scheduling because different sequences accept different numbers of draft tokens per step; and early exit heads are awkward to combine with large batches, since sequences exiting at different depths break the uniform per-layer batch.

🎯 The Practical Stack for 2026

For most production LLM serving deployments, the highest-leverage combination is: vLLM or SGLang with continuous batching + PagedAttention + FP8 (on H100) + CUDA Graphs + prefix caching + INT8 KV cache + speculative decoding (where a draft model is available).

This stack, properly configured, can achieve 10–50× throughput improvement over naive serving while maintaining <100ms TTFT for typical interactive workloads. The gap between a tuned and untuned serving setup is larger than the gap between H100 and A100 hardware — optimization pays more than hardware upgrades.

Measure your actual bottleneck (compute-bound vs. memory-bandwidth-bound vs. network-bound) before adding complexity. Profile first; optimize second.

8. References

  1. BentoML LLM Inference Handbook — Speculative Decoding. https://bentoml.com/llm/inference-optimization/speculative-decoding
  2. NVIDIA Technical Blog — An Introduction to Speculative Decoding for Reducing Latency in AI Inference (October 2025). https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
  3. Snowflake Engineering Blog — Fastest Speculative Decoding in vLLM with Arctic. https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/
  4. arXiv:2509.04474 — A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling. https://www.arxiv.org/pdf/2509.04474
  5. vLLM Project — GitHub Repository. https://github.com/vllm-project/vllm
  6. RunPod Blog — Introduction to vLLM and PagedAttention. https://www.runpod.io/blog/introduction-to-vllm-and-pagedattention
  7. vLLM Blog — Inside vLLM: Anatomy of a High-Throughput LLM Inference System (September 2025). https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
  8. NVIDIA TensorRT-LLM — H100 vs A100 Performance (FP8 4.6× throughput benchmark). https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html
  9. Baseten Blog — 33% Faster LLM Inference with FP8 Quantization (March 2024). https://www.baseten.co/blog/33-faster-llm-inference-with-fp8-quantization/
  10. Fireworks AI Blog — Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning (2.3× LLaMA-7B speedup). https://fireworks.ai/blog/speed-python-pick-two-how-cuda-graphs-enable-fast-python-code-for-deep-learning
  11. CMU CSD PhD Blog — Optimizing and Characterizing High-Throughput Low-Latency LLM Inference in MLCEngine. https://www.cs.cmu.edu/~csd-phd-blog/2024/low-latency-llm-serving/
  12. arXiv:2504.11750 — Characterizing and Optimizing LLM Inference. https://arxiv.org/pdf/2504.11750
  13. Microsoft Azure Blog — Performance of Llama 3.1 8B AI Inference using vLLM on ND-H100-v5 (August 2025). https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/performance-of-llama-3-1-8b-ai-inference-using-vllm-on-nd-h100-v5/4448355
  14. SimpliSmart Blog — Optimizing GLM-4.6 Inference on H100 GPUs: FP8, MTP, and High-Throughput Serving. https://simplismart.ai/blog/glm-46-simplismart