⚡ vLLM: The Production LLM Inference Engine

How UC Berkeley's PagedAttention breakthrough became the go-to serving engine for production AI — and what v0.17.0 brings to the table.

March 7, 2026 · 22 min read · Infrastructure, AI Inference


24× throughput vs HuggingFace Transformers

272 contributors in v0.17.0

<4% memory waste (vs 60–80% in naive systems)

v0.17.0: latest release (March 2026)

What is vLLM?

vLLM is an open-source, high-throughput and memory-efficient inference and serving engine for large language models. It's the difference between an LLM deployment that serves 10 requests per second and one that serves 200 — on the exact same hardware.

At its core, vLLM solves the hardest problem in LLM serving: KV cache memory management. As a model processes and generates tokens, each token produces key and value tensors that must stay in GPU memory until generation is complete. This cache is massive (up to 1.7 GB for a single sequence in LLaMA-13B) and dynamic in size, and earlier serving systems wasted 60–80% of the memory set aside for it through fragmentation and over-reservation.
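As a rough check on that 1.7 GB figure, the arithmetic can be sketched as follows. The model shape used here (40 layers, hidden size 5120, 2048-token context, FP16 values) is the published LLaMA-13B configuration; the function name is only for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


# LLaMA-13B: 40 layers, hidden size 5120, FP16 (2 bytes), 2048-token context
per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
per_sequence = per_token * 2048

print(f"{per_token / 1024:.0f} KiB per token")           # 800 KiB per token
print(f"{per_sequence / 1e9:.2f} GB per full sequence")  # 1.68 GB per full sequence
```

At roughly 800 KiB per token, a handful of full-length sequences already consumes several gigabytes, which is why reserving the maximum length up front is so costly.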

vLLM introduced PagedAttention — a new attention algorithm inspired by OS virtual memory and paging — that eliminated nearly all of that waste. The result: dramatically higher throughput, better GPU utilization, and the ability to serve far more concurrent users on the same hardware.

The one-sentence pitch: vLLM is what you run when you need to serve LLMs to real users at scale — it's the production-grade inference server that separates toy demos from real deployments.

Origin Story

vLLM was born at the UC Berkeley Sky Computing Lab, part of the Berkeley systems-research lineage behind projects such as Spark and Ray. The project was created by Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, and Lianmin Zheng, with faculty advisors Joseph E. Gonzalez, Hao Zhang, and Ion Stoica.

The founding motivation was practical: the team was running Chatbot Arena and Vicuna Demo at lmsys.org and discovered that even on expensive hardware, LLM serving was painfully slow and wasteful. They needed a way to serve thousands of users on limited compute — not a research toy, but something production-grade.

The project launched publicly in June 2023. By the time the SOSP 2023 paper ("Efficient Memory Management for Large Language Model Serving with PagedAttention") was published, vLLM had already become the de facto inference engine for anyone serious about LLM serving.

The SOSP paper formalized what the project had demonstrated empirically: PagedAttention achieves near-zero memory waste (under 4%) and enables up to 24× higher throughput than HuggingFace Transformers in common serving scenarios.

From Research Lab to Industry Standard

The trajectory was fast. Within months of launch, major AI companies — Anyscale, Replicate, NVIDIA, IBM, and others — adopted vLLM as their inference backbone. The open-source community exploded. By v0.17.0 (March 2026), the project had accumulated 699 commits from 272 contributors in a single release cycle — an extraordinary velocity for a systems project.

Core Architecture

vLLM's architecture is built around three core design decisions: PagedAttention for memory management, continuous batching for throughput, and an asynchronous engine design for production reliability.

System Overview

At a high level, vLLM consists of:

LLM Engine — The core orchestrator that manages the request lifecycle, scheduling, and worker coordination
Scheduler — Decides which requests to process, preempt, or swap based on available memory and priority
Block Manager — Manages the physical KV cache blocks on GPU/CPU memory using the PagedAttention model
Model Executor — Runs the actual model inference, supporting single-GPU, tensor-parallel, and pipeline-parallel execution
OpenAI-compatible API Server — A FastAPI-based server with drop-in compatibility for OpenAI's /v1/chat/completions and /v1/completions endpoints

The design separates the control plane (scheduling, memory management) from the data plane (model execution), enabling vLLM to make smart scheduling decisions without blocking on model execution.

PagedAttention: The Core Innovation

PagedAttention is the insight that made vLLM possible. To understand why it matters, you need to understand how naive LLM serving wastes memory.

The Problem with Naive KV Cache Management

In transformer models, every input token generates a key and value tensor that must remain in GPU memory throughout the generation process (the KV cache). The problem:

Size uncertainty: You don't know how long a sequence will be until it's done generating. Traditional systems pre-allocate the maximum possible context length.
Contiguous allocation: KV cache for each sequence must occupy contiguous GPU memory, leading to fragmentation as requests of different sizes come and go.
No sharing: When multiple requests share a system prompt (common in API deployments), that prompt's KV cache is duplicated for every request.

The result: systems waste 60–80% of GPU memory, severely limiting how many requests can run concurrently.

The PagedAttention Solution

PagedAttention borrows from classical OS concepts — virtual memory and paging — and applies them to KV cache management:

Pages instead of contiguous allocation: KV cache is divided into fixed-size blocks (typically 16 tokens each). Blocks don't need to be contiguous in physical memory.
Block table mapping: Each sequence has a logical block table that maps its logical blocks to physical memory blocks, just like OS page tables map virtual to physical addresses.
On-demand allocation: Physical blocks are only allocated when new tokens are actually generated, eliminating over-reservation.
Copy-on-write prefix sharing: Multiple requests can share the same physical blocks for common prefixes (like system prompts). Blocks are only copied when a request needs to modify them.

The memory waste drops from 60–80% to under 4% — essentially just the last (partially filled) block of each sequence. This is transformative for multi-user serving: vLLM can fit far more concurrent requests into the same GPU memory.

# The core idea as a simplified, runnable sketch (not vLLM's actual classes):
# Old way: allocate max_context_length * KV_size per request
# PagedAttention way: hand out fixed-size blocks on demand
class KVCacheManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size  # tokens per block
        self.free_blocks = list(range(num_blocks))  # pool of physical block IDs

    def allocate(self, sequence):
        # Only allocate one block at a time as needed
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks; the scheduler must preempt")
        block = self.free_blocks.pop()
        sequence.block_table.append(block)

    def share_prefix(self, seq1, seq2, shared_prefix_len):
        # Point both sequences at the same physical blocks (full blocks only)
        shared_blocks = seq1.block_table[:shared_prefix_len // self.block_size]
        seq2.block_table[:len(shared_blocks)] = shared_blocks
        # Copy-on-write: only copy when a sequence modifies a shared block
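The copy-on-write step is only mentioned as a comment above, so here is a minimal sketch of how it could work with reference counts. The class and method names (BlockPool, writable, and so on) are made up for this example and are not vLLM's internal API; a real implementation would also copy the block's KV data on the GPU.

```python
class BlockPool:
    """Toy physical block pool with reference counts (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        """A second sequence starts using an existing block."""
        self.refcount[block] += 1
        return block

    def writable(self, block: int) -> int:
        """Return a block that is safe to write: copy first if it is shared."""
        if self.refcount[block] == 1:
            return block
        self.refcount[block] -= 1
        return self.alloc()  # a real system would copy the KV data here


pool = BlockPool(num_blocks=8)
seq_a = [pool.alloc(), pool.alloc()]    # two blocks holding a shared prompt
seq_b = [pool.share(b) for b in seq_a]  # a second request reuses them

# seq_b appends a token into its last block, so that block must become private.
seq_b[-1] = pool.writable(seq_b[-1])

print(seq_a[0] == seq_b[0])  # True: the first block is still shared
print(seq_a[1] == seq_b[1])  # False: the last block was copied on write
```

Only the block being written is duplicated; every untouched prefix block stays shared, which is where the memory savings for common system prompts come from.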

Continuous Batching

PagedAttention solves memory. Continuous batching solves throughput.

The Static Batching Problem

Traditional LLM serving used static batching: wait until you have N requests, batch them together, run inference, return results. The problem is that sequences in a batch have different lengths. Short sequences finish quickly — but the GPU sits idle waiting for the longest sequence to complete before it can start new work. GPU utilization craters.

Continuous (Iteration-Level) Batching

vLLM uses continuous batching (sometimes called iteration-level scheduling): at each forward pass (token generation step), the scheduler looks at all pending requests and adds newly arrived requests to the batch. Requests that have completed are immediately removed. There's no waiting for a "batch" to fill — the GPU stays busy doing useful work every single iteration.

This is a large part of why vLLM's throughput numbers look so much better than HuggingFace Transformers (which uses static batching by default): combined with PagedAttention, continuous batching can deliver 3–24× higher throughput depending on request mix, largely by keeping the GPU occupied.
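To make the difference concrete, here is a toy simulation that counts forward passes under each policy. It assumes every running request produces exactly one token per step and ignores prefill cost, so treat it as a sketch of the scheduling idea rather than a model of vLLM's scheduler.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave after every step and waiting ones take their slot."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots before the next forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # One decode step: every running request produces one token.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Output lengths (in tokens) for eight requests, served with a batch size of 4.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, 4))      # 200
print(continuous_batching_steps(lengths, 4))  # 105
```

With static batching the short requests in each batch wait on the 100-token one; with continuous batching their slots are refilled immediately, so the whole workload finishes in roughly half the steps.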

Preemption and Swapping

When GPU memory gets full, vLLM doesn't just drop requests. The scheduler can:

Preempt low-priority requests by swapping their KV cache blocks to CPU memory
Resume them later when GPU memory becomes available
Recompute KV cache for very short sequences instead of swapping (cheaper)

Quantization Support

vLLM supports a comprehensive suite of quantization formats, making it viable across a wide range of hardware and latency/quality tradeoffs:

Format	Bits	Speed	Quality Loss	Best For
FP16/BF16	16-bit	Baseline	None	Max quality, A100/H100
FP8	8-bit	1.5–2× faster	Minimal	Production serving, H100/H200
INT8 (W8A8)	8-bit	1.4–1.8× faster	Very low	Production, A100
GPTQ	4-bit	~2× faster	Low–Medium	Consumer GPUs, memory constrained
AWQ	4-bit	~2× faster	Low	Best quality at 4-bit
QLoRA	4-bit + LoRA	Good	Very low	Fine-tuned models (v0.17.0+)
SqueezeLLM	4-bit sparse	~2× faster	Very low	High accuracy 4-bit

As of v0.17.0, vLLM can directly load quantized LoRA adapters (QLoRA) without a separate dequantize-then-apply step, a significant workflow improvement for fine-tuned model serving.

v0.17.0 — What's New (March 2026)

⚡ vLLM v0.17.0 — March 7, 2026

699 commits · 272 contributors (48 new) · PyTorch 2.10

v0.17.0 is a major milestone release that advances vLLM from "great production serving engine" to "complete large-scale inference platform." Here are the headline features:

FlashAttention 4 Integration

vLLM now supports the FlashAttention 4 backend — the next generation of the attention optimization that changed how transformers run on GPUs.

FlashAttention's key insight is tiling: instead of materializing the full attention matrix (which is O(n²) in sequence length), it computes attention in tiles that fit in SRAM, avoiding slow HBM reads. FlashAttention 4 extends this with:

Warp specialization on Hopper/Blackwell GPUs — different warps handle data movement vs compute simultaneously, hiding memory latency
Better pipelining of the softmax and matrix multiplications
Native Blackwell (B200) support, plus optimizations for Hopper-generation H200

For long-context workloads (32K+ token sequences), FlashAttention 4 can deliver 1.5–2× speedup over FlashAttention 2 on modern NVIDIA hardware.

Hardware note: FlashAttention 4 benefits are most pronounced on H100/H200/Blackwell hardware. On A100 and older cards, FlashAttention 2 (already included) is the right choice.
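The tiling idea itself is independent of any particular FlashAttention version. The sketch below shows it for a single query vector in plain Python: keys and values are consumed one tile at a time while a running maximum and running sum keep the softmax exact. It illustrates the online-softmax technique only; it is not the FlashAttention 4 kernel, and the function names are invented for this example.

```python
import math


def attention_full(q, keys, values):
    """Reference: materialize every score, then softmax and weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Process keys/values one tile at a time, keeping only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.4, -0.3], [1.0, 1.0], [-0.5, 0.8], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]

full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(full, tiled)))  # True
```

The tiled version never holds more than one tile of scores at a time, which is what lets the GPU kernel keep its working set in on-chip SRAM.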

Model Runner V2 Maturation

Model Runner V2 is vLLM's next-generation execution engine, and v0.17.0 marks its major maturity milestone. Key capabilities added:

⚙️ Pipeline Parallelism: Distribute model layers across multiple GPUs in a pipeline, enabling models that exceed single-GPU memory to run across multiple cards without full tensor parallelism overhead.
🔄 Decode Context Parallelism: Parallelizes the decode phase across multiple GPUs, reducing time-to-next-token for long-running generations. This is separate from tensor parallelism and can be combined with it.
🚀 Eagle3 Speculative Decoding with CUDA Graphs: Speculative decoding (a draft model proposes tokens, the main model verifies them in parallel) now works with CUDA graph capture in Model Runner V2, cutting per-step kernel launch overhead.
📦 Piecewise & Mixed CUDA Graph Capture: CUDA graphs can now be captured for heterogeneous batches (prefill + decode mixed), and piecewise capture allows graphs to be built incrementally rather than all at once.
🏗️ New ModelState Architecture: A cleaner abstraction for managing model state across different parallelism strategies. Design docs are now available in the repository.
🤝 DP+EP for Speculative Decoding: Data Parallelism combined with Expert Parallelism now works with speculative decoding, enabling MoE models like DeepSeek to use speculative decoding at scale.
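The draft-and-verify loop behind the speculative decoding items above can be sketched with two toy "models". This is generic greedy speculative decoding, not Eagle3 itself: target_next and draft_next are stand-in functions invented for the example, and the verification loop stands in for what would be a single batched forward pass on a GPU.

```python
def target_next(tokens):
    """Stand-in for the large model: next token is a fixed function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Stand-in for the small draft model: agrees with the target most of the time."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def speculative_decode(prompt, num_new_tokens, k=3):
    tokens = list(prompt)
    target_passes = 0
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks all k positions (one batched pass on a GPU).
        target_passes += 1
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if proposal[i] != expected:
                # 3. First mismatch: keep the target's token, drop the rest.
                tokens += proposal[:i] + [expected]
                break
        else:
            tokens += proposal  # every drafted token was accepted
    return tokens[:goal], target_passes


def greedy_decode(prompt, num_new_tokens):
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


spec_tokens, passes = speculative_decode([1, 2, 3], num_new_tokens=12)
print(spec_tokens == greedy_decode([1, 2, 3], 12))  # True: same output as the target alone
print(passes < 12)                                  # True: fewer target passes than tokens
```

The output is identical to decoding with the target model alone; the gain is that several tokens can be confirmed per expensive pass whenever the draft guesses correctly.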

The --performance-mode Flag

One of the most user-facing changes in v0.17.0: a new --performance-mode flag that simplifies the previously opaque performance tuning process.

vllm serve meta-llama/Meta-Llama-3-8B \
  --performance-mode throughput   # maximize requests/sec

vllm serve meta-llama/Meta-Llama-3-8B \
  --performance-mode interactivity  # minimize time-to-first-token

vllm serve meta-llama/Meta-Llama-3-8B \
  --performance-mode balanced  # default: balance both

Mode	Optimizes For	Best Use Case
`throughput`	Max requests/second, GPU utilization	Batch processing, offline inference, high-volume APIs
`interactivity`	Time-to-first-token (TTFT), latency	Chat interfaces, real-time applications, low-latency requirements
`balanced`	Both TTFT and throughput	General-purpose serving, mixed workloads

Previously, achieving these different modes required manual tuning of --max-num-batched-tokens, --max-num-seqs, --gpu-memory-utilization, and other low-level parameters. The new flag abstracts this into a single, declarative choice.
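For a sense of what that manual tuning looked like, the sketch below builds a vllm serve command from hand-picked values for two of the flags named above. The numbers are hypothetical and chosen only to show the direction of the trade-off (larger batches favor throughput, smaller ones favor latency); they are not what --performance-mode sets internally, and the model name is a placeholder.

```python
# Hypothetical presets for illustration only: these values are NOT taken from
# vLLM and are not what --performance-mode applies under the hood.
MANUAL_PRESETS = {
    "throughput": {"max-num-batched-tokens": 16384, "max-num-seqs": 512},
    "interactivity": {"max-num-batched-tokens": 2048, "max-num-seqs": 64},
}


def manual_serve_command(model: str, preset: str) -> list[str]:
    """Build a `vllm serve` argv that spells the knobs out by hand."""
    argv = ["vllm", "serve", model]
    for flag, value in MANUAL_PRESETS[preset].items():
        argv += [f"--{flag}", str(value)]
    return argv


print(" ".join(manual_serve_command("my-org/my-model", "interactivity")))
# vllm serve my-org/my-model --max-num-batched-tokens 2048 --max-num-seqs 64
```

Every deployment used to carry some version of this table; the new flag moves that choice into the engine.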

Weight Offloading V2 with Prefetching

For deployments where the model doesn't fit entirely in GPU memory, Weight Offloading V2 introduces prefetching — the CPU starts loading the next layer's weights into GPU memory while the GPU is still computing the current layer.

This dramatically reduces the "I/O stall" that makes weight offloading painful in practice. Additional improvements:

Selective CPU offloading: Instead of offloading all weights, you can offload specific layers (e.g., early layers that see less reuse) while keeping hot layers in GPU VRAM
No pinned memory doubling: Previous offloading implementations required pinning twice the model size in CPU RAM (one copy for storage, one for transfer buffers). V2 eliminates this overhead.

The practical impact: you can now run 70B+ parameter models on consumer hardware with 24 GB VRAM (like RTX 3090) and get usable throughput — not just "technically works."
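The prefetching pattern is essentially double buffering. A minimal sketch, assuming a Python worker thread stands in for the asynchronous CPU-to-GPU copy and with load_weights and run_layer as placeholder functions invented for this example:

```python
from concurrent.futures import ThreadPoolExecutor


def load_weights(layer_id):
    """Stand-in for a CPU-to-GPU copy of one layer's weights."""
    return {"layer": layer_id, "scale": layer_id + 1}


def run_layer(weights, activation):
    """Stand-in for the layer's forward computation."""
    return activation * weights["scale"]


def forward_with_prefetch(num_layers, activation):
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_weights, 0)
        for layer_id in range(num_layers):
            weights = pending.result()  # blocks only if the copy has not finished
            if layer_id + 1 < num_layers:
                # Start the next copy before computing the current layer.
                pending = loader.submit(load_weights, layer_id + 1)
            activation = run_layer(weights, activation)
    return activation


print(forward_with_prefetch(num_layers=4, activation=1))  # 24
```

As long as computing a layer takes at least as long as copying the next one, the transfer cost is hidden behind compute instead of being added to it.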

Elastic Expert Parallelism (Milestone 2)

Mixture-of-Experts (MoE) models like Mixtral, DeepSeek, and the new Qwen3.5 family activate only a fraction of their parameters per token. This creates an imbalance: different GPUs may be doing wildly different amounts of work depending on which experts are activated.

Elastic Expert Parallelism allows dynamic GPU scaling for MoE models: experts can be redistributed across GPUs at runtime based on activation frequency and load. If expert 3 is getting hammered with 80% of the routed tokens while expert 7 is idle, the system can move compute resources accordingly.

Milestone 2 in v0.17.0 adds the initial infrastructure for this capability. Full dynamic rebalancing is on the roadmap.
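To illustrate the kind of decision a rebalancer has to make, here is a simple greedy placement: assign the heaviest expert first, always to the least-loaded GPU. This is a textbook heuristic used purely as an example; it is an assumption for illustration and not the algorithm vLLM implements, and the load numbers are made up.

```python
import heapq


def place_experts(expert_load, num_gpus):
    """Greedy placement: heaviest expert first, always onto the least-loaded GPU."""
    heap = [(0, gpu, []) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu, experts = heapq.heappop(heap)
        heapq.heappush(heap, (gpu_load + load, gpu, experts + [expert]))
    return {gpu: (load, experts) for load, gpu, experts in heap}


# Tokens routed to each expert in some window (made-up numbers): expert 3 is hot.
expert_load = {0: 5, 1: 5, 2: 5, 3: 50, 4: 5, 5: 5, 6: 5, 7: 20}

placement = place_experts(expert_load, num_gpus=2)
for gpu in sorted(placement):
    load, experts = placement[gpu]
    print(f"GPU {gpu}: load={load}, experts={experts}")
```

A naive split of four experts per GPU would put 65 units of load on one card and 35 on the other; here the hot expert gets a GPU to itself and both end up at 50.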

Qwen3.5 + GDN (Gated Delta Networks)

v0.17.0 adds full support for the Qwen3.5 model family, which introduces Gated Delta Networks (GDN) — a hybrid architecture combining attention layers with recurrent-style delta update layers for more efficient long-context processing.

vLLM's Qwen3.5 support includes:

FP8 quantization support for Qwen3.5 MoE variants
MTP (Multi-Token Prediction) speculative decoding
Reasoning parser integration
Qwen3-ASR realtime streaming for audio transcription

Use Cases

Enterprise AI APIs

vLLM's OpenAI-compatible API means most applications built on the OpenAI SDK work after changing only the base URL and model name. Companies like Anyscale, Replicate, and Together.ai run vLLM as their inference backbone. The continuous batching architecture means you can handle thousands of concurrent API requests without adding GPUs in proportion to request volume.

# Drop-in OpenAI API replacement
from openai import OpenAI

client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-required"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Local Multi-User Serving

For teams running their own inference infrastructure — internal tools, developer portals, or privacy-sensitive deployments — vLLM is the right choice when you have more than a handful of concurrent users. A single A100 80GB running vLLM can handle 50–200+ concurrent users on a 13B model depending on context length.
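A back-of-the-envelope estimate shows where a range like that comes from. The sketch assumes a LLaMA-13B-shaped model in FP16 (about 26 GB of weights, about 800 KiB of KV cache per token) and that 90% of GPU memory is usable; it ignores activations and other runtime overhead, so read the results as upper bounds.

```python
def max_concurrent_sequences(gpu_gb, weights_gb, kv_bytes_per_token, avg_tokens, utilization=0.9):
    """Sequences whose KV cache fits in the memory left after loading the weights."""
    kv_budget_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return int(kv_budget_bytes // (kv_bytes_per_token * avg_tokens))


KV_BYTES_PER_TOKEN = 2 * 40 * 5120 * 2  # key + value, 40 layers, hidden size 5120, FP16
WEIGHTS_GB = 26                         # 13B parameters x 2 bytes

for avg_tokens in (256, 1024, 2048):
    n = max_concurrent_sequences(80, WEIGHTS_GB, KV_BYTES_PER_TOKEN, avg_tokens)
    print(f"{avg_tokens:>5} tokens/sequence -> ~{n} concurrent sequences")
```

Under these assumptions the answer drops from about 219 sequences at 256 tokens each to about 54 at 1,024 and 27 at 2,048, which is why context length dominates any capacity planning.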

High-Context Research Workloads

With FlashAttention 4 and long-context model support (Qwen2.5-1M, Llama 3.1 128K, etc.), vLLM handles document-level tasks that would OOM naive serving setups: contract analysis, codebase summarization, multi-document research.

ASR (Automatic Speech Recognition)

v0.17.0 expanded vLLM beyond text-to-text. It now supports ASR models including FunASR, FireRedASR2, and Qwen3-ASR with realtime streaming — turning vLLM into a unified multimodal inference server for both text and audio workloads.

Multi-Model Serving

vLLM supports LoRA adapter hot-swapping — you can serve a single base model with multiple fine-tuned adapters loaded dynamically. This is transformative for organizations running dozens of task-specific models: instead of running 20 separate inference servers, you run one vLLM instance with 20 LoRA adapters loaded on demand.

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules \
    customer-service=/path/to/cs-lora \
    code-review=/path/to/code-lora \
    summarizer=/path/to/sum-lora \
  --max-loras 4  # how many to keep loaded simultaneously
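On the client side, an adapter registered through --lora-modules is selected by passing its name as the model field of an ordinary request. A small routing helper might look like the sketch below; the helper name and the base-model placeholder are assumptions for this example, while the adapter names match the command above.

```python
ADAPTERS = {"customer-service", "code-review", "summarizer"}  # names from --lora-modules


def chat_request(task: str, prompt: str, base_model: str) -> dict:
    """Pick the adapter for a task, falling back to the base model."""
    model = task if task in ADAPTERS else base_model
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


request = chat_request("code-review", "Review this function for bugs.", base_model="my-base-model")
print(request["model"])  # code-review

# With an OpenAI client pointed at the vLLM server:
# client.chat.completions.create(**request)
```

Because adapter choice is just a string in the request, one endpoint can serve every task-specific model without any client-side awareness of LoRA.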

Competitor Comparison

vLLM doesn't exist in a vacuum. Here's how it stacks up against every major alternative and when you'd choose each:

Engine	Best At	Weaknesses	When to Use
vLLM	Multi-user throughput, broad model support, active ecosystem	Complex setup, NVIDIA-focused, not edge-friendly	Production serving, 5+ concurrent users, API backends
TensorRT-LLM	Single-GPU peak performance on NVIDIA hardware	Complex engine compilation, NVIDIA-only, limited model support	Maximum latency optimization on supported NVIDIA GPUs
Ollama	Ease of use, single-user local serving, macOS/Apple Silicon	Low throughput under concurrency, less GPU optimization	Personal use, developer testing, Mac hardware
llama.cpp	CPU inference, edge deployment, GGUF format ecosystem	Lower GPU throughput vs dedicated engines	No GPU, ARM devices, Raspberry Pi, Jetson
HuggingFace TGI	HuggingFace ecosystem integration, managed inference API	Lower throughput than vLLM, less flexible	Already deep in HuggingFace infra, managed cloud serving
DeepSpeed Inference	Very large models (100B+), ZeRO memory optimization	Complex setup, Microsoft-centric, lower throughput for smaller models	175B+ models, multi-node inference, Microsoft Azure
SGLang	Structured generation (JSON/grammars), RadixAttention for shared prefixes	Smaller community, fewer models	Agentic workloads, constrained output, prompt reuse

Performance Numbers in Context

Raw throughput comparisons from various benchmarks (LLaMA-class models, NVIDIA A100 80GB, mixed request lengths):

TensorRT-LLM: ~180–220 req/sec (optimized for single setup, NVIDIA-only)
vLLM: ~120–160 req/sec (broad hardware support, large model variety)
HuggingFace TGI: ~100–140 req/sec
Ollama: ~15–30 req/sec (optimized for single-user local use)
HuggingFace Transformers (naive): ~8–15 req/sec

The nuance: TensorRT-LLM can beat vLLM's peak throughput on specific NVIDIA GPUs with specific models — but requires a multi-hour engine compilation step and fails with unsupported model architectures. vLLM runs nearly any HuggingFace model out of the box.

Strengths and Weaknesses

✅ Strengths

Best throughput for multi-user concurrent serving
Near-universal model support (most HuggingFace model architectures)
Drop-in OpenAI API compatibility
Active development (272 contributors, monthly releases)
Broad hardware support (NVIDIA, AMD ROCm, Intel Gaudi, CPU)
Production-tested (Anyscale, Together.ai, Replicate)
LoRA adapter hot-swapping for multi-model serving
Comprehensive quantization support (FP8, GPTQ, AWQ, QLoRA)
FlashAttention 4 for long-context performance
Speculative decoding to reduce latency
Multi-modal (text + vision + audio in one server)

⚠️ Weaknesses

Complex setup vs Ollama (Linux, CUDA, Python environment)
Overkill for single-user local use
TensorRT-LLM beats it on peak single-GPU latency (specific setups)
No native macOS/Apple Silicon support
Heavier resource footprint than llama.cpp
Edge/embedded deployment not a design goal
Documentation can lag behind fast release cadence

When NOT to Use vLLM

vLLM is not the right tool for every job. Be clear-eyed about its limitations:

Single-User Local Inference

If you're running models for personal use (querying a model yourself, one request at a time), Ollama is the right tool. vLLM's continuous batching provides little benefit when you're the only user. You'll get comparable single-request performance from Ollama with a fraction of the setup complexity.

Apple Silicon / macOS

vLLM is built primarily for Linux with CUDA (NVIDIA) or ROCm (AMD) GPUs. If your hardware is a Mac with Apple Silicon, use Ollama (which runs on Apple's Metal backend natively) or llama.cpp with Metal support. Apple's unified memory architecture means a Mac Studio with an M3 Ultra can run 70B+ models with excellent throughput for personal use cases.

Edge / IoT Deployment

Raspberry Pi? Jetson Nano? Embedded systems? llama.cpp is your answer. It runs on anything from a $35 Pi to a $500 Jetson with C++ efficiency, no Python runtime required.

Peak Latency on Specific NVIDIA GPUs

If you have a single H100/H200 and need to minimize time-to-first-token above all else for a specific model, TensorRT-LLM with a compiled engine can beat vLLM by 15–25%. The tradeoff: hours of engine compilation, NVIDIA-only, no flexibility to change models without recompiling.

Relevance to Our Local AI Stack

We currently run Ollama on a 4× RTX 3090 GPU rig as our local inference backend. Here's an honest assessment of when and whether we'd switch to vLLM.

What We're Running (Ollama Setup)

Hardware: 4× RTX 3090 (96 GB VRAM total)
Primary workloads: OpenClaw agent tasks, TTS (Kokoro), embedding generation
Concurrency: Low — primarily our own agent workloads, not serving external users
Models: Qwen3.5:35b-a3b, various coding models, embedding models

When Ollama Is the Right Answer (Now)

For our current workload (single-agent orchestration, low concurrency, ease of model management), Ollama wins on simplicity. Running ollama pull qwen3.5:35b is a one-line job; setting up a Python environment, CUDA libraries, and a vLLM configuration is not. When you're the primary user, that setup cost doesn't amortize.

When We'd Switch to vLLM

Switch threshold: If we're serving more than ~5 concurrent users, building an internal AI API, or need to serve multiple LoRA-adapted models simultaneously — vLLM becomes worth the setup complexity.

Specific scenarios where we'd move to vLLM:

Building a team API: If other team members or services need to hit our local inference endpoint simultaneously, vLLM's continuous batching means dramatically better utilization of our 4× 3090 rig
Multi-LoRA serving: We have multiple task-specific fine-tunes — customer service, code review, document analysis. Serving all of these from a single vLLM instance with hot-swappable LoRA adapters is more efficient than running separate Ollama instances
High-context document processing: For batch-processing long documents (>32K tokens), vLLM + FlashAttention 4 handles this more efficiently than Ollama
Production deployment: If we ever move our AI stack to a cloud instance serving external traffic, vLLM is the obvious choice

The Migration Path

The good news: vLLM's OpenAI-compatible API means migration from Ollama is mostly a matter of changing the base URL (and the model name) in your client config. The serving endpoint changes; the rest of your code doesn't.

# Python client: Ollama setup
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Python client: vLLM setup (same client code, different URL)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# Shell: starting vLLM (4-GPU tensor parallel)
vllm serve Qwen/Qwen3.5-32B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --performance-mode balanced \
  --port 8000

Verdict

🎯 Bottom Line

vLLM is the production standard for LLM inference serving. If you're serving LLMs to multiple users, building an AI API, or running enterprise workloads — vLLM is almost certainly the right choice. It has the best throughput-to-complexity ratio of any open-source inference engine at scale.

For personal use and developer experimentation, Ollama wins on ease. For peak single-GPU NVIDIA performance, TensorRT-LLM can edge it out. For CPU/edge/ARM, llama.cpp is the answer. But for anything in between — production GPU serving with real concurrent load — vLLM has no serious competition in the open-source world.

v0.17.0 specifically is a generational leap: FlashAttention 4, Model Runner V2 with pipeline parallelism, the performance mode flag, and Elastic Expert Parallelism together make vLLM the first open-source inference engine that can realistically compete with managed cloud inference for enterprise-scale MoE model serving.

Quick Decision Guide

Your Situation	Recommendation
Personal local AI on Mac	Ollama
Personal local AI on Linux GPU rig	Ollama (for simplicity) or vLLM if you want more throughput
5+ concurrent users, production API	vLLM
Minimum latency, specific NVIDIA GPU + model	TensorRT-LLM
CPU-only or edge device	llama.cpp
175B+ model, multi-node	DeepSpeed Inference or vLLM with pipeline parallelism
Multi-LoRA serving (one base, many adapters)	vLLM
Structured JSON output / agentic workflows	vLLM (native JSON mode) or SGLang

Resources:
GitHub: vllm-project/vllm · Documentation · PagedAttention Paper (SOSP 2023) · vLLM Blog

Published March 7, 2026. Based on vLLM v0.17.0 release notes and community benchmarks. Performance figures are approximate and hardware-dependent.