What is vLLM?
vLLM is an open-source, high-throughput and memory-efficient inference and serving engine for large language models. It's the difference between an LLM deployment that serves 10 requests per second and one that serves 200 — on the exact same hardware.
At its core, vLLM solves the hardest problem in LLM serving: KV cache memory management. When a model generates tokens, every token in the sequence (prompt and generated alike) produces key and value tensors that must stay in GPU memory until generation is complete. These tensors are massive (up to 1.7 GB for a single sequence in LLaMA-13B) and dynamic in size, and earlier serving systems wasted 60–80% of that memory through fragmentation and over-reservation.
vLLM introduced PagedAttention — a new attention algorithm inspired by OS virtual memory and paging — that eliminated nearly all of that waste. The result: dramatically higher throughput, better GPU utilization, and the ability to serve far more concurrent users on the same hardware.
Origin Story
vLLM was born at the UC Berkeley Sky Computing Lab, the successor to the Berkeley labs behind foundational systems such as Spark and Ray. The project was created by Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, and Lianmin Zheng, with faculty advisors Joseph E. Gonzalez, Hao Zhang, and Ion Stoica.
The founding motivation was practical: the team was running Chatbot Arena and Vicuna Demo at lmsys.org and discovered that even on expensive hardware, LLM serving was painfully slow and wasteful. They needed a way to serve thousands of users on limited compute — not a research toy, but something production-grade.
The project launched publicly in June 2023. By the time the SOSP 2023 paper ("Efficient Memory Management for Large Language Model Serving with PagedAttention") was published, vLLM had already become the de facto inference engine for anyone serious about LLM serving.
The SOSP paper formalized what the project had demonstrated empirically: PagedAttention achieves near-zero memory waste (under 4%), and vLLM's launch benchmarks showed up to 24× higher throughput than HuggingFace Transformers in common serving scenarios.
From Research Lab to Industry Standard
The trajectory was fast. Within months of launch, major AI companies — Anyscale, Replicate, NVIDIA, IBM, and others — adopted vLLM as their inference backbone. The open-source community exploded. By v0.17.0 (March 2026), the project had accumulated 699 commits from 272 contributors in a single release cycle — an extraordinary velocity for a systems project.
Core Architecture
vLLM's architecture is built around three core design decisions: PagedAttention for memory management, continuous batching for throughput, and an asynchronous engine design for production reliability.
System Overview
At a high level, vLLM consists of:
- LLM Engine: The core orchestrator that manages the request lifecycle, scheduling, and worker coordination
- Scheduler: Decides which requests to process, preempt, or swap based on available memory and priority
- Block Manager: Manages the physical KV cache blocks on GPU/CPU memory using the PagedAttention model
- Model Executor: Runs the actual model inference, supporting single-GPU, tensor-parallel, and pipeline-parallel execution
- OpenAI-compatible API Server: A FastAPI-based server with drop-in compatibility for OpenAI's /v1/chat/completions and /v1/completions endpoints
The design separates the control plane (scheduling, memory management) from the data plane (model execution), enabling vLLM to make smart scheduling decisions without blocking on model execution.
PagedAttention: The Core Innovation
PagedAttention is the insight that made vLLM possible. To understand why it matters, you need to understand how naive LLM serving wastes memory.
The Problem with Naive KV Cache Management
In transformer models, every token, whether it comes from the prompt or was just generated, produces key and value tensors at each layer that must remain in GPU memory throughout the generation process (the KV cache). The problem:
- Size uncertainty: You don't know how long a sequence will be until it's done generating. Traditional systems pre-allocate the maximum possible context length.
- Contiguous allocation: KV cache for each sequence must occupy contiguous GPU memory, leading to fragmentation as requests of different sizes come and go.
- No sharing: When multiple requests share a system prompt (common in API deployments), that prompt's KV cache is duplicated for every request.
The result: systems waste 60–80% of GPU memory, severely limiting how many requests can run concurrently.
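To make the scale concrete, the per-sequence figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes LLaMA-13B's published shape (40 layers, hidden size 5120), FP16 storage, and a 2,048-token maximum context. The function name and the 300-token example request are illustrative, not taken from vLLM.

```python
def kv_cache_bytes(num_tokens, num_layers=40, hidden_size=5120, bytes_per_value=2):
    """Bytes of KV cache for one sequence (assumed LLaMA-13B shape, FP16)."""
    # Each token stores one key vector and one value vector per layer.
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return num_tokens * per_token


per_token = kv_cache_bytes(1)          # 819,200 bytes, roughly 800 KB per token
full_context = kv_cache_bytes(2048)    # 1,677,721,600 bytes, roughly 1.7 GB

# If a system reserves the full 2,048-token context but the request
# only ever uses 300 tokens, most of the reservation sits unused.
used = kv_cache_bytes(300)
wasted_fraction = 1 - used / full_context

print(f"{per_token / 1e3:.0f} KB per token")
print(f"{full_context / 1e9:.2f} GB for a full 2,048-token sequence")
print(f"{wasted_fraction:.0%} of the reservation unused at 300 tokens")
```

Under these assumptions, a request that stops far short of the maximum context leaves the large majority of its reservation idle, which is the over-reservation cost described above.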
The PagedAttention Solution
PagedAttention borrows from classical OS concepts — virtual memory and paging — and applies them to KV cache management:
- Pages instead of contiguous allocation: KV cache is divided into fixed-size blocks (typically 16 tokens each). Blocks don't need to be contiguous in physical memory.
- Block table mapping: Each sequence has a logical block table that maps its logical blocks to physical memory blocks, just like OS page tables map virtual to physical addresses.
- On-demand allocation: Physical blocks are only allocated when new tokens are actually generated, eliminating over-reservation.
- Copy-on-write prefix sharing: Multiple requests can share the same physical blocks for common prefixes (like system prompts). Blocks are only copied when a request needs to modify them.
The memory waste drops from 60–80% to under 4% — essentially just the last (partially filled) block of each sequence. This is transformative for multi-user serving: vLLM can fit far more concurrent requests into the same GPU memory.
# The core idea, as a simplified sketch (not vLLM's actual implementation):
# Old way: reserve max_context_length * kv_size_per_token up front for every request.
# PagedAttention way: hand out fixed-size blocks only as they are needed.
class KVCacheManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                # tokens per block
        self.free_blocks = list(range(num_blocks))  # pool of physical block IDs

    def allocate(self, sequence):
        # Allocate one block at a time, only when the sequence needs it
        block = self.free_blocks.pop()
        sequence.block_table.append(block)

    def share_prefix(self, seq1, seq2, shared_prefix_len):
        # Point both sequences at the same physical blocks (full blocks only)
        shared_blocks = seq1.block_table[:shared_prefix_len // self.block_size]
        seq2.block_table[:len(shared_blocks)] = shared_blocks
        # Copy-on-write: only copy when a sequence modifies a shared block
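The copy-on-write step, which the pseudocode above leaves as a comment, can be made concrete with reference counts. This is a toy sketch rather than vLLM's block manager: the class and method names are hypothetical, and blocks are plain integer IDs instead of GPU tensors.

```python
class ToyBlockAllocator:
    """Toy reference-counted block pool (hypothetical names, no GPU memory)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table):
        # Share every block with a new sequence: bump counts, copy nothing.
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)

    def write(self, block_table, index):
        # Copy-on-write: if the block is shared, give the writer a private copy.
        block = block_table[index]
        if self.refcount[block] > 1:
            self.refcount[block] -= 1
            block_table[index] = self.allocate()
        return block_table[index]


allocator = ToyBlockAllocator(num_blocks=8)
parent = [allocator.allocate(), allocator.allocate()]  # e.g. a shared system prompt
child = allocator.fork(parent)                         # same physical blocks

allocator.write(child, 1)  # the child writes into the last block, triggering a copy
```

After the write, both sequences still share the first block, while the child owns a private replacement for the second. A real implementation would also copy the block's contents and return blocks to the pool when their count drops to zero.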
Continuous Batching
PagedAttention solves memory. Continuous batching solves throughput.
The Static Batching Problem
Traditional LLM serving used static batching: wait until you have N requests, batch them together, run inference, return results. The problem is that sequences in a batch have different lengths. Short sequences finish quickly — but the GPU sits idle waiting for the longest sequence to complete before it can start new work. GPU utilization craters.
Continuous (Iteration-Level) Batching
vLLM uses continuous batching (sometimes called iteration-level scheduling): at each forward pass (token generation step), the scheduler looks at all pending requests and adds newly arrived requests to the batch. Requests that have completed are immediately removed. There's no waiting for a "batch" to fill — the GPU stays busy doing useful work every single iteration.
This is the reason vLLM's throughput numbers look so dramatically better than HuggingFace Transformers (which uses static batching by default): continuous batching can achieve 3–24× higher throughput depending on request mix, simply by keeping the GPU occupied.
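The utilization gap can be illustrated with a toy simulation. The sketch below counts the forward passes needed to finish the same set of requests under both policies, assuming every request is already queued, each step produces one token per running sequence, and at most `max_batch` sequences run at once. The function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Finished sequences leave and queued ones join at every step."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before each forward pass.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        # One token per running sequence; drop the ones that just finished.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # tokens to generate per request
static_steps = static_batching_steps(lengths, max_batch=4)
continuous_steps = continuous_batching_steps(lengths, max_batch=4)
print(static_steps, continuous_steps)
```

With these made-up lengths, the static policy needs 200 steps and the continuous policy needs 105, because short requests no longer hold a slot hostage until the longest one in their batch completes. Real gains depend on arrival patterns and memory limits, which this toy ignores.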
Preemption and Swapping
When GPU memory gets full, vLLM doesn't just drop requests. The scheduler can:
- Preempt low-priority requests by swapping their KV cache blocks to CPU memory
- Resume them later when GPU memory becomes available
- Recompute KV cache for very short sequences instead of swapping (cheaper)
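The choice between swapping and recomputing can be sketched as a simple planning function. Everything here is a hypothetical illustration: the `RunningRequest` fields, the lowest-priority-first victim order, and the block-count threshold are assumptions made for the example, not vLLM's scheduler policy or defaults.

```python
from dataclasses import dataclass


@dataclass
class RunningRequest:
    request_id: str
    priority: int     # higher means more important
    num_blocks: int   # KV cache blocks currently held on the GPU


def plan_preemption(running, blocks_needed, recompute_threshold=2):
    """Pick lowest-priority victims until enough blocks are freed.

    Returns (request_id, action) pairs, where action is "recompute" for
    short sequences and "swap" for longer ones. The threshold is an
    illustrative assumption.
    """
    plan = []
    freed = 0
    for request in sorted(running, key=lambda r: r.priority):
        if freed >= blocks_needed:
            break
        action = "recompute" if request.num_blocks <= recompute_threshold else "swap"
        plan.append((request.request_id, action))
        freed += request.num_blocks
    return plan


running = [
    RunningRequest("chat-1", priority=5, num_blocks=40),
    RunningRequest("batch-7", priority=1, num_blocks=2),
    RunningRequest("batch-8", priority=2, num_blocks=30),
]
plan = plan_preemption(running, blocks_needed=10)
print(plan)
```

In this example the two low-priority batch requests are preempted (the short one by recomputation, the longer one by swapping) and the interactive chat request keeps running.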
Quantization Support
vLLM supports a comprehensive suite of quantization formats, making it viable across a wide range of hardware and latency/quality tradeoffs:
| Format | Bits | Speed | Quality Loss | Best For |
|---|---|---|---|---|
| FP16/BF16 | 16-bit | Baseline | None | Max quality, A100/H100 |
| FP8 | 8-bit | 1.5–2× faster | Minimal | Production serving, H100/H200 |
| INT8 (W8A8) | 8-bit | 1.4–1.8× faster | Very low | Production, A100 |
| GPTQ | 4-bit | ~2× faster | Low–Medium | Consumer GPUs, memory constrained |
| AWQ | 4-bit | ~2× faster | Low | Best quality at 4-bit |
| QLoRA | 4-bit + LoRA | Good | Very low | Fine-tuned models (v0.17.0+) |
| SqueezeLLM | 4-bit sparse | ~2× faster | Very low | High accuracy 4-bit |
As of v0.17.0, vLLM can now directly load quantized LoRA adapters (QLoRA) without requiring a separate dequantize-then-apply step — a significant workflow improvement for fine-tuned model serving.
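The "Bits" column translates directly into weight memory. As a rough sketch, the helper below estimates storage for a 13B-parameter model at each precision; it ignores quantization metadata such as scales and zero points, as well as the KV cache and activations, so real footprints will be somewhat higher. The function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB, ignoring scales and zero points."""
    return num_params * bits_per_weight / 8 / 1e9


params_13b = 13e9
for label, bits in [("FP16/BF16", 16), ("FP8 / INT8", 8), ("4-bit (GPTQ, AWQ)", 4)]:
    print(f"{label:>18}: {weight_memory_gb(params_13b, bits):.1f} GB")
```

Halving the bit width halves the weight footprint, and whatever memory that frees becomes available for KV cache blocks and therefore for more concurrent sequences.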
v0.17.0 — What's New (March 2026)
699 commits · 272 contributors (48 new) · PyTorch 2.10
v0.17.0 is a major milestone release that advances vLLM from "great production serving engine" to "complete large-scale inference platform." Here are the headline features:
FlashAttention 4 Integration
vLLM now supports the FlashAttention 4 backend — the next generation of the attention optimization that changed how transformers run on GPUs.
FlashAttention's key insight is tiling: instead of materializing the full attention matrix (which is O(n²) in sequence length), it computes attention in tiles that fit in SRAM, avoiding slow HBM reads. FlashAttention 4 extends this with:
- Warp specialization on Hopper/Blackwell GPUs — different warps handle data movement vs compute simultaneously, hiding memory latency
- Better pipelining of the softmax and matrix multiplications
- Native Blackwell support, with optimizations for B200 as well as the Hopper-generation H200
For long-context workloads (32K+ token sequences), FlashAttention 4 can deliver 1.5–2× speedup over FlashAttention 2 on modern NVIDIA hardware.
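The tiling idea rests on an online softmax: scores are processed one tile at a time while a running maximum and running sum keep the result exact. The pure-Python sketch below shows that arithmetic for a single query vector. It is an illustration of the math only, not a GPU kernel and not specific to FlashAttention 4; the function names are hypothetical, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize all scores for one query, then softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile_size=2):
    """Process keys/values one tile at a time with a running (online) softmax."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile_size):
        tile_k = keys[start:start + tile_size]
        tile_v = values[start:start + tile_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in tile_k]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, tile_v):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile_size=2)
```

Both functions return the same vector up to floating-point rounding, but the tiled version only ever holds one tile of scores at a time. On a GPU, that is what lets the working set stay in fast on-chip memory.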
Model Runner V2 Maturation
Model Runner V2 is vLLM's next-generation execution engine, and v0.17.0 marks its major maturity milestone. Key capabilities added:
- ⚙️ Pipeline Parallelism: Distribute model layers across multiple GPUs in a pipeline, so models that exceed single-GPU memory can run across multiple cards without full tensor parallelism overhead.
- 🔄 Decode Context Parallelism: Parallelizes the decode phase across multiple GPUs, reducing time-to-next-token for long-running generations. This is separate from tensor parallelism and can be combined with it.
- 🚀 Eagle3 Speculative Decoding with CUDA Graphs: Speculative decoding (a draft model proposes tokens, the main model verifies them in parallel) now works with CUDA graph capture in Model Runner V2, cutting per-kernel launch overhead.
- 📦 Piecewise & Mixed CUDA Graph Capture: CUDA graphs can now be captured for heterogeneous batches (prefill + decode mixed), and piecewise capture allows graphs to be built incrementally rather than all at once.
- 🏗️ New ModelState Architecture: A cleaner abstraction for managing model state across different parallelism strategies. Design docs are now available in the repository.
- 🤝 DP+EP for Speculative Decoding: Data Parallelism combined with Expert Parallelism now works with speculative decoding, enabling MoE models like DeepSeek to use speculative decoding at scale.
The --performance-mode Flag
One of the most user-facing changes in v0.17.0: a new --performance-mode flag that simplifies the previously opaque performance tuning process.
vllm serve meta-llama/Llama-3-8B \
--performance-mode throughput # maximize requests/sec
vllm serve meta-llama/Llama-3-8B \
--performance-mode interactivity # minimize time-to-first-token
vllm serve meta-llama/Llama-3-8B \
--performance-mode balanced # default: balance both
| Mode | Optimizes For | Best Use Case |
|---|---|---|
| throughput | Max requests/second, GPU utilization | Batch processing, offline inference, high-volume APIs |
| interactivity | Time-to-first-token (TTFT), latency | Chat interfaces, real-time applications, low-latency requirements |
| balanced | Both TTFT and throughput | General-purpose serving, mixed workloads |
Previously, achieving these different modes required manual tuning of --max-num-batched-tokens, --max-num-seqs, --gpu-memory-utilization, and other low-level parameters. The new flag abstracts this into a single, declarative choice.
Weight Offloading V2 with Prefetching
For deployments where the model doesn't fit entirely in GPU memory, Weight Offloading V2 introduces prefetching — the CPU starts loading the next layer's weights into GPU memory while the GPU is still computing the current layer.
This dramatically reduces the "I/O stall" that makes weight offloading painful in practice. Additional improvements:
- Selective CPU offloading: Instead of offloading all weights, you can offload specific layers (e.g., early layers that see less reuse) while keeping hot layers in GPU VRAM
- No pinned memory doubling: Previous offloading implementations required pinning twice the model size in CPU RAM (one copy for storage, one for transfer buffers). V2 eliminates this overhead.
The practical impact: you can now run 70B+ parameter models on consumer hardware with 24 GB VRAM (like RTX 3090) and get usable throughput — not just "technically works."
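The prefetching pattern itself is a classic double-buffered pipeline. The sketch below simulates it with a background thread: while one layer is being "computed", the next layer's weights are already being "copied". The sleeps and function names are stand-ins for illustration; this is not vLLM's offloading code, which would use GPU streams rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_layer(index):
    """Stand-in for a CPU-to-GPU weight copy (simulated with a sleep)."""
    time.sleep(0.01)
    return f"weights[{index}]"


def compute_layer(weights, activations):
    """Stand-in for running one layer on the GPU."""
    time.sleep(0.01)
    return activations + [weights]


def run_with_prefetch(num_layers):
    activations = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_layer, 0)
        for index in range(num_layers):
            weights = pending.result()  # wait for this layer's weights
            if index + 1 < num_layers:
                # Start copying the next layer before computing this one.
                pending = copier.submit(load_layer, index + 1)
            activations = compute_layer(weights, activations)
    return activations


trace = run_with_prefetch(num_layers=4)
print(trace)
```

Because the copy for layer N+1 overlaps with the compute for layer N, total time approaches the larger of the two costs per layer instead of their sum.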
Elastic Expert Parallelism (Milestone 2)
Mixture-of-Experts (MoE) models like Mixtral, DeepSeek, and the new Qwen3.5 family activate only a fraction of their parameters per token. This creates an imbalance: different GPUs may be doing wildly different amounts of work depending on which experts are activated.
Elastic Expert Parallelism allows dynamic GPU scaling for MoE models: experts can be redistributed across GPUs at runtime based on activation frequency and load. If expert 3 is receiving 80% of routed tokens while expert 7 sits idle, the system can move compute resources accordingly.
Milestone 2 in v0.17.0 adds the initial infrastructure for this capability. Full dynamic rebalancing is on the roadmap.
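One way to picture load-aware placement is a greedy heuristic: sort experts by observed load and assign each to the currently least-loaded GPU. This is a hypothetical sketch of the general idea, not the algorithm vLLM uses or plans to use; the function name and the load numbers are invented for the example.

```python
import heapq


def place_experts(expert_load, num_gpus):
    """Greedy placement: heaviest expert first, onto the least-loaded GPU."""
    heap = [(0.0, gpu, []) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        total, gpu, experts = heapq.heappop(heap)
        heapq.heappush(heap, (total + load, gpu, experts + [expert]))
    return {gpu: (total, experts) for total, gpu, experts in heap}


# Share of routed tokens per expert (made-up numbers that sum to 100).
expert_load = {0: 40, 1: 10, 2: 10, 3: 20, 4: 10, 5: 10}
placement = place_experts(expert_load, num_gpus=2)
print(placement)
```

A naive contiguous split (experts 0 to 2 on one GPU, 3 to 5 on the other) would give a 60/40 imbalance for these loads, while the greedy placement evens it out at 50/50. When a single expert dominates, moving whole experts is not enough, and replication would be needed instead.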
Qwen3.5 + GDN (Gated Delta Networks)
v0.17.0 adds full support for the Qwen3.5 model family, which introduces Gated Delta Networks (GDN) — a hybrid architecture combining attention layers with recurrent-style delta update layers for more efficient long-context processing.
vLLM's Qwen3.5 support includes:
- FP8 quantization support for Qwen3.5 MoE variants
- MTP (Multi-Token Prediction) speculative decoding
- Reasoning parser integration
- Qwen3-ASR realtime streaming for audio transcription
Use Cases
Enterprise AI APIs
vLLM's OpenAI-compatible API means any application using the OpenAI SDK works out of the box. Companies like Anyscale, Replicate, and Together.ai run vLLM as their inference backbone. The continuous batching architecture means you can handle thousands of concurrent API requests without linear GPU scaling.
# Drop-in OpenAI API replacement
from openai import OpenAI

client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-required",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
Local Multi-User Serving
For teams running their own inference infrastructure — internal tools, developer portals, or privacy-sensitive deployments — vLLM is the right choice when you have more than a handful of concurrent users. A single A100 80GB running vLLM can handle 50–200+ concurrent users on a 13B model depending on context length.
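A rough capacity estimate shows where a range like 50 to 200+ can come from. The sketch assumes FP16 weights for a 13B model (about 26 GB), roughly 800 KB of KV cache per token (LLaMA-13B shape: 2 × 40 layers × 5120 hidden × 2 bytes), and 90% of GPU memory usable. It ignores activations and runtime overhead, so treat the result as an upper bound. The function name is illustrative.

```python
def max_concurrent_sequences(gpu_memory_gb, weight_gb, avg_tokens_per_seq,
                             kv_bytes_per_token=819_200, memory_utilization=0.9):
    """Rough upper bound on sequences whose KV cache fits alongside the weights."""
    kv_budget_bytes = (gpu_memory_gb * memory_utilization - weight_gb) * 1e9
    return int(kv_budget_bytes // (kv_bytes_per_token * avg_tokens_per_seq))


for avg_tokens in (256, 1024, 2048):
    n = max_concurrent_sequences(80, 26, avg_tokens)
    print(f"avg {avg_tokens:>4} tokens per sequence: ~{n} concurrent sequences")
```

Under these assumptions an 80 GB card fits on the order of 200 short conversations or about 50 at 1,024 tokens each, which is why the achievable concurrency depends so strongly on context length.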
High-Context Research Workloads
With FlashAttention 4 and long-context model support (Qwen2.5-1M, Llama 3.1 128K, etc.), vLLM handles document-level tasks that would OOM naive serving setups: contract analysis, codebase summarization, multi-document research.
ASR (Automatic Speech Recognition)
v0.17.0 expanded vLLM beyond text-to-text. It now supports ASR models including FunASR, FireRedASR2, and Qwen3-ASR with realtime streaming — turning vLLM into a unified multimodal inference server for both text and audio workloads.
Multi-Model Serving
vLLM supports LoRA adapter hot-swapping — you can serve a single base model with multiple fine-tuned adapters loaded dynamically. This is transformative for organizations running dozens of task-specific models: instead of running 20 separate inference servers, you run one vLLM instance with 20 LoRA adapters loaded on demand.
vllm serve meta-llama/Llama-3-8B-Instruct \
--enable-lora \
--lora-modules \
customer-service=/path/to/cs-lora \
code-review=/path/to/code-lora \
summarizer=/path/to/sum-lora \
--max-loras 4 # how many to keep loaded simultaneously
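On the client side, an adapter is selected by passing its registered name in the request's `model` field. The sketch below builds such a request with the standard library only and does not send it; the server address and prompt are assumptions, and the adapter name matches the `--lora-modules` example above.

```python
import json
import urllib.request


def build_chat_request(base_url, adapter_name, user_message):
    """Build (but do not send) a chat completion request for one adapter."""
    payload = {
        "model": adapter_name,  # adapter name registered via --lora-modules
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",  # assumed local server address
    "code-review",
    "Review this function for bugs.",
)
# urllib.request.urlopen(request) would send it to a running server.
```

Switching tasks is then just a matter of changing the adapter name per request, while the base model stays loaded once.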
Competitor Comparison
vLLM doesn't exist in a vacuum. Here's how it stacks up against every major alternative and when you'd choose each:
| Engine | Best At | Weaknesses | When to Use |
|---|---|---|---|
| vLLM | Multi-user throughput, broad model support, active ecosystem | Complex setup, NVIDIA-focused, not edge-friendly | Production serving, 5+ concurrent users, API backends |
| TensorRT-LLM | Single-GPU peak performance on NVIDIA hardware | Complex engine compilation, NVIDIA-only, limited model support | Lowest possible latency on supported NVIDIA GPUs |
| Ollama | Ease of use, single-user local serving, macOS/Apple Silicon | Low throughput under concurrency, less GPU optimization | Personal use, developer testing, Mac hardware |
| llama.cpp | CPU inference, edge deployment, GGUF format ecosystem | Lower GPU throughput vs dedicated engines | No GPU, ARM devices, Raspberry Pi, Jetson |
| HuggingFace TGI | HuggingFace ecosystem integration, managed inference API | Lower throughput than vLLM, less flexible | Already deep in HuggingFace infra, managed cloud serving |
| DeepSpeed Inference | Very large models (100B+), ZeRO memory optimization | Complex setup, Microsoft-centric, lower throughput for smaller models | 175B+ models, multi-node inference, Microsoft Azure |
| SGLang | Structured generation (JSON/grammars), RadixAttention for shared prefixes | Smaller community, fewer models | Agentic workloads, constrained output, prompt reuse |
Performance Numbers in Context
Raw throughput comparisons from various benchmarks (LLaMA-class models, NVIDIA A100 80GB, mixed request lengths):
- TensorRT-LLM: ~180–220 req/sec (optimized for single setup, NVIDIA-only)
- vLLM: ~120–160 req/sec (broad hardware support, large model variety)
- HuggingFace TGI: ~100–140 req/sec
- Ollama: ~15–30 req/sec (optimized for single-user local use)
- HuggingFace Transformers (naive): ~8–15 req/sec
The nuance: TensorRT-LLM can beat vLLM's peak throughput on specific NVIDIA GPUs with specific models — but requires a multi-hour engine compilation step and fails with unsupported model architectures. vLLM runs nearly any HuggingFace model out of the box.
Strengths and Weaknesses
✅ Strengths
- Best throughput for multi-user concurrent serving
- Broad model support (nearly any HuggingFace model)
- Drop-in OpenAI API compatibility
- Active development (272 contributors, monthly releases)
- Broad hardware support (NVIDIA, AMD ROCm, Intel Gaudi, CPU)
- Production-tested (Anyscale, Together.ai, Replicate)
- LoRA adapter hot-swapping for multi-model serving
- Comprehensive quantization support (FP8, GPTQ, AWQ, QLoRA)
- FlashAttention 4 for long-context performance
- Speculative decoding to reduce latency
- Multi-modal (text + vision + audio in one server)
⚠️ Weaknesses
- Complex setup vs Ollama (Linux, CUDA, Python environment)
- Overkill for single-user local use
- TensorRT-LLM beats it on peak single-GPU latency (specific setups)
- No native macOS/Apple Silicon support
- Heavier resource footprint than llama.cpp
- Edge/embedded deployment not a design goal
- Documentation can lag behind fast release cadence
When NOT to Use vLLM
vLLM is not the right tool for every job. Be clear-eyed about its limitations:
Single-User Local Inference
If you're running models for personal use (querying a model yourself, one request at a time), Ollama is the right tool. vLLM's continuous batching provides little benefit when you're the only user. You'll get comparable single-request performance from Ollama with a fraction of the setup complexity.
Apple Silicon / macOS
vLLM primarily targets CUDA (NVIDIA) and ROCm (AMD) GPUs on Linux, with no native Apple Silicon support. If your hardware is a Mac with Apple Silicon, use Ollama (which runs on Apple's Metal backend) or llama.cpp with Metal support. Thanks to Apple's unified memory architecture, a Mac Studio with an M3 Ultra can run 70B+ models with good throughput for personal use cases.
Edge / IoT Deployment
Raspberry Pi? Jetson Nano? Embedded systems? llama.cpp is your answer. It runs on anything from a $35 Pi to a $500 Jetson with C++ efficiency, no Python runtime required.
Peak Latency on Specific NVIDIA GPUs
If you have a single H100/H200 and need to minimize time-to-first-token above all else for a specific model, TensorRT-LLM with a compiled engine can beat vLLM by 15–25%. The tradeoff: hours of engine compilation, NVIDIA-only, no flexibility to change models without recompiling.
Relevance to Our Local AI Stack
We currently run Ollama on a 4× RTX 3090 GPU rig as our local inference backend. Here's an honest assessment of when and whether we'd switch to vLLM.
What We're Running (Ollama Setup)
- Hardware: 4× RTX 3090 (96 GB VRAM total)
- Primary workloads: OpenClaw agent tasks, TTS (Kokoro), embedding generation
- Concurrency: Low — primarily our own agent workloads, not serving external users
- Models: Qwen3.5:35b-a3b, various coding models, embedding models
When Ollama Is the Right Answer (Now)
For our current workload — single-agent orchestration, low concurrency, ease of model management — Ollama wins on simplicity. ollama pull qwen3.5:35b versus setting up a Python environment, CUDA libraries, and a vLLM configuration is a real difference. When you're the primary user, that setup cost doesn't amortize.
When We'd Switch to vLLM
Specific scenarios where we'd move to vLLM:
- Building a team API: If other team members or services need to hit our local inference endpoint simultaneously, vLLM's continuous batching means dramatically better utilization of our 4× 3090 rig
- Multi-LoRA serving: We have multiple task-specific fine-tunes — customer service, code review, document analysis. Serving all of these from a single vLLM instance with hot-swappable LoRA adapters is more efficient than running separate Ollama instances
- High-context document processing: For batch-processing long documents (>32K tokens), vLLM + FlashAttention 4 handles this more efficiently than Ollama
- Production deployment: If we ever move our AI stack to a cloud instance serving external traffic, vLLM is the obvious choice
The Migration Path
The good news: vLLM's OpenAI-compatible API means migration from Ollama is mostly a small change in your client config. The base URL and the model name change (Ollama tags such as qwen3.5:35b become HuggingFace model IDs); the rest of your code doesn't.
from openai import OpenAI

# Ollama setup
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# vLLM setup (same client code, different URL)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# Starting vLLM (4-GPU tensor parallel)
vllm serve Qwen/Qwen3.5-32B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --performance-mode balanced \
    --port 8000
Verdict
🎯 Bottom Line
vLLM is the production standard for LLM inference serving. If you're serving LLMs to multiple users, building an AI API, or running enterprise workloads — vLLM is almost certainly the right choice. It has the best throughput-to-complexity ratio of any open-source inference engine at scale.
For personal use and developer experimentation, Ollama wins on ease. For peak single-GPU NVIDIA performance, TensorRT-LLM can edge it out. For CPU/edge/ARM, llama.cpp is the answer. But for anything in between — production GPU serving with real concurrent load — vLLM has no serious competition in the open-source world.
v0.17.0 specifically is a generational leap: FlashAttention 4, Model Runner V2 with pipeline parallelism, the performance mode flag, and Elastic Expert Parallelism together make vLLM the first open-source inference engine that can realistically compete with managed cloud inference for enterprise-scale MoE model serving.
Quick Decision Guide
| Your Situation | Recommendation |
|---|---|
| Personal local AI on Mac | Ollama |
| Personal local AI on Linux GPU rig | Ollama (for simplicity) or vLLM if you want more throughput |
| 5+ concurrent users, production API | vLLM |
| Minimum latency, specific NVIDIA GPU + model | TensorRT-LLM |
| CPU-only or edge device | llama.cpp |
| 175B+ model, multi-node | DeepSpeed Inference or vLLM with pipeline parallelism |
| Multi-LoRA serving (one base, many adapters) | vLLM |
| Structured JSON output / agentic workflows | vLLM (native JSON mode) or SGLang |
Published March 7, 2026. Based on vLLM v0.17.0 release notes and community benchmarks. Performance figures are approximate and hardware-dependent.