24× throughput vs HuggingFace Transformers
272 contributors in v0.17.0
<4% memory waste (vs 60–80% in naive systems)
v0.17.0: latest release (March 2026)

What is vLLM?

vLLM is an open-source, high-throughput and memory-efficient inference and serving engine for large language models. It's the difference between an LLM deployment that serves 10 requests per second and one that serves 200 — on the exact same hardware.

At its core, vLLM solves the hardest problem in LLM serving: KV cache memory management. When a model generates tokens, every input token creates a key-value tensor that must stay in GPU memory until generation is complete. These tensors are massive (up to 1.7 GB for a single sequence in LLaMA-13B), dynamic in size, and traditionally wasted 60–80% of GPU memory due to fragmentation and over-reservation.
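The 1.7 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes LLaMA-13B's published shape (40 layers, hidden size 5120), FP16 storage (2 bytes per value), and a 2048-token sequence; the function name is illustrative and not part of vLLM.

```python
def kv_cache_bytes(num_tokens, num_layers=40, hidden_size=5120, bytes_per_value=2):
    """Bytes of KV cache for one sequence: one key and one value vector per token per layer."""
    per_token = 2 * num_layers * hidden_size * bytes_per_value  # 2 = key + value
    return num_tokens * per_token

per_token = kv_cache_bytes(1)
full_sequence = kv_cache_bytes(2048)
print(f"{per_token / 1024:.0f} KiB per token")          # 800 KiB per token
print(f"{full_sequence / 1e9:.2f} GB for 2048 tokens")  # 1.68 GB for 2048 tokens
```

At roughly 800 KiB per token, a few dozen long sequences are enough to fill a GPU, which is why how this memory is allocated matters so much.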

vLLM introduced PagedAttention — a new attention algorithm inspired by OS virtual memory and paging — that eliminated nearly all of that waste. The result: dramatically higher throughput, better GPU utilization, and the ability to serve far more concurrent users on the same hardware.

The one-sentence pitch: vLLM is what you run when you need to serve LLMs to real users at scale — it's the production-grade inference server that separates toy demos from real deployments.

Origin Story

vLLM was born at the UC Berkeley Sky Computing Lab, the successor to the Berkeley labs (AMPLab and RISELab) behind foundational systems such as Spark and Ray. The project was created by Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, and Lianmin Zheng, with faculty advisors Joseph E. Gonzalez, Hao Zhang, and Ion Stoica.

The founding motivation was practical: the team was running Chatbot Arena and Vicuna Demo at lmsys.org and discovered that even on expensive hardware, LLM serving was painfully slow and wasteful. They needed a way to serve thousands of users on limited compute — not a research toy, but something production-grade.

The project launched publicly in June 2023. By the time the SOSP 2023 paper ("Efficient Memory Management for Large Language Model Serving with PagedAttention") was published, vLLM had already become the de facto inference engine for anyone serious about LLM serving.

The SOSP paper formalized what the project had demonstrated empirically: PagedAttention achieves near-zero memory waste (under 4%) and enables up to 24× higher throughput than HuggingFace Transformers for common serving scenarios.

From Research Lab to Industry Standard

The trajectory was fast. Within months of launch, major AI companies — Anyscale, Replicate, NVIDIA, IBM, and others — adopted vLLM as their inference backbone. The open-source community exploded. By v0.17.0 (March 2026), the project had accumulated 699 commits from 272 contributors in a single release cycle — an extraordinary velocity for a systems project.

Core Architecture

vLLM's architecture is built around three core design decisions: PagedAttention for memory management, continuous batching for throughput, and an asynchronous engine design for production reliability.

System Overview

At a high level, vLLM consists of:

The design separates the control plane (scheduling, memory management) from the data plane (model execution), enabling vLLM to make smart scheduling decisions without blocking on model execution.

PagedAttention: The Core Innovation

PagedAttention is the insight that made vLLM possible. To understand why it matters, you need to understand how naive LLM serving wastes memory.

The Problem with Naive KV Cache Management

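In transformer models, every input token generates a key and value tensor that must remain in GPU memory throughout the generation process (the KV cache). The problem is that these tensors are large, grow with every generated token, and have a final size that is unknown up front, so naive systems reserve a contiguous region sized for the maximum context length. Most of that reservation is never used, and the gaps between reservations fragment memory.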

The result: systems waste 60–80% of GPU memory, severely limiting how many requests can run concurrently.

The PagedAttention Solution

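PagedAttention borrows from classical OS concepts (virtual memory and paging) and applies them to KV cache management: the cache is split into fixed-size blocks, each sequence keeps a block table that maps its logical blocks to physical ones, blocks are allocated only as tokens arrive, and sequences with a common prefix can share physical blocks with copy-on-write.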

The memory waste drops from 60–80% to under 4% — essentially just the last (partially filled) block of each sequence. This is transformative for multi-user serving: vLLM can fit far more concurrent requests into the same GPU memory.

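# The core idea as a simplified Python sketch (not vLLM's actual implementation).
# Old way: allocate max_context_length * KV_size per request up front.
# PagedAttention way: hand out fixed-size blocks on demand.
class Sequence:
    def __init__(self):
        self.block_table = []  # logical block index -> physical block id

class KVCacheManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                # tokens per block
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.ref_counts = {}                        # block id -> sequences using it

    def allocate(self, sequence):
        # Allocate one block at a time, only when the sequence needs it
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        sequence.block_table.append(block)

    def share_prefix(self, seq1, seq2, shared_prefix_len):
        # Point both sequences at the same physical blocks (full blocks only)
        shared_blocks = seq1.block_table[:shared_prefix_len // self.block_size]
        seq2.block_table[:len(shared_blocks)] = shared_blocks
        for block in shared_blocks:
            self.ref_counts[block] += 1
        # Copy-on-write: a block is copied only when a sequence writes to one
        # whose reference count is greater than 1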
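The waste figures quoted above can be sanity-checked with simple arithmetic. This is a minimal sketch under stated assumptions: a 500-token sequence, a 2048-token maximum context for the naive reservation, and 16-token blocks for the paged case. The function names are illustrative.

```python
import math

def reserved_waste(tokens_used, max_context_length):
    """Fraction of a max-length, up-front reservation that is never used."""
    return 1 - tokens_used / max_context_length

def paged_waste(tokens_used, block_size=16):
    """Fraction of allocated block slots left empty (only the last block can be partial)."""
    allocated = math.ceil(tokens_used / block_size) * block_size
    return 1 - tokens_used / allocated

print(f"reserved: {reserved_waste(500, 2048):.0%}")  # reserved: 76%
print(f"paged:    {paged_waste(500):.1%}")           # paged:    2.3%
```

With paging, at most block_size - 1 slots are wasted per sequence regardless of how long the sequence gets, which is where the "under 4%" figure comes from.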

Continuous Batching

PagedAttention solves memory. Continuous batching solves throughput.

The Static Batching Problem

Traditional LLM serving used static batching: wait until you have N requests, batch them together, run inference, return results. The problem is that sequences in a batch have different lengths. Short sequences finish quickly — but the GPU sits idle waiting for the longest sequence to complete before it can start new work. GPU utilization craters.

Continuous (Iteration-Level) Batching

vLLM uses continuous batching (sometimes called iteration-level scheduling): at each forward pass (token generation step), the scheduler looks at all pending requests and adds newly arrived requests to the batch. Requests that have completed are immediately removed. There's no waiting for a "batch" to fill — the GPU stays busy doing useful work every single iteration.

This is a large part of why vLLM's throughput numbers look so much better than HuggingFace Transformers (which uses static batching by default): combined with PagedAttention, continuous batching can deliver 3–24× higher throughput depending on the request mix, largely by keeping the GPU occupied.
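The difference between the two scheduling styles shows up even in a toy simulation. The sketch below is not vLLM's scheduler: it assumes one step generates one token for every running sequence, ignores prefill and memory limits, and uses made-up output lengths.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Finished sequences leave after every step and waiting ones take their slots."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [5, 100, 8, 12, 90, 6, 7, 10]  # tokens to generate per request
print(static_batching_steps(lengths, batch_size=4))      # 190
print(continuous_batching_steps(lengths, batch_size=4))  # 100
```

In the static case the second batch cannot start until the 100-token request finishes; in the continuous case the 90-token request takes over a slot as soon as the 5-token request completes.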

Preemption and Swapping

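When GPU memory gets full, vLLM doesn't just drop requests. The scheduler can preempt some running sequences and resume them later, either by swapping their KV cache blocks out to CPU memory or by discarding the blocks and recomputing them when the sequence is rescheduled.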

Quantization Support

vLLM supports a comprehensive suite of quantization formats, making it viable across a wide range of hardware and latency/quality tradeoffs:

| Format | Bits | Speed | Quality Loss | Best For |
| --- | --- | --- | --- | --- |
| FP16/BF16 | 16-bit | Baseline | None | Max quality, A100/H100 |
| FP8 | 8-bit | 1.5–2× faster | Minimal | Production serving, H100/H200 |
| INT8 (W8A8) | 8-bit | 1.4–1.8× faster | Very low | Production, A100 |
| GPTQ | 4-bit | ~2× faster | Low–Medium | Consumer GPUs, memory constrained |
| AWQ | 4-bit | ~2× faster | Low | Best quality at 4-bit |
| QLoRA | 4-bit + LoRA | Good | Very low | Fine-tuned models (v0.17.0+) |
| SqueezeLLM | 4-bit sparse | ~2× faster | Very low | High accuracy 4-bit |

As of v0.17.0, vLLM can directly load quantized LoRA adapters (QLoRA) without requiring a separate dequantize-then-apply step, a significant workflow improvement for fine-tuned model serving.
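To see why bit width matters for fitting a model on a given GPU, weight storage can be estimated as parameters × bits ÷ 8. The sketch below is a rough approximation: it ignores quantization scales and zero-points, activations, and the KV cache, and the function name is illustrative.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bits_per_weight / 8

for bits in (16, 8, 4):
    print(f"13B model at {bits}-bit: {weight_memory_gb(13, bits):.1f} GB")
# 13B model at 16-bit: 26.0 GB
# 13B model at 8-bit: 13.0 GB
# 13B model at 4-bit: 6.5 GB
```

Every gigabyte saved on weights is a gigabyte that becomes available for KV cache, so quantization also raises the number of sequences that can run concurrently.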

v0.17.0 — What's New (March 2026)

⚡ vLLM v0.17.0 — March 7, 2026

699 commits · 272 contributors (48 new) · PyTorch 2.10

v0.17.0 is a major milestone release that advances vLLM from "great production serving engine" to "complete large-scale inference platform." Here are the headline features:

FlashAttention 4 Integration

vLLM now supports the FlashAttention 4 backend — the next generation of the attention optimization that changed how transformers run on GPUs.

FlashAttention's key insight is tiling: instead of materializing the full attention matrix (which is O(n²) in sequence length), it computes attention in tiles that fit in SRAM, avoiding slow HBM reads. FlashAttention 4 extends this with:

For long-context workloads (32K+ token sequences), FlashAttention 4 can deliver 1.5–2× speedup over FlashAttention 2 on modern NVIDIA hardware.

Hardware note: FlashAttention 4 benefits are most pronounced on H100/H200/Blackwell hardware. On A100 and older cards, FlashAttention 2 (already included) is the right choice.
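The tiling idea can be illustrated without any GPU code. The sketch below is a pure-Python toy for a single query vector, not FlashAttention's actual kernel: it processes keys and values one tile at a time, keeps a running maximum and running sum for the softmax, and checks that the result matches the version that materializes all scores at once. All names and data are illustrative.

```python
import math

def naive_attention(q, keys, values):
    """Softmax attention for one query, materializing every score at once."""
    inv_sqrt_d = 1 / math.sqrt(len(q))
    scores = [inv_sqrt_d * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, tile_size):
    """Same result, visiting keys/values one tile at a time with a running softmax."""
    inv_sqrt_d = 1 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [inv_sqrt_d * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 0.0],
        [0.7, 0.7, 0.7], [0.0, -0.5, 1.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # True
```

Because only one tile of scores exists at any moment, the working set stays small enough for fast on-chip memory no matter how long the sequence is.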

Model Runner V2 Maturation

Model Runner V2 is vLLM's next-generation execution engine, and v0.17.0 marks its major maturity milestone. Key capabilities added:

The --performance-mode Flag

One of the most user-facing changes in v0.17.0: a new --performance-mode flag that simplifies the previously opaque performance tuning process.

vllm serve meta-llama/Llama-3-8B \
  --performance-mode throughput   # maximize requests/sec

vllm serve meta-llama/Llama-3-8B \
  --performance-mode interactivity  # minimize time-to-first-token

vllm serve meta-llama/Llama-3-8B \
  --performance-mode balanced  # default: balance both

| Mode | Optimizes For | Best Use Case |
| --- | --- | --- |
| throughput | Max requests/second, GPU utilization | Batch processing, offline inference, high-volume APIs |
| interactivity | Time-to-first-token (TTFT), latency | Chat interfaces, real-time applications, low-latency requirements |
| balanced | Both TTFT and throughput | General-purpose serving, mixed workloads |

Previously, achieving these different modes required manual tuning of --max-num-batched-tokens, --max-num-seqs, --gpu-memory-utilization, and other low-level parameters. The new flag abstracts this into a single, declarative choice.

Weight Offloading V2 with Prefetching

For deployments where the model doesn't fit entirely in GPU memory, Weight Offloading V2 introduces prefetching — the CPU starts loading the next layer's weights into GPU memory while the GPU is still computing the current layer.

This dramatically reduces the "I/O stall" that makes weight offloading painful in practice. Additional improvements:

The practical impact: you can now run 70B+ parameter models on consumer hardware with 24 GB of VRAM (such as an RTX 3090) and get usable throughput, not just a setup that technically works.
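The benefit of prefetching can be estimated with an idealized timing model. The sketch below assumes every layer has the same transfer and compute time and that exactly one layer is prefetched ahead; the numbers are placeholders, not measurements of vLLM.

```python
def offload_time_no_prefetch(num_layers, transfer_ms, compute_ms):
    """Each layer is transferred, then computed, with no overlap."""
    return num_layers * (transfer_ms + compute_ms)

def offload_time_with_prefetch(num_layers, transfer_ms, compute_ms):
    """The first transfer cannot be hidden; every later transfer overlaps
    with the previous layer's compute, so each stage costs the slower of the two."""
    return transfer_ms + (num_layers - 1) * max(transfer_ms, compute_ms) + compute_ms

layers, transfer_ms, compute_ms = 80, 3, 2  # hypothetical per-layer costs
print(offload_time_no_prefetch(layers, transfer_ms, compute_ms))    # 400
print(offload_time_with_prefetch(layers, transfer_ms, compute_ms))  # 242
```

When transfers are slower than compute, the pipeline is still bound by the interconnect, but the compute time is effectively hidden instead of added on top.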

Elastic Expert Parallelism (Milestone 2)

Mixture-of-Experts (MoE) models like Mixtral, DeepSeek, and the new Qwen3.5 family activate only a fraction of their parameters per token. This creates an imbalance: different GPUs may be doing wildly different amounts of work depending on which experts are activated.

Elastic Expert Parallelism allows dynamic GPU scaling for MoE models: experts can be redistributed across GPUs at runtime based on activation frequency and load. If expert 3 is getting hammered with 80% of the tokens while expert 7 is idle, the system can move compute resources accordingly.

Milestone 2 in v0.17.0 adds the initial infrastructure for this capability. Full dynamic rebalancing is on the roadmap.
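To make the imbalance problem concrete, the sketch below compares a static expert layout with a load-aware one. It is an illustration only, not vLLM's rebalancing algorithm: the per-expert loads are made up, the static layout assumes the expert count divides evenly across GPUs, and the greedy heuristic ignores memory limits and migration cost.

```python
def contiguous_placement(expert_loads, num_gpus):
    """Static layout: experts split into equal contiguous groups, one group per GPU."""
    per_gpu = len(expert_loads) // num_gpus
    return [sum(expert_loads[g * per_gpu:(g + 1) * per_gpu]) for g in range(num_gpus)]

def greedy_placement(expert_loads, num_gpus):
    """Load-aware layout: place the busiest experts first, each on the least-loaded GPU."""
    gpu_loads = [0] * num_gpus
    for load in sorted(expert_loads, reverse=True):
        target = min(range(num_gpus), key=lambda g: gpu_loads[g])
        gpu_loads[target] += load
    return gpu_loads

loads = [40, 30, 10, 5, 5, 4, 3, 3]  # hypothetical share of tokens per expert (%)
print(contiguous_placement(loads, 2))  # [85, 15]
print(greedy_placement(loads, 2))      # [49, 51]
```

Since each step finishes only when the busiest GPU finishes, cutting the maximum per-GPU load from 85 to 51 is what translates into higher throughput.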

Qwen3.5 + GDN (Gated Delta Networks)

v0.17.0 adds full support for the Qwen3.5 model family, which introduces Gated Delta Networks (GDN) — a hybrid architecture combining attention layers with recurrent-style delta update layers for more efficient long-context processing.

vLLM's Qwen3.5 support includes:

Use Cases

Enterprise AI APIs

vLLM's OpenAI-compatible API means most applications built on the OpenAI SDK work after changing only the base URL. Companies like Anyscale, Replicate, and Together.ai run vLLM as their inference backbone. The continuous batching architecture means you can handle thousands of concurrent API requests without adding GPUs in proportion to the request count.

# Drop-in OpenAI API replacement
from openai import OpenAI

client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-required"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Local Multi-User Serving

For teams running their own inference infrastructure — internal tools, developer portals, or privacy-sensitive deployments — vLLM is the right choice when you have more than a handful of concurrent users. A single A100 80GB running vLLM can handle 50–200+ concurrent users on a 13B model depending on context length.
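A rough capacity estimate shows where a range like 50–200+ comes from. The sketch below assumes a 13B model in FP16 (about 26 GB of weights and about 800 KiB of KV cache per token, from 2 × 40 layers × 5120 hidden size × 2 bytes), 90% of an 80 GB GPU available to the engine, and every sequence holding the same number of tokens. It ignores activation memory and other overheads, so treat the output as an upper bound.

```python
def max_concurrent_sequences(tokens_per_sequence, gpu_memory_gb=80,
                             memory_utilization=0.9, weight_gb=26,
                             kv_bytes_per_token=819_200):
    """How many sequences fit in the memory left over after loading the weights."""
    kv_budget_bytes = (gpu_memory_gb * memory_utilization - weight_gb) * 1e9
    return int(kv_budget_bytes // (kv_bytes_per_token * tokens_per_sequence))

for tokens in (256, 512, 2048):
    print(f"{tokens} tokens per sequence: {max_concurrent_sequences(tokens)} sequences")
```

Short conversations fit in the hundreds, while sequences that fill a 2048-token context fit only in the tens, which is why the concurrency figure depends so heavily on context length.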

High-Context Research Workloads

With FlashAttention 4 and long-context model support (Qwen2.5-1M, Llama 3.1 128K, etc.), vLLM handles document-level tasks that would OOM naive serving setups: contract analysis, codebase summarization, multi-document research.

ASR (Automatic Speech Recognition)

v0.17.0 expanded vLLM beyond text-to-text. It now supports ASR models including FunASR, FireRedASR2, and Qwen3-ASR with realtime streaming — turning vLLM into a unified multimodal inference server for both text and audio workloads.

Multi-Model Serving

vLLM supports LoRA adapter hot-swapping — you can serve a single base model with multiple fine-tuned adapters loaded dynamically. This is transformative for organizations running dozens of task-specific models: instead of running 20 separate inference servers, you run one vLLM instance with 20 LoRA adapters loaded on demand.

vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-lora \
  --lora-modules \
    customer-service=/path/to/cs-lora \
    code-review=/path/to/code-lora \
    summarizer=/path/to/sum-lora \
  --max-loras 4  # how many to keep loaded simultaneously

Competitor Comparison

vLLM doesn't exist in a vacuum. Here's how it stacks up against every major alternative and when you'd choose each:

| Engine | Best At | Weaknesses | When to Use |
| --- | --- | --- | --- |
| vLLM | Multi-user throughput, broad model support, active ecosystem | Complex setup, NVIDIA-focused, not edge-friendly | Production serving, 5+ concurrent users, API backends |
| TensorRT-LLM | Single-GPU peak performance on NVIDIA hardware | Complex engine compilation, NVIDIA-only, limited model support | Maximum latency optimization on supported NVIDIA GPUs |
| Ollama | Ease of use, single-user local serving, macOS/Apple Silicon | Low throughput under concurrency, less GPU optimization | Personal use, developer testing, Mac hardware |
| llama.cpp | CPU inference, edge deployment, GGUF format ecosystem | Lower GPU throughput vs dedicated engines | No GPU, ARM devices, Raspberry Pi, Jetson |
| HuggingFace TGI | HuggingFace ecosystem integration, managed inference API | Lower throughput than vLLM, less flexible | Already deep in HuggingFace infra, managed cloud serving |
| DeepSpeed Inference | Very large models (100B+), ZeRO memory optimization | Complex setup, Microsoft-centric, lower throughput for smaller models | 175B+ models, multi-node inference, Microsoft Azure |
| SGLang | Structured generation (JSON/grammars), RadixAttention for shared prefixes | Smaller community, fewer models | Agentic workloads, constrained output, prompt reuse |

Performance Numbers in Context

Raw throughput comparisons from various benchmarks (LLaMA-class models, NVIDIA A100 80GB, mixed request lengths):

The nuance: TensorRT-LLM can beat vLLM's peak throughput on specific NVIDIA GPUs with specific models — but requires a multi-hour engine compilation step and fails with unsupported model architectures. vLLM runs nearly any HuggingFace model out of the box.

Strengths and Weaknesses

✅ Strengths

  • Best throughput for multi-user concurrent serving
  • Near-universal model support (most HuggingFace model architectures)
  • Drop-in OpenAI API compatibility
  • Active development (272 contributors, monthly releases)
  • Broad hardware support (NVIDIA, AMD ROCm, Intel Gaudi, CPU)
  • Production-tested (Anyscale, Together.ai, Replicate)
  • LoRA adapter hot-swapping for multi-model serving
  • Comprehensive quantization support (FP8, GPTQ, AWQ, QLoRA)
  • FlashAttention 4 for long-context performance
  • Speculative decoding to reduce latency
  • Multi-modal (text + vision + audio in one server)

⚠️ Weaknesses

  • Complex setup vs Ollama (Linux, CUDA, Python environment)
  • Overkill for single-user local use
  • TensorRT-LLM beats it on peak single-GPU latency (specific setups)
  • No native macOS/Apple Silicon support
  • Heavier resource footprint than llama.cpp
  • Edge/embedded deployment not a design goal
  • Documentation can lag behind fast release cadence

When NOT to Use vLLM

vLLM is not the right tool for every job. Be clear-eyed about its limitations:

Single-User Local Inference

If you're running models for personal use (querying a model yourself, one request at a time), Ollama is the right tool. vLLM's continuous batching provides little benefit when you're the only user. You'll get comparable single-request performance from Ollama with a fraction of the setup complexity.

Apple Silicon / macOS

vLLM primarily targets CUDA (NVIDIA) and ROCm (AMD) GPUs on Linux. If your hardware is a Mac with Apple Silicon, use Ollama (which runs on Apple's Metal backend) or llama.cpp with Metal support. Apple's unified memory architecture means a Mac Studio with an M3 Ultra can run 70B+ models at speeds that are comfortable for personal use.

Edge / IoT Deployment

Raspberry Pi? Jetson Nano? Embedded systems? llama.cpp is your answer. It runs on anything from a $35 Pi to a $500 Jetson with C++ efficiency, no Python runtime required.

Peak Latency on Specific NVIDIA GPUs

If you have a single H100/H200 and need to minimize time-to-first-token above all else for a specific model, TensorRT-LLM with a compiled engine can beat vLLM by 15–25%. The tradeoff: hours of engine compilation, NVIDIA-only, no flexibility to change models without recompiling.

Relevance to Our Local AI Stack

We currently run Ollama on a 4× RTX 3090 GPU rig as our local inference backend. Here's an honest assessment of when and whether we'd switch to vLLM.

What We're Running (Ollama Setup)

When Ollama Is the Right Answer (Now)

For our current workload — single-agent orchestration, low concurrency, ease of model management — Ollama wins on simplicity. ollama pull qwen3.5:35b versus setting up a Python environment, CUDA libraries, and a vLLM configuration is a real difference. When you're the primary user, that setup cost doesn't amortize.

When We'd Switch to vLLM

Switch threshold: If we're serving more than ~5 concurrent users, building an internal AI API, or need to serve multiple LoRA-adapted models simultaneously — vLLM becomes worth the setup complexity.

Specific scenarios where we'd move to vLLM:

The Migration Path

The good news: vLLM's OpenAI-compatible API means migration from Ollama is mostly a one-line change in your client config. The serving endpoint changes; your code doesn't.

# Ollama setup
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# vLLM setup (same client code, different URL)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# Starting vLLM (4-GPU tensor parallel)
vllm serve Qwen/Qwen3.5-32B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --performance-mode balanced \
  --port 8000

Verdict

🎯 Bottom Line

vLLM is the production standard for LLM inference serving. If you're serving LLMs to multiple users, building an AI API, or running enterprise workloads — vLLM is almost certainly the right choice. It has the best throughput-to-complexity ratio of any open-source inference engine at scale.

For personal use and developer experimentation, Ollama wins on ease. For peak single-GPU NVIDIA performance, TensorRT-LLM can edge it out. For CPU/edge/ARM, llama.cpp is the answer. But for anything in between — production GPU serving with real concurrent load — vLLM has no serious competition in the open-source world.

v0.17.0 specifically is a generational leap: FlashAttention 4, Model Runner V2 with pipeline parallelism, the performance mode flag, and Elastic Expert Parallelism together make vLLM the first open-source inference engine that can realistically compete with managed cloud inference for enterprise-scale MoE model serving.

Quick Decision Guide

| Your Situation | Recommendation |
| --- | --- |
| Personal local AI on Mac | Ollama |
| Personal local AI on Linux GPU rig | Ollama (for simplicity) or vLLM if you want more throughput |
| 5+ concurrent users, production API | vLLM |
| Minimum latency, specific NVIDIA GPU + model | TensorRT-LLM |
| CPU-only or edge device | llama.cpp |
| 175B+ model, multi-node | DeepSpeed Inference or vLLM with pipeline parallelism |
| Multi-LoRA serving (one base, many adapters) | vLLM |
| Structured JSON output / agentic workflows | vLLM (native JSON mode) or SGLang |

Published March 7, 2026. Based on vLLM v0.17.0 release notes and community benchmarks. Performance figures are approximate and hardware-dependent.