📺 Watch the video version: ThinkSmart.Life/youtube
Who this is for: Software engineers who understand arrays and data structures but want to make sense of tensor math, transformers, and why KV cache is the central bottleneck in LLM inference. No ML background required.

What Is a Tensor?

A tensor is a multi-dimensional array of numbers. That's it. If you've ever worked with arrays, matrices, or NumPy, you already understand tensors; you just might not have called them that.

The word "tensor" comes from physics and differential geometry, where it has a more precise meaning involving coordinate transformations. In machine learning, the term is used more loosely to mean: any N-dimensional array of numbers that you can do math on.

Name | Dimensions | Example | In Code
Scalar | 0-D tensor | The number 42 | torch.tensor(42)
Vector | 1-D tensor | [1, 2, 3, 4] | torch.tensor([1, 2, 3, 4])
Matrix | 2-D tensor | A spreadsheet of numbers | torch.zeros(4, 8)
3-D tensor | 3-D tensor | A stack of matrices (like an RGB image) | torch.zeros(3, 224, 224)
N-D tensor | N-D tensor | Batch of sequences of token embeddings | torch.zeros(B, T, D)

Every number in a tensor is called an element. Every tensor has a shape: a tuple describing how many elements exist along each dimension. A tensor with shape (4, 8) has 4 rows and 8 columns, for 32 total elements.
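As a quick check of the shape arithmetic, here is a minimal sketch using only the standard library. The helper name `num_elements` is made up for illustration and is not a PyTorch function.

```python
import math

def num_elements(shape):
    """Total number of elements in a tensor with the given shape."""
    return math.prod(shape)

print(num_elements((4, 8)))         # 32
print(num_elements((3, 224, 224)))  # 150528
print(num_elements(()))             # 1, because a scalar has an empty shape
```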

The key insight: All the "intelligence" in a neural network lives in tensors, billions of numbers arranged in multi-dimensional arrays and combined through matrix multiplication. There's no magic, just a lot of organized arithmetic.

Tensors in Code

In PyTorch (the dominant ML framework), tensors look and feel like NumPy arrays:

import torch

# A 2-D tensor (matrix), shape (3, 4)
x = torch.tensor([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])
print(x.shape)   # torch.Size([3, 4])
print(x.dtype)   # torch.float32
print(x.device)  # cpu  (or "cuda:0" if on GPU)

# Matrix multiply: (3,4) × (4,2) → (3,2)
y = torch.zeros(4, 2)
result = x @ y   # @ is matrix multiply in Python
print(result.shape)  # torch.Size([3, 2])

Two properties you'll see constantly in ML code:

Shape and Dtype: The Two Things That Matter Most

Shape

Shape tells you what operations you can do with a tensor and, together with dtype, exactly how much memory it needs. LLM code is full of shapes like:

Common tensor shapes in LLM inference:
(B, T, D): batch × sequence_length × hidden_dim
(B, H, T, T): batch × num_heads × seq × seq (attention scores)
(B, H, T, Dh): batch × num_heads × seq × head_dim (KV cache!)

Dtype: The Precision Tradeoff

Every number in a tensor takes up a fixed amount of memory depending on its dtype:

Dtype | Bytes/element | Range | Used For
float32 (FP32) | 4 bytes | ±3.4 × 10³⁸ | Training (full precision)
bfloat16 (BF16) | 2 bytes | Same range as FP32, less precision | Training + inference (modern default)
float16 (FP16) | 2 bytes | Smaller range than FP32 | Inference (older GPUs)
int8 | 1 byte | -128 to 127 | Quantized inference (W8A8)
float8 (FP8) | 1 byte | Very limited | Quantized inference on H100+

This is why quantization matters: switching a model from BF16 to INT8 cuts memory usage in half, and switching to INT4 cuts it by 75%. (Starting from FP32, INT8 is already a 75% reduction.) The model weights, and crucially the KV cache, are all stored in these dtypes.
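To make the byte counts concrete, the sketch below computes the memory for one tensor under different dtypes. The dictionary and the `tensor_bytes` helper are illustrative names, and the 4096 × 4096 shape is just an example weight matrix. INT4 is counted as half a byte, assuming two 4-bit values are packed per byte.

```python
import math

# Bytes per element for the dtypes discussed above.
# "int4" = 0.5 assumes two 4-bit values are packed into each byte.
BYTES_PER_ELEMENT = {
    "float32": 4, "bfloat16": 2, "float16": 2,
    "int8": 1, "float8": 1, "int4": 0.5,
}

def tensor_bytes(shape, dtype):
    """Memory needed for a tensor of this shape and dtype, in bytes."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]

shape = (4096, 4096)  # one square weight matrix, chosen for illustration
for dtype in ("float32", "bfloat16", "int8", "int4"):
    mib = tensor_bytes(shape, dtype) / 2**20
    print(f"{dtype}: {mib:.0f} MiB")
```

For this shape the results are 64, 32, 16, and 8 MiB: each halving of the element size halves the footprint.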

Tensors in Neural Networks

A neural network is a sequence of tensor transformations. Data flows in as a tensor, gets multiplied by weight tensors (the "parameters" that were learned during training), and produces output tensors.

Input Tensor → × Weight Matrix → Activation Function → Output Tensor
Repeat this pattern hundreds of times → you have a transformer

The weight tensors are what get stored on your GPU when you "load a model." A 7B parameter model has 7 billion numbers in its weight tensors, all stored in VRAM. At BF16 (2 bytes each), that's 14 GB just for the weights.

Mental model: Model loading = copying weight tensors from CPU RAM to GPU VRAM. Running inference = performing tensor math (mostly matrix multiplications) on those weights.
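The input → weight → activation pattern from this section can be sketched in a few lines of plain Python. This is a toy single layer with made-up weights and a ReLU activation (chosen here for simplicity; real transformers use other activations), not how a framework implements it.

```python
def matmul(a, b):
    """Multiply matrix a (n×k) by matrix b (k×m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise activation: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in m]

x = [[1.0, -2.0, 3.0]]      # input tensor, shape (1, 3)
w = [[0.5, -1.0],           # weight tensor, shape (3, 2); values are made up
     [0.25, 0.5],
     [-1.0, 1.0]]

hidden = relu(matmul(x, w))  # output tensor, shape (1, 2)
print(hidden)                # [[0.0, 1.0]]
```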

The Transformer Attention Mechanism

Transformers are the architecture behind every modern LLM: GPT, Llama, Qwen, Mistral, all of them. The defining feature of a transformer is the self-attention mechanism. This is where K and V tensors come from.

The fundamental question attention answers: for each token in the sequence, how much should it "pay attention to" every other token?

Example: in the sentence "The animal didn't cross the street because it was too tired," the word "it" needs to look back and find that "animal" is what it refers to. Attention computes this relationship mathematically.

How Attention Works (Step by Step)

For each layer of the transformer, every input token gets projected into three tensors through learned weight matrices:

# Each token's embedding (shape: [hidden_dim]) gets projected to:
Q = token_embedding @ W_Q   # Query:  "What am I looking for?"
K = token_embedding @ W_K   # Key:    "What do I contain?"
V = token_embedding @ W_V   # Value:  "What do I output if attended to?"

# W_Q, W_K, W_V are learned weight matrices, shape [D, Dh]
# Q, K, V each have shape [seq_len, head_dim] per attention head

Then attention scores are computed by asking: how much does each Query match each Key?

# Attention scores: how much does token i attend to token j?
# (assumes `import math` and `import torch`, with Q and K from the projections above)
scores = Q @ K.transpose(-1, -2)          # shape: [seq_len, seq_len]
scores = scores / math.sqrt(head_dim)     # scale to prevent huge values
weights = torch.softmax(scores, dim=-1)   # each row becomes probabilities that sum to 1

# Weighted sum of Values
output = weights @ V               # shape: [seq_len, head_dim]
In plain English: Queries ask questions. Keys advertise their content. Values carry the actual information. Attention computes how much each query matches each key, then returns a weighted mix of the values. In this simplified form every token can look at every other token; decoder-only LLMs add a causal mask so that each token only sees itself and earlier tokens.
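The steps above can be run end to end on tiny inputs. The following is a dependency-free sketch of single-head scaled dot-product attention; the `attention` and `softmax` helpers and the toy Q, K, V values are illustrative, and the `causal` flag shows the masking used by decoder-only models.

```python
import math

def softmax(row):
    m = max(row)                               # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=True):
    """Single-head scaled dot-product attention on lists of row vectors."""
    head_dim = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Score of query i against every key, scaled by sqrt(head_dim)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in K]
        if causal:
            # Token i may only attend to tokens 0..i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = attention(Q, K, V)
print(out[0])   # [1.0, 2.0]: the first token sees only itself, so it gets V[0] back
```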

Q, K, V โ€” What They Actually Are

Let's be concrete about what these tensors look like for a real model.

Take Llama 3 8B: 32 layers, 32 query heads, 8 KV heads, head dimension 128, hidden dimension 4096.

For a single request with a 1,000-token input sequence:

Per layer, per attention head, per request:
Q shape: [1000, 128] → 1000 tokens × 128 dims = 128,000 numbers
K shape: [1000, 128] → same
V shape: [1000, 128] → same

For the full model (8 KV heads × 32 layers, since Llama 3 8B uses grouped-query attention):
KV total: 2 × [1000, 8, 128] × 32 layers × 2 bytes (BF16)
         = ~131 MB just for one 1K-token request

Notice something critical: Q is not cached. Queries are computed fresh for every new token. But K and V are different: every time the model generates a new token, it needs to attend to all previous tokens, which means it needs their K and V tensors. If you recomputed them from scratch every step, you'd redo O(T²) work. So you save them, and that's the KV cache.

Why the KV Cache Exists

LLMs generate text one token at a time. At each step, the model needs to run the full attention computation over all previous tokens. Without caching:

Without KV Cache (naive):
Generate token 1 → Compute K,V for all 1 tokens
Generate token 2 → Recompute K,V for all 2 tokens
Generate token N → Recompute K,V for all N tokens
Cost: O(T²), quadratic in sequence length

With KV Cache:
Generate token 1 → Compute K,V for token 1 → Save to cache
Generate token 2 → Compute K,V only for token 2 → Append to cache
Generate token N → Compute K,V only for token N → Read all previous K,V from cache
Cost: O(T), linear in sequence length

The KV cache is a performance optimization that trades memory for compute. Instead of recomputing key and value tensors for every previous token at every generation step, you save them once and read them back. This is the single most important optimization in LLM inference: without it, generating a 1,000-token response from a 1,000-token input would mean computing K and V for about 1.5 million token positions instead of 2,000, roughly 750 times more projection work.
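A simple counting model shows where the savings come from. The function below is a hypothetical helper that only counts how many per-token K/V projections are performed; it assumes each step sees the prompt plus the tokens generated so far, and it ignores the attention computation itself.

```python
def kv_computations(prompt_len, new_tokens, use_cache):
    """Count per-token K/V projections needed across one generation."""
    total = 0
    for step in range(1, new_tokens + 1):
        context = prompt_len + step          # tokens visible at this step
        if use_cache:
            # The first step processes the whole context; later steps add one token.
            total += context if step == 1 else 1
        else:
            total += context                 # recompute everything at every step
    return total

with_cache = kv_computations(1000, 1000, use_cache=True)
without_cache = kv_computations(1000, 1000, use_cache=False)
print(with_cache, without_cache)             # 2000 1500500
print(round(without_cache / with_cache))     # 750
```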

The Memory Problem

Here's the catch: the KV cache must live in GPU VRAM for fast access. And it's enormous.

Three specific problems make it hard to manage:

1. Size Uncertainty

When a request arrives, you don't know how long the response will be. Will it generate 10 tokens? 2,000? The KV cache needs to grow as generation proceeds, but traditional GPU memory allocators work best when you can reserve a fixed block upfront. Systems would pre-allocate the maximum possible context length just to be safe, wasting huge amounts of VRAM for short requests.

Example: if max_context = 8,192 tokens, every request reserves memory for 8,192 tokens of KV cache, even if it only generates 50 tokens. For Llama 3 8B with its 8 KV heads at BF16, that's about 1 GB reserved per request, whether you use it or not.
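The cost of over-reservation can be estimated with a few lines of arithmetic. This sketch assumes the Llama 3 8B configuration (32 layers, 8 KV heads, head_dim 128, BF16) and a hypothetical request that ends up using 250 tokens in total; `reservation_waste` is an illustrative helper.

```python
# K and V × layers × KV heads × head_dim × bytes per BF16 element
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # 131072 bytes = 128 KiB per token

def reservation_waste(max_context, tokens_used):
    """Return (reserved bytes, used bytes, wasted fraction) for one request."""
    reserved = max_context * BYTES_PER_TOKEN
    used = tokens_used * BYTES_PER_TOKEN
    return reserved, used, 1 - used / reserved

reserved, used, waste = reservation_waste(max_context=8192, tokens_used=250)
print(f"reserved {reserved / 2**20:.0f} MiB, "
      f"used {used / 2**20:.2f} MiB, wasted {waste:.1%}")
```

Under these assumptions the request reserves 1024 MiB and uses about 31 MiB, so roughly 97% of its reservation sits idle.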

2. Contiguous Allocation

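Traditional GPU memory allocators (like cudaMalloc) return a single contiguous block of memory for each allocation. KV cache for a sequence grows over time: each new token extends the tensor by one row. In a contiguous allocator, this means you must either reserve the maximum size up front, or allocate a larger block and copy the existing cache into it whenever the sequence outgrows its space.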

Additionally, as requests of different lengths complete and free their memory, you get fragmentation: small gaps between allocations that individually can't fit new requests. The GPU might show 4 GB "free" but no single contiguous block larger than 500 MB, meaning you can't actually fit a new request.
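Fragmentation is easy to reproduce in a toy model. The sketch below treats GPU memory as a linear address space in MB and lists the free gaps between live allocations; the device size and allocation offsets are invented for illustration.

```python
def free_gaps(total, allocations):
    """Sizes of the free gaps, given (offset, size) allocations in a linear address space."""
    gaps, cursor = [], 0
    for offset, size in sorted(allocations):
        if offset > cursor:
            gaps.append(offset - cursor)     # free space before this allocation
        cursor = max(cursor, offset + size)
    if cursor < total:
        gaps.append(total - cursor)          # free space at the end
    return gaps

# Hypothetical 16,000 MB device with live allocations scattered across it
allocations = [(500, 3500), (4500, 3000), (8000, 3500), (12000, 3500)]
gaps = free_gaps(16000, allocations)
print(sum(gaps), max(gaps))   # 2500 500
```

Here 2,500 MB is free in total, yet a request that needs a contiguous 1,000 MB block cannot be placed.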

3. No Sharing

In many API deployments, every request starts with the same system prompt: a set of instructions like "You are a helpful assistant..." or a company's specific instructions. The K and V tensors for that system prompt are identical across all requests, but naive systems allocate separate copies for each. If 100 requests are being served simultaneously and all share a 500-token system prompt, you're storing 100 identical copies of those KV tensors when you only need one.

How Big Is the KV Cache, Really?

Let's calculate this concretely for a few real models at BF16 (2 bytes per element):

# KV cache size formula:
# bytes = 2 × num_layers × num_kv_heads × seq_len × head_dim × bytes_per_element
#
# The "2×" accounts for both K and V tensors

def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * dtype_bytes

# Llama 3 8B: 32 layers, 8 KV heads (GQA!), head_dim=128
llama_8b = kv_cache_bytes(32, 8, 8192, 128)   # ~1.07 GB per request at 8K ctx

# Llama 3 70B: 80 layers, 8 KV heads, head_dim=128
llama_70b = kv_cache_bytes(80, 8, 8192, 128)  # ~2.68 GB per request at 8K ctx

# Qwen2.5 72B: 80 layers, 8 KV heads, head_dim=128
qwen_72b = kv_cache_bytes(80, 8, 131072, 128) # ~42.9 GB per request at 128K ctx

In summary:
~1.07 GB: Llama 3 8B · 8K context · one request
~2.68 GB: Llama 3 70B · 8K context · one request
~42.9 GB: Qwen2.5 72B · 128K context · one request
60–80%: GPU memory wasted in naive systems

Note on GQA (Grouped Query Attention): Modern models use fewer KV heads than Q heads. Llama 3 has 32 query heads but only 8 KV heads. This is Grouped Query Attention (GQA), specifically designed to reduce KV cache size. Each KV head is shared across 4 Q heads. Without GQA, the KV cache would be 4× larger.
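The head sharing can be pictured as a simple index mapping. The sketch below assumes consecutive query heads are grouped onto the same KV head, which is a common convention but an assumption here; `kv_head_for` is an illustrative name.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Map a query head index to the KV head it shares under GQA."""
    group_size = num_q_heads // num_kv_heads   # 32 // 8 = 4 for Llama 3 8B
    return q_head // group_size

mapping = [kv_head_for(h, 32, 8) for h in range(32)]
print(mapping[:8])         # [0, 0, 0, 0, 1, 1, 1, 1]
print(len(set(mapping)))   # 8 distinct KV heads serve all 32 query heads
```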

Why This Shapes Everything in LLM Serving

Understanding tensors and the KV cache explains design decisions across the entire LLM inference stack:

PagedAttention (vLLM)

vLLM's breakthrough was applying OS paging to KV cache memory. Instead of one contiguous block per request, KV cache is split into fixed-size pages (blocks of 16 tokens each). Pages don't need to be contiguous in physical GPU memory: a logical block table maps them, just like a CPU's page table maps virtual to physical addresses. This eliminates fragmentation and over-reservation, dropping memory waste from 60–80% to under 4%.
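The block-table idea can be illustrated with a toy allocator. This is a simplified sketch of the paging concept, not vLLM's actual implementation; the class and method names are invented, and blocks here are just integer ids rather than GPU memory.

```python
BLOCK_SIZE = 16  # tokens per block, matching the figure quoted above

class BlockAllocator:
    """Toy paged KV allocator: each request gets blocks on demand."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block ids
        self.tables = {}                      # request id -> list of physical block ids
        self.lengths = {}                     # request id -> tokens stored so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:               # no block yet, or the current one is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = BlockAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("a")
for _ in range(5):
    alloc.append_token("b")
for _ in range(15):
    alloc.append_token("a")

print(alloc.tables)   # {'a': [7, 6, 4], 'b': [5]}
```

Request "a" ends up with non-adjacent physical blocks (7, 6, 4), and each request wastes at most one partially filled block.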

Quantization

Every quantization decision affects the KV cache directly. Dropping from BF16 to INT8 cuts KV cache memory in half, letting you hold roughly twice as many cached tokens (and so more concurrent requests) on the same hardware. An FP8 KV cache gives the same halving relative to BF16, since FP8 is also 1 byte per element. This is why FP8 inference on H100s is so impactful: it's not just the weights, it's the KV cache too.
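To show what storing values in INT8 involves, here is a minimal sketch of symmetric per-tensor quantization. It is one common scheme, used here as an assumption; serving engines may use different scaling granularities or formats, and the function names are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (ints, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale for all-zero input
    scale = max_abs / 127
    ints = [max(-127, min(127, round(v / scale))) for v in values]
    return ints, scale

def dequantize(ints, scale):
    """Recover approximate float values from the stored integers."""
    return [i * scale for i in ints]

vals = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(vals)
restored = dequantize(q, scale)
print(q)          # [50, -127, 0, 100]
print(restored)   # close to the original values, within half a quantization step
```

Each value now needs 1 byte instead of 2, at the cost of a small rounding error bounded by half the scale.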

Context Length Scaling

KV cache grows linearly with context length, but the attention computation is quadratic. Doubling context length doubles KV cache memory but quadruples attention compute over the full sequence. This is why long-context models (Llama 3.1 at 128K tokens, Qwen2.5-1M at 1M tokens) are significantly harder to serve than 8K-context models: the weights stay the same size, but the memory required per request balloons.
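The linear versus quadratic growth can be checked directly. This small sketch compares growth factors only; `relative_cost` is a hypothetical helper and ignores constant factors such as layer and head counts.

```python
def relative_cost(base_len, new_len):
    """Growth factors when context goes from base_len to new_len tokens."""
    kv_growth = new_len / base_len                   # KV cache: one row per token
    attn_growth = (new_len ** 2) / (base_len ** 2)   # score matrix: tokens × tokens
    return kv_growth, attn_growth

print(relative_cost(8192, 16384))    # (2.0, 4.0)
print(relative_cost(8192, 131072))   # (16.0, 256.0)
```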

Speculative Decoding

A technique where a small "draft" model predicts several tokens ahead, and the big model verifies them all at once in a single forward pass. This works because drafting is cheap (small model, small KV cache) and verification is batched. The speedup comes from accepting multiple tokens per expensive forward pass of the big model instead of one token per pass.
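The accept-or-reject logic can be sketched with toy "models" that are plain functions. This is a greedy-verification variant chosen for simplicity, with invented names; in a real engine the target model scores all proposed positions in one batched forward pass, which the loop below only imitates.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Propose k tokens with the draft model; keep the longest prefix the target agrees with."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # the target's token replaces the first mismatch
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy models: the target counts up by 1; the draft does too but stumbles on multiples of 5
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if (ctx[-1] + 1) % 5 else 0

print(speculative_step([1], draft, target))   # [2, 3, 4, 5]
```

Three draft tokens are accepted and the fourth is corrected, so one verification step yields four tokens.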

Prefix Caching

If the system prompt is always the same, compute its K and V tensors once, save them in a shared prefix cache, and reuse that cache for every new request. This is the "no sharing" problem solved. vLLM implements this by letting requests share KV cache blocks through PagedAttention's block tables.
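The reuse pattern is essentially memoization keyed on the prompt tokens. The sketch below is a simplified illustration with invented names, not how any particular engine implements it; `fake_compute_kv` stands in for the real K/V projection.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: compute KV for a token prefix once, then reuse it."""

    def __init__(self, compute_kv):
        self.compute_kv = compute_kv   # function: tuple of tokens -> KV (stand-in)
        self.store = {}
        self.hits = 0

    def get(self, prefix_tokens):
        key = hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.compute_kv(tuple(prefix_tokens))
        return self.store[key]

calls = []

def fake_compute_kv(tokens):
    calls.append(tokens)                          # record each real computation
    return [("k", t, "v", t) for t in tokens]     # placeholder for real K/V tensors

cache = PrefixCache(fake_compute_kv)
system_prompt = [101, 7, 42, 9]                   # made-up token ids
for _ in range(100):
    kv = cache.get(system_prompt)

print(len(calls), cache.hits)   # 1 99
```

One hundred requests trigger a single computation and a single stored copy.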

Mental Model Summary

🧠 The Complete Picture

A tensor is just a multi-dimensional array of numbers. Shape tells you its dimensions; dtype tells you how many bytes each number takes.

Transformer attention uses three tensors per token per layer: Query (what am I looking for?), Key (what do I contain?), Value (what information do I carry?). Q is ephemeral. K and V must be saved.

The KV cache is those saved K and V tensors, stored in GPU VRAM so generation doesn't recompute them. It's the performance foundation of all LLM inference.

The KV cache is also the primary bottleneck: it's large (hundreds of MB to tens of GB per request), grows over time, and naive systems waste 60–80% of it through fragmentation and over-allocation. Every modern serving engine (vLLM, TensorRT-LLM, SGLang) is largely defined by how cleverly it manages KV cache memory.

When you read about quantization, PagedAttention, GQA, or prefix caching, you're reading about solutions to this one problem: KV tensors are expensive, and we need to fit more of them in GPU memory.

Quick Reference: Shapes You'll See in LLM Code

Variable | Meaning | Example (Llama 3 8B)
B | Batch size (concurrent requests) | 4, 16, 64
T or seq_len | Sequence length (tokens) | 1024, 8192
D or hidden_dim | Model width (embeddings) | 4096
H or num_heads | Attention heads (Q) | 32
Hkv or num_kv_heads | KV heads (fewer with GQA) | 8
Dh or head_dim | Per-head embedding size | 128
L or num_layers | Transformer layers | 32
V or vocab_size | Number of possible tokens | 128,256

Now when you read: "Every input token generates a key and value tensor that must remain in GPU memory throughout the generation process," you know exactly what that means: a K tensor and a V tensor of shape [num_kv_heads, head_dim] per token per layer, in BF16, with the cache growing by one row for every generated token, multiplied by the number of concurrent requests.

Published March 11, 2026. Written as a companion explainer to the vLLM research post.