What Is a Tensor?
A tensor is a multi-dimensional array of numbers. That's it. If you've ever worked with arrays, matrices, or NumPy, you already understand tensors; you just might not have called them that.
The word "tensor" comes from physics and differential geometry, where it has a more precise meaning involving coordinate transformations. In machine learning, the term is used more loosely to mean: any N-dimensional array of numbers that you can do math on.
| Name | Dimensions | Example | In Code |
|---|---|---|---|
| Scalar | 0-D tensor | The number 42 | torch.tensor(42) |
| Vector | 1-D tensor | [1, 2, 3, 4] | torch.tensor([1,2,3,4]) |
| Matrix | 2-D tensor | A spreadsheet of numbers | torch.zeros(4, 8) |
| 3-D Tensor | 3-D | A stack of matrices (like an RGB image) | torch.zeros(3, 224, 224) |
| N-D Tensor | N dimensions | Batch of sequences of token embeddings | torch.zeros(B, T, D) |
Every number in a tensor is called an element. Every tensor has a shape: a tuple describing how many elements exist along each dimension. A tensor with shape (4, 8) has 4 rows and 8 columns, for 32 total elements.
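The element count is just the product of the shape's entries. Here is a minimal sketch in plain Python; `num_elements` is a hypothetical helper for illustration, not a PyTorch function.

```python
import math

def num_elements(shape):
    """Total number of elements in a tensor with the given shape."""
    # math.prod of an empty tuple is 1, which matches a 0-D scalar tensor.
    return math.prod(shape)

matrix_count = num_elements((4, 8))          # 4 rows x 8 columns
image_count = num_elements((3, 224, 224))    # 3 channels x 224 x 224 pixels
scalar_count = num_elements(())              # a scalar holds exactly one number
```

In PyTorch itself, `x.numel()` gives you the same count for a real tensor.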
Tensors in Code
In PyTorch (the dominant ML framework), tensors look and feel like NumPy arrays:
import torch
# A 2D tensor (matrix), shape (3, 4)
x = torch.tensor([
[1.0, 2.0, 3.0, 4.0],
[5.0, 6.0, 7.0, 8.0],
[9.0, 10.0, 11.0, 12.0],
])
print(x.shape) # torch.Size([3, 4])
print(x.dtype) # torch.float32
print(x.device) # cpu (or "cuda:0" if on GPU)
# Matrix multiply: (3,4) × (4,2) → (3,2)
y = torch.zeros(4, 2)
result = x @ y # @ is matrix multiply in Python
print(result.shape) # torch.Size([3, 2])
Two properties you'll see constantly in ML code:
- .shape: the dimensions (e.g., [batch_size, seq_len, hidden_dim])
- .dtype: the number type (float32, float16, bfloat16, int8)
Shape and Dtype: The Two Things That Matter Most
Shape
Shape, together with dtype, tells you exactly how much memory a tensor needs, and it determines what operations you can do with it. LLM code is full of shapes like:
(B, H, T, T) → batch × num_heads × seq × seq (attention scores)
(B, H, T, Dh) → batch × num_heads × seq × head_dim (KV cache!)
- B = batch size (how many requests are being processed at once)
- T = sequence length (how many tokens in the input)
- D = hidden dimension / model width (e.g., 4096 for Llama-3-8B)
- H = number of attention heads (e.g., 32 for Llama-3-8B)
- Dh = head dimension = D / H (e.g., 128)
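To see how these shapes combine, here is a small sketch that checks the shape rule for batched matrix multiplication. `matmul_shape` is a hypothetical helper, and it assumes both operands have identical leading (batch) dimensions, which is stricter than PyTorch's real broadcasting rules.

```python
def matmul_shape(a, b):
    """Shape of a @ b for two tensors with identical leading (batch) dims."""
    *a_batch, n, k = a
    *b_batch, k2, m = b
    if a_batch != b_batch or k != k2:
        raise ValueError(f"incompatible shapes: {a} @ {b}")
    return (*a_batch, n, m)

B, H, T, Dh = 4, 32, 1024, 128   # illustrative sizes

q = (B, H, T, Dh)
k_transposed = (B, H, Dh, T)     # K with its last two dims swapped
v = (B, H, T, Dh)

scores = matmul_shape(q, k_transposed)   # attention scores: (B, H, T, T)
out = matmul_shape(scores, v)            # attention output: (B, H, T, Dh)
```

The inner dimensions must match and then disappear, which is why the scores end up as T by T per head.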
Dtype: The Precision Tradeoff
Every number in a tensor takes up a fixed amount of memory depending on its dtype:
| Dtype | Bytes/element | Range | Used For |
|---|---|---|---|
| float32 (FP32) | 4 bytes | ±3.4 × 10³⁸ | Training (full precision) |
| bfloat16 (BF16) | 2 bytes | Same range as FP32, less precision | Training + inference (modern default) |
| float16 (FP16) | 2 bytes | Smaller range than FP32 | Inference (older GPUs) |
| int8 | 1 byte | -128 to 127 | Quantized inference (W8A8) |
| float8 (FP8) | 1 byte | Very limited | Quantized inference on H100+ |
This is why quantization matters: switching a model from BF16 to INT8 cuts memory usage in half. Switching to INT4 (half a byte per element) cuts it by 75%. Starting from FP32, the same moves save 75% and 87.5%. The model weights, and crucially the KV cache, are all stored in these dtypes.
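The arithmetic behind those savings is short enough to sketch. The byte sizes come from the table; the `int4` entry (two values packed per byte) and the helper names are assumptions for illustration, and real quantized formats also store small scale factors that this ignores.

```python
# Bytes per element; int4 is an assumed entry that packs two values per byte.
BYTES_PER_ELEMENT = {
    "float32": 4, "bfloat16": 2, "float16": 2,
    "int8": 1, "float8": 1, "int4": 0.5,
}

def tensor_bytes(num_elements, dtype):
    """Memory needed to store num_elements values of the given dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype]

def savings(from_dtype, to_dtype):
    """Fraction of memory saved when converting between dtypes."""
    return 1 - BYTES_PER_ELEMENT[to_dtype] / BYTES_PER_ELEMENT[from_dtype]

matrix_fp32_bytes = tensor_bytes(4 * 8, "float32")   # the (4, 8) matrix from earlier
bf16_to_int8 = savings("bfloat16", "int8")
bf16_to_int4 = savings("bfloat16", "int4")
fp32_to_int8 = savings("float32", "int8")
```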
Tensors in Neural Networks
A neural network is a sequence of tensor transformations. Data flows in as a tensor, gets multiplied by weight tensors (the "parameters" that were learned during training), and produces output tensors.
The weight tensors are what get stored on your GPU when you "load a model." A 7B parameter model has 7 billion numbers in its weight tensors, all stored in VRAM. At BF16 (2 bytes each), that's 14 GB just for the weights.
The Transformer Attention Mechanism
Transformers are the architecture behind every modern LLM: GPT, Llama, Qwen, Mistral, all of them. The defining feature of a transformer is the self-attention mechanism. This is where K and V tensors come from.
The fundamental question attention answers: for each token in the sequence, how much should it "pay attention to" every other token?
Example: in the sentence "The animal didn't cross the street because it was too tired," the word "it" needs to look back and find that "animal" is what it refers to. Attention computes this relationship mathematically.
How Attention Works (Step by Step)
For each layer of the transformer, every input token gets projected into three tensors through learned weight matrices:
# Each token's embedding (shape: [hidden_dim]) gets projected to:
Q = token_embedding @ W_Q # Query: "What am I looking for?"
K = token_embedding @ W_K # Key: "What do I contain?"
V = token_embedding @ W_V # Value: "What do I output if attended to?"
# W_Q, W_K, W_V are learned weight matrices, shape [D, Dh]
# Q, K, V each have shape [seq_len, head_dim] per attention head
Then attention scores are computed by asking: how much does each Query match each Key?
# Attention scores: how much does token i attend to token j?
scores = Q @ K.transpose(-1, -2) # shape: [seq_len, seq_len]
scores = scores / head_dim ** 0.5 # scale to prevent huge values
weights = torch.softmax(scores, dim=-1) # each row becomes probabilities (sums to 1)
# Weighted sum of Values
output = weights @ V # shape: [seq_len, head_dim]
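If you want to run the steps above end to end, here is a minimal single-head sketch in plain Python (lists instead of tensors, so it needs no PyTorch). The toy values are arbitrary, the helper names are mine, and it leaves out the causal mask that decoder-only LLMs apply, since the walkthrough above leaves it out too.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; inputs are [seq_len][head_dim]."""
    head_dim = len(Q[0])
    K_t = [list(col) for col in zip(*K)]             # transpose K
    scores = matmul(Q, K_t)                          # [seq_len][seq_len]
    scaled = [[s / math.sqrt(head_dim) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]       # each row sums to 1
    return weights, matmul(weights, V)               # output: [seq_len][head_dim]

# Toy inputs: 3 tokens, head_dim = 2 (values are arbitrary).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights, output = attention(Q, K, V)
```

Each row of `weights` says how one token spreads its attention over all tokens, and each row of `output` is the matching blend of Value rows.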
Q, K, V: What They Actually Are
Let's be concrete about what these tensors look like for a real model.
Take Llama 3 8B:
- Hidden dimension (D): 4096
- Number of attention heads (H): 32
- Head dimension (Dh): 128 (= 4096 / 32)
- Number of layers: 32
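For a single request with a 1,000-token input sequence, per attention head and per layer:
K shape: [1000, 128] (same shape as Q)
V shape: [1000, 128] (same shape as Q)
Storing K and V for all 32 heads across all 32 layers at BF16: 2 × 32 × 32 × 1000 × 128 × 2 bytes = ~524 MB just for one 1K-token request. (Llama 3 8B actually uses GQA and keeps only 8 KV heads, which brings this down to ~131 MB.)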
Notice something critical: Q is not cached. Queries are computed fresh for every new token. But K and V are different: every time the model generates a new token, it needs to attend to all previous tokens, which means it needs their K and V tensors. If you recomputed them from scratch every step, you'd redo O(T²) work. So you save them. That's the KV cache.
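As a data structure, the idea is tiny. Here is a toy sketch for one layer and one head, assuming the projected q, k, and v rows are handed in already computed; `ToyKVCache` and `decode_step` are hypothetical names, not any library's API.

```python
class ToyKVCache:
    """Append-only store of per-token K and V rows for one layer and one head."""

    def __init__(self):
        self.keys = []     # one [head_dim] row per token seen so far
        self.values = []

    def append(self, k_row, v_row):
        self.keys.append(k_row)
        self.values.append(v_row)

    def __len__(self):
        return len(self.keys)

def decode_step(cache, q_row, k_row, v_row):
    """Process one new token: store its K/V, then score its query against all keys."""
    cache.append(k_row, v_row)
    # The query is used once and then discarded; only K and V persist.
    return [sum(q * k for q, k in zip(q_row, key)) for key in cache.keys]

cache = ToyKVCache()
# Hypothetical projected (q, k, v) rows for three tokens, head_dim = 2.
steps = [([1.0, 0.0], [1.0, 0.0], [0.1, 0.2]),
         ([0.0, 1.0], [0.0, 1.0], [0.3, 0.4]),
         ([1.0, 1.0], [1.0, 1.0], [0.5, 0.6])]
scores_per_step = [decode_step(cache, q, k, v) for q, k, v in steps]
```

Each step adds exactly one row to the cache and scores the new query against every row stored so far.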
Why the KV Cache Exists
LLMs generate text one token at a time. At each step, the model needs to run the full attention computation over all previous tokens. Without caching, that means recomputing the K and V tensors for every earlier token at every step.
The KV cache is a performance optimization that trades memory for compute. Instead of recomputing key and value tensors for every previous token at every generation step, you save them once and read them back. This is the single most important optimization in LLM inference: without it, generating a 1,000-token response from a 1,000-token input would mean pushing roughly 1.5 million token positions through the model instead of about 2,000, which is several hundred times more computation.
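A rough way to size the saving is to count how many token positions the model has to process with and without the cache. This sketch counts positions only, which is a simplification: it ignores that attention work per position also grows with context length.

```python
def positions_without_cache(prompt_len, gen_len):
    """Each step re-runs the model over the prompt plus everything generated so far."""
    return sum(prompt_len + i for i in range(gen_len))

def positions_with_cache(prompt_len, gen_len):
    """One prefill pass over the prompt, then one new position per later step."""
    return prompt_len + (gen_len - 1)

no_cache = positions_without_cache(1000, 1000)
with_cache = positions_with_cache(1000, 1000)
ratio = no_cache / with_cache
```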
The Memory Problem
Here's the catch: the KV cache must live in GPU VRAM for fast access. And it's enormous.
Three specific problems make this hard:
1. Size Uncertainty
When a request arrives, you don't know how long the response will be. Will it generate 10 tokens? 2,000? The KV cache needs to grow as generation proceeds, but traditional GPU memory allocators work best when you can reserve a fixed block upfront. Systems would pre-allocate the maximum possible context length just to be safe, wasting huge amounts of VRAM for short requests.
Example: if max_context = 8,192 tokens, every request reserves memory for 8,192 tokens of KV cache, even if it only generates 50 tokens. For Llama 3 8B at BF16, that's about 1 GB reserved per request (about 4 GB if all 32 heads kept their own K and V), whether you use it or not.
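To put a number on the waste, here is a sketch under assumed settings: a Llama 3 8B-style configuration (32 layers, 8 KV heads, head_dim 128, BF16) and a request that ends up using only 50 tokens of cache. The helper name is hypothetical.

```python
# Assumed config: 32 layers, 8 KV heads, head_dim 128, 2 bytes per element (BF16).
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # K and V across all layers: 131,072 bytes

def reservation_waste(max_context, tokens_used):
    """Bytes reserved, bytes actually used, and the wasted fraction."""
    reserved = max_context * BYTES_PER_TOKEN
    used = tokens_used * BYTES_PER_TOKEN
    return reserved, used, 1 - used / reserved

reserved, used, wasted = reservation_waste(max_context=8192, tokens_used=50)
```

Under these assumptions the request reserves 1 GiB and touches about 6.5 MB of it, so more than 99% of the reservation sits idle.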
2. Contiguous Allocation
Traditional GPU memory allocators (like cudaMalloc) require a single contiguous block of memory for each allocation. KV cache for a sequence grows over time: each new token extends the tensor by one row. In a contiguous allocator, this means you must either:
- Pre-allocate the maximum length upfront (wasteful), or
- Reallocate and copy the entire cache when you run out of space (slow), or
- Impose hard limits (inflexible)
Additionally, as requests of different lengths complete and free their memory, you get fragmentation: small gaps between allocations that individually can't fit new requests. The GPU might show 4 GB "free" but no single contiguous block larger than 500 MB, meaning you can't actually fit a new request.
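The gap between "total free" and "usable" is easy to illustrate. In this sketch the list of free gaps is made up, and `can_allocate` models only the contiguity requirement, not a real allocator.

```python
def largest_free_block(free_blocks):
    """Size of the biggest single gap, or 0 if nothing is free."""
    return max(free_blocks, default=0)

def can_allocate(free_blocks, request_mb):
    """A contiguous allocator needs one free block at least as large as the request."""
    return largest_free_block(free_blocks) >= request_mb

# Hypothetical free gaps (in MB) left behind by finished requests.
free_blocks_mb = [500, 500, 500, 500, 500, 500, 500, 500]

total_free = sum(free_blocks_mb)                      # 4000 MB free in total
fits_contiguous = can_allocate(free_blocks_mb, 1000)  # but no gap holds 1000 MB
```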
3. No Sharing
In many API deployments, every request starts with the same system prompt: a set of instructions like "You are a helpful assistant..." or a company's specific instructions. The K and V tensors for that system prompt are identical across all requests, but naive systems allocate separate copies for each. If 100 requests are being served simultaneously and all share a 500-token system prompt, you're storing 100 identical copies of those KV tensors when you only need one.
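Here is what that duplication costs in bytes, again assuming a Llama 3 8B-style configuration (32 layers, 8 KV heads, head_dim 128, BF16). The function name is hypothetical.

```python
# Assumed config: 32 layers, 8 KV heads, head_dim 128, 2 bytes per element (BF16).
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2

def prefix_bytes(num_requests, prefix_tokens, shared):
    """KV cache bytes spent on the common prefix, with or without sharing."""
    copies = 1 if shared else num_requests
    return copies * prefix_tokens * BYTES_PER_TOKEN

duplicated = prefix_bytes(100, 500, shared=False)   # one copy per request
shared_once = prefix_bytes(100, 500, shared=True)   # a single shared copy
```

With these numbers the duplicated copies take about 6.5 GB, against about 65 MB for one shared copy.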
How Big Is the KV Cache, Really?
Let's calculate this concretely for a few real models at BF16 (2 bytes per element):
# KV cache size formula:
# bytes = 2 × num_layers × num_kv_heads × seq_len × head_dim × bytes_per_element
#
# The "2×" accounts for both K and V tensors
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * dtype_bytes
# Llama 3 8B: 32 layers, 8 KV heads (GQA!), head_dim=128
llama_8b = kv_cache_bytes(32, 8, 8192, 128) # ~1.07 GB per request at 8K ctx
# Llama 3 70B: 80 layers, 8 KV heads, head_dim=128
llama_70b = kv_cache_bytes(80, 8, 8192, 128) # ~2.68 GB per request at 8K ctx
# Qwen2.5 72B: 80 layers, 8 KV heads, head_dim=128
qwen_72b = kv_cache_bytes(80, 8, 131072, 128) # ~42.9 GB per request at 128K ctx
Why This Shapes Everything in LLM Serving
Understanding tensors and the KV cache explains design decisions across the entire LLM inference stack:
PagedAttention (vLLM)
vLLM's breakthrough was applying OS paging to KV cache memory. Instead of one contiguous block per request, KV cache is split into fixed-size pages (blocks of 16 tokens each). Pages don't need to be contiguous in physical GPU memory: a logical block table maps them, just like a CPU's page table maps virtual to physical addresses. This eliminates fragmentation and over-reservation, dropping memory waste from 60-80% to under 4%.
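The block-table idea can be sketched in a few lines. This is a toy illustration of paging, not vLLM's actual code: `BlockAllocator` and `Sequence` are hypothetical names, and the sketch tracks block ids only, without storing any K or V data.

```python
BLOCK_SIZE = 16  # tokens per block, as in the text

class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)

class Sequence:
    """Tracks one request's logical-to-physical block table."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, token_index):
        """Physical (block id, offset) where a token's K/V rows would live."""
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def free(self):
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0

allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(40):
    seq_a.append_token()
for _ in range(5):
    seq_b.append_token()
blocks_a = len(seq_a.block_table)
blocks_b = len(seq_b.block_table)
```

A 40-token sequence holds three blocks and wastes at most the unused tail of its last block, instead of reserving space for the maximum context length. Freed blocks go back on the free list and can serve any other request, wherever they sit in memory.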
Quantization
Every quantization decision affects the KV cache directly. Dropping from BF16 to INT8 cuts KV cache memory in half, letting you serve roughly twice as many concurrent requests in the same KV cache budget. An FP8 KV cache is also 1 byte per element, so it gives the same halving relative to BF16. This is why FP8 inference on H100s is so impactful: it's not just the weights, it's the KV cache too.
Context Length Scaling
KV cache grows linearly with context length, but the attention computation is quadratic. Doubling context length doubles KV cache memory but quadruples attention compute. This is why long-context models (Llama 3.1 at 128K, Qwen2.5-1M at 1M tokens) are significantly harder to serve than 8K-context models: on top of the model weights, the memory required per request balloons.
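The linear-versus-quadratic split shows up directly if you count elements. In this sketch the default sizes are Llama 3 8B-style assumptions, and counting score-matrix entries is only a proxy for attention compute.

```python
def kv_cache_elements(seq_len, num_layers=32, num_kv_heads=8, head_dim=128):
    """K and V elements stored for one request: linear in seq_len."""
    return 2 * num_layers * num_kv_heads * seq_len * head_dim

def attention_score_entries(seq_len, num_layers=32, num_heads=32):
    """Entries in the [seq_len, seq_len] score matrices for a full pass: quadratic."""
    return num_layers * num_heads * seq_len * seq_len

memory_growth = kv_cache_elements(16384) / kv_cache_elements(8192)
compute_growth = attention_score_entries(16384) / attention_score_entries(8192)
```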
Speculative Decoding
Speculative decoding is a technique where a small "draft" model predicts several tokens ahead, and the big model verifies them all at once in a single forward pass. This works because drafting is cheap (small model, small KV cache) and verification is batched. The speedup comes from accepting multiple tokens per expensive forward pass of the big model instead of one token per pass.
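Here is a toy sketch of the draft-then-verify loop. It implements the greedy variant only (production systems that sample use a rejection-sampling rule instead), and the two "models" are made-up functions from a token tuple to the next token, so this shows the control flow rather than real inference.

```python
def greedy_decode(model, context, num_tokens):
    """Baseline: one model call per generated token."""
    out = list(context)
    for _ in range(num_tokens):
        out.append(model(tuple(out)))
    return out[len(context):]

def speculative_decode(target, draft, context, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(context)
    passes = 0
    while len(out) - len(context) < num_tokens:
        # 1. The draft model proposes k tokens, one at a time (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(tuple(out + proposal)))
        # 2. The target checks every proposed position. A real engine does this in
        #    one batched forward pass; here it is simulated with up to k + 1 calls.
        passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target(tuple(out + proposal[:i]))
            accepted.append(expected)          # the target's own token is always kept
            if i == k or proposal[i] != expected:
                break                          # stop at the first disagreement
        out.extend(accepted)
    return out[len(context):len(context) + num_tokens], passes

# Hypothetical toy "models": functions from a token tuple to the next token.
def target_model(tokens):
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_model(tokens):
    # Agrees with the target except when the context length is a multiple of 5.
    guess = target_model(tokens)
    return guess if len(tokens) % 5 else (guess + 1) % 50

prompt = (3, 14, 15)
baseline = greedy_decode(target_model, prompt, 20)
speculated, verify_passes = speculative_decode(target_model, draft_model, prompt, 20)
```

The output matches plain greedy decoding with the target model, but it takes far fewer verification passes than tokens, because each pass keeps every draft token up to the first disagreement plus one token from the target.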
Prefix Caching
If the system prompt is always the same, compute its K and V tensors once, save them in a shared prefix cache, and reuse that cache for every new request. This is the "no sharing" problem solved. vLLM implements this via its copy-on-write block sharing in PagedAttention.
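A minimal sketch of the idea is a dictionary keyed by the prefix tokens. `PrefixCache` and `fake_compute_kv` are hypothetical stand-ins, and matching the whole prefix at once is a simplification: a real engine may instead match at block granularity.

```python
class PrefixCache:
    """Maps a token prefix to its (placeholder) K/V data, computed at most once."""

    def __init__(self, compute_kv):
        self.compute_kv = compute_kv
        self.store = {}
        self.misses = 0

    def get(self, prefix_tokens):
        key = tuple(prefix_tokens)        # hashable key for the shared prefix
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.compute_kv(key)
        return self.store[key]

def fake_compute_kv(tokens):
    # Stand-in for the real prefill pass; returns one placeholder entry per token.
    return [("k", t, "v", t) for t in tokens]

system_prompt = list(range(500))          # hypothetical 500-token system prompt
cache = PrefixCache(fake_compute_kv)
results = [cache.get(system_prompt) for _ in range(100)]
```

All 100 requests get the same object back, and the expensive computation runs once.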
Mental Model Summary
🧠 The Complete Picture
A tensor is just a multi-dimensional array of numbers. Shape tells you its dimensions; dtype tells you how many bytes each number takes.
Transformer attention uses three tensors per token per layer: Query (what am I looking for?), Key (what do I contain?), Value (what information do I carry?). Q is ephemeral. K and V must be saved.
The KV cache is those saved K and V tensors, stored in GPU VRAM so generation doesn't recompute them. It's the performance foundation of all LLM inference.
The KV cache is also the primary bottleneck: it's large (hundreds of MB to tens of GB per request), grows over time, and naive systems waste 60-80% of it through fragmentation and over-allocation. Every modern serving engine (vLLM, TensorRT-LLM, SGLang) is largely defined by how cleverly it manages KV cache memory.
When you read about quantization, PagedAttention, GQA, or prefix caching, you're reading about solutions to this one problem: KV tensors are expensive, and we need to fit more of them in GPU memory.
Quick Reference: Shapes You'll See in LLM Code
| Variable | Meaning | Example (Llama 3 8B) |
|---|---|---|
| B | Batch size (concurrent requests) | 4, 16, 64 |
| T or seq_len | Sequence length (tokens) | 1024, 8192 |
| D or hidden_dim | Model width (embeddings) | 4096 |
| H or num_heads | Attention heads (Q) | 32 |
| Hkv or num_kv_heads | KV heads (fewer with GQA) | 8 |
| Dh or head_dim | Per-head embedding size | 128 |
| L or num_layers | Transformer layers | 32 |
| V or vocab_size | Number of possible tokens | 128,256 |
The KV cache stores K and V, each [num_kv_heads, head_dim] per layer, in BF16, growing by one row every generated token, multiplied by the number of concurrent requests.
Further Reading
- vLLM: The Production LLM Inference Engine - how PagedAttention solves the KV cache problem
- PagedAttention Paper (SOSP 2023) - the original vLLM research paper
- GQA: Training Generalized Multi-Query Transformer Models - how Grouped Query Attention reduces KV cache size
- PyTorch Tensor Tutorial - hands-on introduction to tensors in code
Published March 11, 2026. Written as a companion explainer to the vLLM research post.