🧮 Tensors Explained: From Arrays to KV Cache

The math behind LLMs isn't exotic — it's just arrays. Here's everything you need to understand tensors, how they move through transformer models, and why Key-Value tensors are the single biggest constraint in production LLM serving.

March 11, 2026 · 15 min read · Foundations, ML Inference

Who this is for: Software engineers who understand arrays and data structures but want to make sense of tensor math, transformers, and why KV cache is the central bottleneck in LLM inference. No ML background required.

What Is a Tensor?

A tensor is a multi-dimensional array of numbers. That's it. If you've ever worked with arrays, matrices, or NumPy, you already understand tensors — you just might not have called them that.

The word "tensor" comes from physics and differential geometry, where it has a more precise meaning involving coordinate transformations. In machine learning, the term is used more loosely to mean: any N-dimensional array of numbers that you can do math on.

Name	Dimensions	Example	In Code
Scalar	0-D tensor	The number `42`	`torch.tensor(42)`
Vector	1-D tensor	`[1, 2, 3, 4]`	`torch.tensor([1,2,3,4])`
Matrix	2-D tensor	A spreadsheet of numbers	`torch.zeros(4, 8)`
3-D Tensor	3-D	A stack of matrices (like an RGB image)	`torch.zeros(3, 224, 224)`
N-D Tensor	N dimensions	Batch of sequences of token embeddings	`torch.zeros(B, T, D)`

Every number in a tensor is called an element. Every tensor has a shape — a tuple describing how many elements exist along each dimension. A tensor with shape (4, 8) has 4 rows and 8 columns, for 32 total elements.
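The relationship between shape and element count is just a product over the dimensions. A minimal sketch in plain Python (the helper name `num_elements` is made up for illustration):

```python
import math

def num_elements(shape):
    """Total number of elements in a tensor with the given shape."""
    # math.prod of an empty tuple is 1, which matches a 0-D scalar tensor.
    return math.prod(shape)

print(num_elements((4, 8)))         # 32
print(num_elements(()))             # 1 (a scalar)
print(num_elements((3, 224, 224)))  # 150528
```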

The key insight: All the "intelligence" in a neural network lives in tensors — billions of numbers arranged in multi-dimensional arrays, combined through matrix multiplication. There's no magic, just a lot of organized arithmetic.

Tensors in Code

In PyTorch (the dominant ML framework), tensors look and feel like NumPy arrays:

import torch

# A 2D tensor (matrix) — shape (3, 4)
x = torch.tensor([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])
print(x.shape)   # torch.Size([3, 4])
print(x.dtype)   # torch.float32
print(x.device)  # cpu  (or "cuda:0" if on GPU)

# Matrix multiply: (3,4) × (4,2) → (3,2)
y = torch.zeros(4, 2)
result = x @ y   # @ is matrix multiply in Python
print(result.shape)  # torch.Size([3, 2])
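The shape rule behind that matrix multiply can be written down directly: the inner dimensions must match, and the result takes the outer ones. A small sketch for the 2-D case only (`matmul_shape` is a hypothetical helper, not a PyTorch function):

```python
def matmul_shape(a_shape, b_shape):
    """Result shape of a 2-D matrix multiply; raises ValueError if shapes are incompatible."""
    (m, k1), (k2, n) = a_shape, b_shape
    if k1 != k2:
        raise ValueError(f"inner dimensions differ: {k1} vs {k2}")
    return (m, n)

print(matmul_shape((3, 4), (4, 2)))  # (3, 2)
```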

Two properties you'll see constantly in ML code:

.shape — the dimensions (e.g., [batch_size, seq_len, hidden_dim])
.dtype — the number type (float32, float16, bfloat16, int8)

Shape and Dtype: The Two Things That Matter Most

Shape

Shape tells you exactly how much memory a tensor needs and what operations you can do with it. LLM code is full of shapes like:

Common tensor shapes in LLM inference:

(B, T, D) — batch × sequence_length × hidden_dim
(B, H, T, T) — batch × num_heads × seq × seq (attention scores)
(B, H, T, Dh) — batch × num_heads × seq × head_dim (KV cache!)

B = batch size (how many requests are being processed at once)
T = sequence length (how many tokens in the input)
D = hidden dimension / model width (e.g., 4096 for Llama-3-8B)
H = number of attention heads (e.g., 32 for Llama-3-8B)
Dh = head dimension = D / H (e.g., 128)

Dtype — The Precision Tradeoff

Every number in a tensor takes up a fixed amount of memory depending on its dtype:

Dtype	Bytes/element	Range	Used For
`float32` (FP32)	4 bytes	±3.4 × 10³⁸	Training (full precision)
`bfloat16` (BF16)	2 bytes	Same range as FP32, less precision	Training + inference (modern default)
`float16` (FP16)	2 bytes	±65,504 (much smaller than FP32)	Inference (older GPUs)
`int8`	1 byte	-128 to 127	Quantized inference (W8A8)
`float8` (FP8)	1 byte	±448 (E4M3) or ±57,344 (E5M2)	Quantized inference on H100+

This is why quantization matters: switching a model from BF16 to INT8 cuts memory usage in half, and switching to INT4 cuts it by 75%. (Starting from FP32, INT8 alone is already a 75% cut.) The model weights, and crucially the KV cache, are all stored in these dtypes.
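To check savings like these yourself, multiply the element count by the bytes per element. A rough sketch using the sizes from the table; the `int4` entry (two values packed per byte) and the helper names are assumptions added for illustration:

```python
import math

# Bytes per element for the dtypes in the table; int4 packs two values per byte.
BYTES_PER_ELEMENT = {
    "float32": 4, "bfloat16": 2, "float16": 2,
    "int8": 1, "float8": 1, "int4": 0.5,
}

def tensor_bytes(shape, dtype):
    """Memory needed for a tensor of this shape and dtype."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]

def saving(from_dtype, to_dtype):
    """Fraction of memory saved by switching dtypes."""
    return 1 - BYTES_PER_ELEMENT[to_dtype] / BYTES_PER_ELEMENT[from_dtype]

print(tensor_bytes((4, 8), "float32"))  # 128
print(saving("bfloat16", "int8"))       # 0.5
print(saving("bfloat16", "int4"))       # 0.75
print(saving("float32", "int8"))        # 0.75
```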

Tensors in Neural Networks

A neural network is a sequence of tensor transformations. Data flows in as a tensor, gets multiplied by weight tensors (the "parameters" that were learned during training), and produces output tensors.

Input Tensor → × Weight Matrix → Activation Function → Output Tensor

Repeat this pattern hundreds of times → you have a transformer
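The pattern above fits in a few lines of dependency-free Python. This is only a toy with made-up weights and ReLU as the activation; real models run the same idea on GPU tensors with billions of learned values:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Activation function: replace negative values with zero."""
    return [[max(0.0, v) for v in row] for row in m]

def layer(x, w):
    """One step of the pattern: input × weight matrix, then activation."""
    return relu(matmul(x, w))

x = [[1.0, 2.0]]                  # input tensor, shape (1, 2)
w1 = [[1.0, -1.0, 0.5],           # weight tensor, shape (2, 3)
      [0.5, -1.0, 1.0]]
w2 = [[1.0], [1.0], [1.0]]        # weight tensor, shape (3, 1)

h = layer(x, w1)                  # shape (1, 3)
y = layer(h, w2)                  # shape (1, 1)
print(h)  # [[2.0, 0.0, 2.5]]
print(y)  # [[4.5]]
```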

The weight tensors are what get stored on your GPU when you "load a model." A 7B parameter model has 7 billion numbers in its weight tensors, all stored in VRAM. At BF16 (2 bytes each), that's 14 GB just for the weights.

Mental model: Model loading = copying weight tensors from CPU RAM to GPU VRAM. Running inference = performing tensor math (mostly matrix multiplications) on those weights.

The Transformer Attention Mechanism

Transformers are the architecture behind every modern LLM — GPT, Llama, Qwen, Mistral, all of them. The defining feature of a transformer is the self-attention mechanism. This is where K and V tensors come from.

The fundamental question attention answers: for each token in the sequence, how much should it "pay attention to" every other token?

Example: in the sentence "The animal didn't cross the street because it was too tired," the word "it" needs to look back and find that "animal" is what it refers to. Attention computes this relationship mathematically.

How Attention Works (Step by Step)

For each layer of the transformer, every input token gets projected into three tensors through learned weight matrices:

# token_embedding holds every token's embedding, shape [seq_len, D].
# Each attention head projects it three ways:
Q = token_embedding @ W_Q   # Query:  "What am I looking for?"
K = token_embedding @ W_K   # Key:    "What do I contain?"
V = token_embedding @ W_V   # Value:  "What do I output if attended to?"

# W_Q, W_K, W_V are learned weight matrices, shape [D, Dh] per head
# Q, K, V each have shape [seq_len, head_dim] per attention head

Then attention scores are computed by asking: how much does each Query match each Key?

import math

# Attention scores: how much does token i attend to token j?
scores = Q @ K.transpose(-1, -2)          # shape: [seq_len, seq_len]
scores = scores / math.sqrt(head_dim)     # scale to prevent huge values
weights = torch.softmax(scores, dim=-1)   # each row becomes probabilities that sum to 1

# Weighted sum of Values
output = weights @ V                      # shape: [seq_len, head_dim]

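In plain English: Queries ask questions. Keys advertise their content. Values carry the actual information. Attention computes how much each query matches each key, then returns a weighted mix of the values. In the simplified form above every token can look at every other token; decoder-only LLMs add a causal mask so that each token attends only to itself and earlier tokens.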
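Here is the same computation for a single head, written in plain Python so each step is visible. It is a sketch rather than how any framework implements attention, and it adds the causal mask that decoder-only LLMs use (token i only sees tokens 0..i). The tiny Q, K, V values are made up:

```python
import math

def softmax(row):
    m = max(row)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention over lists of vectors; token i attends to tokens 0..i."""
    head_dim = len(Q[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Scaled dot product of query i with every visible key.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * V[j][d] for j, w in enumerate(weights))
            for d in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(Q, K, V)
print(outputs[0])  # [1.0, 2.0]: the first token can only attend to itself
```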

Q, K, V — What They Actually Are

Let's be concrete about what these tensors look like for a real model.

Take Llama 3 8B:

Hidden dimension (D): 4096
Number of attention heads (H): 32
Head dimension (Dh): 128 (= 4096 / 32)
Number of layers: 32

For a single request with a 1,000-token input sequence:

Per layer, per attention head, per request:

Q shape: [1000, 128] → 1000 tokens × 128 dims = 128,000 numbers
K shape: [1000, 128] → same
V shape: [1000, 128] → same

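For the full model, if each of the 32 heads kept its own K and V (32 heads × 32 layers):

KV total: 2 × [1000, 32, 128] × 32 layers × 2 bytes (BF16)
= ~524 MB just for one 1K-token request

Llama 3 8B actually stores only 8 KV heads per layer (Grouped Query Attention, covered below), which brings this down to ~131 MB.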

Notice something critical: Q is not cached. Queries are computed fresh for every new token. K and V are different: every time the model generates a new token, it needs to attend to all previous tokens, which means it needs their K and V tensors. If you recomputed them from scratch at every step, you'd redo the projections for all earlier tokens each time, which adds up to O(T²) work over the whole generation. So you save them, and that store is the KV cache.

Why the KV Cache Exists

LLMs generate text one token at a time. At each step, the model needs to run the full attention computation over all previous tokens. Without caching:

Without KV Cache (naive)

Generate token 1 → Compute K,V for all 1 tokens

Generate token 2 → Recompute K,V for all 2 tokens

Generate token N → Recompute K,V for all N tokens

Cost: O(T²) — quadratic in sequence length

With KV Cache

Generate token 1 → Compute K,V for token 1 → Save to cache

Generate token 2 → Compute K,V only for token 2 → Append to cache

Generate token N → Compute K,V only for token N → Read tokens 1 to N-1 from cache

Cost: O(T) — linear in sequence length

The KV cache is a performance optimization that trades memory for compute. Instead of recomputing key and value tensors for every previous token at every generation step, you save them once and read them back. This is the single most important optimization in LLM inference: without it, each step of a 1,000-token response to a 1,000-token input would re-process the 1,000 to 2,000 tokens already in context instead of only the newest one, which is roughly a thousand times more computation per step.
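The difference is easy to count. The sketch below tallies how many per-token K/V projections happen while generating a sequence, with and without a cache. It counts only those projections and ignores everything else in the forward pass; the function names are hypothetical:

```python
def kv_projections_without_cache(num_tokens):
    # Step t recomputes K and V for all t tokens seen so far.
    return sum(range(1, num_tokens + 1))

def kv_projections_with_cache(num_tokens):
    # Step t computes K and V for the new token only; the rest is read back.
    return num_tokens

T = 1000
print(kv_projections_without_cache(T))  # 500500
print(kv_projections_with_cache(T))     # 1000
```

The first count grows as T(T+1)/2, the second as T, which is the quadratic-versus-linear gap described above.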

The Memory Problem

Here's the catch: the KV cache must live in GPU VRAM for fast access. And it's enormous.

There are three specific problems:

1. Size Uncertainty

When a request arrives, you don't know how long the response will be. Will it generate 10 tokens? 2,000? The KV cache needs to grow as generation proceeds — but traditional GPU memory allocators work best when you can reserve a fixed block upfront. Systems would pre-allocate the maximum possible context length just to be safe, wasting huge amounts of VRAM for short requests.

Example: if max_context = 8,192 tokens, every request reserves memory for 8,192 tokens of KV cache, even if it only generates 50 tokens. For Llama 3 8B in BF16, that's about 1 GB reserved per request, whether you use it or not.
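A quick way to see the waste is to compare reserved and used memory for one request. The per-token figure below is an assumption for illustration (K and V, 32 layers, 8 KV heads, 128 dims, BF16); the wasted fraction does not depend on it:

```python
def reserved_vs_used(max_context, tokens_used, bytes_per_token):
    """Return (reserved bytes, used bytes, wasted fraction) for one request."""
    reserved = max_context * bytes_per_token
    used = tokens_used * bytes_per_token
    return reserved, used, 1 - used / reserved

# Assumed per-token KV size: 2 (K and V) x 32 layers x 8 KV heads x 128 dims x 2 bytes
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # 131072 bytes

reserved, used, wasted = reserved_vs_used(8192, 50, BYTES_PER_TOKEN)
print(reserved)          # 1073741824
print(used)              # 6553600
print(f"{wasted:.1%}")   # 99.4%
```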

2. Contiguous Allocation

Traditional GPU memory allocators (like cudaMalloc) require a single contiguous block of memory for each allocation. KV cache for a sequence grows over time — each new token extends the tensor by one row. In a contiguous allocator, this means you must either:

Pre-allocate the maximum length upfront (wasteful), or
Reallocate and copy the entire cache when you run out of space (slow), or
Impose hard limits (inflexible)

Additionally, as requests of different lengths complete and free their memory, you get fragmentation: small gaps between allocations that individually can't fit new requests. The GPU might show 4 GB "free" but no single contiguous block larger than 500 MB — meaning you can't actually fit a new request.
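Fragmentation is easy to reproduce on paper. In this toy model the free memory is a list of gaps (the sizes are invented), and a contiguous allocator can only place a request inside a single gap:

```python
def can_allocate_contiguous(free_gaps_mb, request_mb):
    """A contiguous allocator needs one gap at least as large as the request."""
    return any(gap >= request_mb for gap in free_gaps_mb)

# Hypothetical gaps left behind by completed requests, in MB.
free_gaps_mb = [500, 450, 500, 400, 500, 450, 500, 300, 400]

print(sum(free_gaps_mb))                             # 4000
print(max(free_gaps_mb))                             # 500
print(can_allocate_contiguous(free_gaps_mb, 1000))   # False
```

Four gigabytes are free in total, yet a 1 GB request cannot be placed.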

3. No Sharing

In many API deployments, every request starts with the same system prompt — a set of instructions like "You are a helpful assistant..." or a company's specific instructions. The K and V tensors for that system prompt are identical across all requests, but naive systems allocate separate copies for each. If 100 requests are being served simultaneously and all share a 500-token system prompt, you're storing 100 identical copies of those KV tensors when you only need one.
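The cost of duplicating that prefix can be estimated directly. The per-token size is again an assumed figure (2 × 32 layers × 8 KV heads × 128 dims × 2 bytes), used only to put numbers on the example in the text:

```python
def prefix_kv_bytes(num_requests, prefix_tokens, bytes_per_token, shared):
    """KV memory spent on the common prefix across all requests."""
    copies = 1 if shared else num_requests
    return copies * prefix_tokens * bytes_per_token

BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # assumed: 131072 bytes per token

naive = prefix_kv_bytes(100, 500, BYTES_PER_TOKEN, shared=False)
shared = prefix_kv_bytes(100, 500, BYTES_PER_TOKEN, shared=True)
print(naive)    # 6553600000  (~6.6 GB)
print(shared)   # 65536000    (~66 MB)
```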

How Big Is the KV Cache, Really?

Let's calculate this concretely for a few real models at BF16 (2 bytes per element):

# KV cache size formula:
# bytes = 2 × num_layers × num_kv_heads × seq_len × head_dim × bytes_per_element
#
# The "2×" accounts for both K and V tensors

def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * dtype_bytes

# Llama 3 8B: 32 layers, 8 KV heads (GQA!), head_dim=128
llama_8b = kv_cache_bytes(32, 8, 8192, 128)   # ~1.07 GB per request at 8K ctx

# Llama 3 70B: 80 layers, 8 KV heads, head_dim=128
llama_70b = kv_cache_bytes(80, 8, 8192, 128)  # ~2.68 GB per request at 8K ctx

# Qwen2.5 72B: 80 layers, 8 KV heads, head_dim=128
qwen_72b = kv_cache_bytes(80, 8, 131072, 128) # ~42.9 GB per request at 128K ctx

~1.07 GB

Llama 3 8B · 8K context · one request

~2.7 GB

Llama 3 70B · 8K context · one request

~43 GB

Qwen2.5 72B · 128K context · one request

60–80%

GPU memory wasted in naive systems

Note on GQA (Grouped Query Attention): Modern models use fewer KV heads than Q heads. Llama 3 8B has 32 query heads but only 8 KV heads. This is Grouped Query Attention (GQA), designed specifically to reduce KV cache size. Each KV head is shared across 4 Q heads. Without GQA, the KV cache would be 4× larger.
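The head sharing can be expressed as a simple index mapping. This sketch assumes consecutive query heads form a group, which is a common layout but not something the text above specifies; `kv_head_for` is a made-up helper:

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    group_size = num_q_heads // num_kv_heads   # 4 for 32 query heads and 8 KV heads
    return q_head // group_size

mapping = [kv_head_for(q, 32, 8) for q in range(32)]
print(mapping[:8])    # [0, 0, 0, 0, 1, 1, 1, 1]
print(mapping[-1])    # 7
```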

Why This Shapes Everything in LLM Serving

Understanding tensors and the KV cache explains design decisions across the entire LLM inference stack:

PagedAttention (vLLM)

vLLM's breakthrough was applying OS paging to KV cache memory. Instead of one contiguous block per request, KV cache is split into fixed-size pages (blocks of 16 tokens each). Pages don't need to be contiguous in physical GPU memory — a logical block table maps them, just like a CPU's page table maps virtual to physical addresses. This eliminates fragmentation and over-reservation, dropping memory waste from 60–80% to under 4%.
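The block-table idea can be sketched in a few lines. This is a toy model of the concept, not vLLM's implementation; the class names are invented, and only the 16-token block size comes from the text:

```python
BLOCK_SIZE = 16  # tokens per block

class BlockAllocator:
    """Hands out physical block ids from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

class Sequence:
    """One request; block_table maps logical block index to physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Grab a new block only when the current one is full (or for the first token).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self):
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens

allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(20):          # two requests generating tokens side by side
    a.append_token()
    b.append_token()

print(a.block_table)     # [7, 5]  (not contiguous, and that's fine)
print(b.block_table)     # [6, 4]
print(a.wasted_slots())  # 12
```

Memory is claimed one block at a time as tokens arrive, so the only waste is the unfilled tail of each request's last block.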

Quantization

Every quantization decision affects the KV cache directly. Dropping from BF16 to INT8 cuts KV cache memory in half, which roughly doubles the number of concurrent requests that fit in the VRAM left over after the weights. An FP8 KV cache gives the same 2× saving over BF16, since FP8 is also one byte per element. This is why FP8 inference on H100s is so impactful: it's not just the weights, it's the KV cache too.
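A back-of-the-envelope estimate shows the effect on concurrency. Every number here is an assumption for illustration: an 80 GB GPU, 16 GB of BF16 weights for an 8B model, and a full 8K-token KV cache per request with 8 KV heads:

```python
def max_concurrent_requests(vram_bytes, weights_bytes, kv_bytes_per_request):
    """How many full-length requests fit in the VRAM left after the weights."""
    return (vram_bytes - weights_bytes) // kv_bytes_per_request

VRAM = 80 * 10**9          # assumed 80 GB GPU
WEIGHTS = 16 * 10**9       # assumed 8B parameters at 2 bytes each

kv_elements = 2 * 32 * 8 * 8192 * 128   # K and V, 32 layers, 8 KV heads, 8K tokens
print(max_concurrent_requests(VRAM, WEIGHTS, kv_elements * 2))  # BF16 cache: 59
print(max_concurrent_requests(VRAM, WEIGHTS, kv_elements * 1))  # 1-byte cache: 119
```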

Context Length Scaling

KV cache grows linearly with context length, but the attention computation over the prompt is quadratic. Doubling context length doubles KV cache memory but quadruples attention compute. This is why long-context models (Llama 3.1 at 128K, Qwen2.5-1M at up to 1M tokens) are significantly harder to serve than 8K-context models: the weights stay the same size, but the memory required per request balloons.
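The two growth rates can be compared by counting elements. The defaults below are the Llama 3 8B figures used elsewhere in this article, and the score count assumes the full seq × seq matrix is formed for every head and layer:

```python
def kv_elements(seq_len, num_layers=32, num_kv_heads=8, head_dim=128):
    """Elements stored in the KV cache: linear in seq_len."""
    return 2 * num_layers * num_kv_heads * seq_len * head_dim

def attention_score_entries(seq_len, num_layers=32, num_heads=32):
    """Query-key scores computed over a full prompt: quadratic in seq_len."""
    return num_layers * num_heads * seq_len * seq_len

print(kv_elements(16384) / kv_elements(8192))                          # 2.0
print(attention_score_entries(16384) / attention_score_entries(8192))  # 4.0
```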

Speculative Decoding

A technique where a small "draft" model predicts several tokens ahead, and the big model verifies them all at once in a single forward pass. This works because the KV cache for the draft tokens is cheap (small model) and verification is batched. The speedup comes from generating multiple tokens with one expensive attention computation instead of one token per computation.
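The verification step can be illustrated with a simplified greedy variant: accept draft tokens while they match what the big model would have produced, then take the big model's token at the first disagreement. Production systems typically use a probabilistic acceptance rule instead; the function and token ids below are made up:

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of drafted tokens.

    target_tokens[i] is the big model's choice at position i given the
    prompt plus draft_tokens[:i], so it has len(draft_tokens) + 1 entries,
    all obtained from one forward pass.
    """
    accepted = []
    for i, drafted in enumerate(draft_tokens):
        if drafted != target_tokens[i]:
            # First disagreement: keep the big model's token and stop.
            accepted.append(target_tokens[i])
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted

print(verify_draft([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
print(verify_draft([5, 7, 9], [5, 7, 9, 4]))  # [5, 7, 9, 4]
```

Either way, one pass of the big model yields at least one token, and up to len(draft_tokens) + 1.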

Prefix Caching

If the system prompt is always the same, compute its K and V tensors once, save them in a shared prefix cache, and reuse that cache for every new request. This is the "no sharing" problem solved. vLLM implements this on top of PagedAttention: blocks that hold the shared prefix are referenced by multiple requests instead of being copied.
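One way to picture prefix caching is a lookup table keyed by the tokens of each full block plus everything before it, since a token's K and V depend on its whole prefix. The sketch below is a toy under that assumption, with invented names and token ids, not vLLM's actual data structure:

```python
BLOCK_SIZE = 16

class PrefixCache:
    """Maps each full block of a prompt (identified by its whole prefix) to a block id."""
    def __init__(self):
        self.blocks = {}     # prefix tuple -> block id
        self.computed = 0    # blocks whose K/V had to be computed

    def get_blocks(self, tokens):
        block_ids = []
        full = len(tokens) - len(tokens) % BLOCK_SIZE   # ignore the partial last block
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            if key not in self.blocks:
                self.blocks[key] = len(self.blocks)     # cache miss: compute and store
                self.computed += 1
            block_ids.append(self.blocks[key])
        return block_ids

system_prompt = list(range(48))          # 48 hypothetical token ids = 3 full blocks
cache = PrefixCache()
blocks_a = cache.get_blocks(system_prompt + [100, 101])
blocks_b = cache.get_blocks(system_prompt + [200])

print(blocks_a)         # [0, 1, 2]
print(blocks_b)         # [0, 1, 2]  (same blocks, reused)
print(cache.computed)   # 3
```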

Mental Model Summary

🧠 The Complete Picture

A tensor is just a multi-dimensional array of numbers. Shape tells you its dimensions; dtype tells you how many bytes each number takes.

Transformer attention uses three tensors per token per layer: Query (what am I looking for?), Key (what do I contain?), Value (what information do I carry?). Q is ephemeral. K and V must be saved.

The KV cache is those saved K and V tensors — stored in GPU VRAM so generation doesn't recompute them. It's the performance foundation of all LLM inference.

The KV cache is also the primary bottleneck: it's large (hundreds of MB to tens of GB per request), grows over time, and naive systems waste 60–80% of it through fragmentation and over-allocation. Every modern serving engine — vLLM, TensorRT-LLM, SGLang — is largely defined by how cleverly it manages KV cache memory.

When you read about quantization, PagedAttention, GQA, or prefix caching — you're reading about solutions to this one problem: KV tensors are expensive, and we need to fit more of them in GPU memory.

Quick Reference: Shapes You'll See in LLM Code

Variable	Meaning	Example (Llama 3 8B)
`B`	Batch size (concurrent requests)	4, 16, 64
`T` or `seq_len`	Sequence length (tokens)	1024, 8192
`D` or `hidden_dim`	Model width (embeddings)	4096
`H` or `num_heads`	Attention heads (Q)	32
`Hkv` or `num_kv_heads`	KV heads (fewer with GQA)	8
`Dh` or `head_dim`	Per-head embedding size	128
`L` or `num_layers`	Transformer layers	32
`V` or `vocab_size`	Number of possible tokens	128,256

Now when you read: "Every input token generates a key and value tensor that must remain in GPU memory throughout the generation process", you know exactly what that means: per layer, each token adds one K and one V tensor of shape [num_kv_heads, head_dim] in BF16, the cache grows by one such pair for every generated token, and the total is multiplied by the number of concurrent requests.
