What a Transformer Actually Is
The transformer is a neural network architecture designed to model sequences — text, code, audio, images, proteins — by learning which parts of a sequence are relevant to each other. It was introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google Brain, and it has since displaced nearly every other architecture in natural language processing.
At its core, a language model built on transformers does one thing: predict the next token. Given a sequence of tokens [t₁, t₂, …, tₙ], the model outputs a probability distribution over the vocabulary for what tₙ₊₁ should be. This is called autoregressive next-token prediction, and it is deceptively simple. The full emergent intelligence of GPT-4, Claude, and Gemini is the result of doing this single task billions of times, at massive scale, on vast amounts of human-generated text.
The training loop looks like this: take a document from the training corpus, slice it into token windows, ask the model to predict each next token given all preceding tokens, compute the cross-entropy loss between the predicted distribution and the actual next token, then backpropagate the gradients through the model weights. Repeat this for trillions of tokens. At convergence, the model has learned rich statistical representations of language, reasoning, code, and world knowledge — not because anyone explicitly programmed those things, but because predicting text well requires them.
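The loss computation at the heart of this loop can be sketched numerically. This is a minimal illustration, not a real training loop: a hypothetical model emits logits over a toy five-token vocabulary, and we compute the cross-entropy loss against the actual next token.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, target_id):
    """Cross-entropy between the predicted distribution and the true next token."""
    probs = softmax(logits)
    return -np.log(probs[target_id])

# Toy vocabulary of 5 tokens; made-up logits the model emits for position n+1
logits = np.array([2.0, 0.5, -1.0, 0.1, 1.2])
loss_good = next_token_loss(logits, target_id=0)  # model favored the right token
loss_bad = next_token_loss(logits, target_id=2)   # model disfavored the right token
assert loss_good < loss_bad  # lower loss when probability mass is on the true token
```

Training minimizes this loss averaged over every position in every document, which is what drives the gradients described above.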
What makes this work at scale is the transformer's key architectural choices: self-attention (every token can directly attend to every other token), residual connections (gradients flow cleanly through deep stacks), and parallelizability (unlike RNNs, all tokens can be processed simultaneously on GPU hardware). These three properties unlock the ability to train models with hundreds of billions of parameters on thousands of GPUs — something no prior architecture could do efficiently.
Tokenization and Embeddings
Before any math happens, raw text must be converted into a form the model can process: a sequence of integers. This is the job of the tokenizer.
Modern transformers use Byte Pair Encoding (BPE) subword tokenization. BPE starts with individual bytes or characters and iteratively merges the most frequently co-occurring pairs into new tokens. The result is a vocabulary of 32,000 to 128,000 tokens that covers common English words as single tokens, rarer words as multiple subword tokens, and arbitrary bytes for languages and code not well-represented in training data.
For example, the word "tokenization" might be split into ["token", "ization"], while "cat" stays as a single token ["cat"]. A Python function like def calculate_attention might tokenize as ["def", " calc", "ulate", "_att", "ention"]. The exact splits depend on the vocabulary learned during tokenizer training.
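The core BPE training step can be sketched in a few lines of plain Python. This is a deliberately simplified toy (real tokenizers operate on bytes and apply pre-tokenization rules): count adjacent symbol pairs across a tiny corpus, then merge the most frequent pair into a new token.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, words pre-split into characters
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
pair = most_frequent_pair(words)   # ('l','o') and ('o','w') tie at 14; ties break by insertion order
words = merge_pair(words, pair)
```

Tokenizer training repeats this merge step tens of thousands of times; the learned merge list is then replayed at inference time to split new text.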
Once we have token IDs (integers), each token is looked up in an embedding matrix of shape (vocab_size, d_model). For the original transformer, d_model = 512, meaning each token becomes a 512-dimensional dense floating-point vector. This embedding matrix is learned during training — the model figures out that tokens with similar meanings end up with similar vectors. The famous analogy result king − man + woman ≈ queen was first demonstrated with word2vec embeddings, and similar geometric structure emerges in learned transformer embedding spaces.
For a batch of B sentences each with T tokens, the embedding lookup produces a tensor of shape (B, T, d_model). This is the raw input to the transformer — a stack of T vectors, each of dimension d_model, representing one token's initial meaning. Everything else the transformer does is about enriching these vectors with context.
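Mechanically, the lookup is just integer indexing into the embedding matrix. A minimal numpy sketch, using made-up small sizes for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 8          # toy sizes; real models use e.g. 128000 x 4096
embedding = rng.normal(size=(vocab_size, d_model))

# A batch of B=2 sequences, each T=4 token IDs long
token_ids = np.array([[12, 7, 512, 3],
                      [99, 99, 0, 41]])
x = embedding[token_ids]               # fancy indexing performs the lookup
assert x.shape == (2, 4, 8)            # (B, T, d_model)
```

Note that the two occurrences of token 99 get identical vectors at this stage; it is the rest of the transformer (and positional encoding, next) that makes them differ.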
Positional Encoding
Here's a subtle problem with the embedding approach: the embedding of token "cat" at position 3 is identical to the embedding of "cat" at position 47. The transformer's self-attention mechanism — which we'll cover next — operates on the set of token vectors without any inherent notion of order. But word order is crucial to meaning. "The dog bit the man" and "The man bit the dog" contain the same tokens, but mean very different things.
The solution is positional encoding — adding a position-dependent signal to each token embedding so the model can distinguish position 3 from position 47.
The original "Attention Is All You Need" paper used a deterministic sine/cosine encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where pos is the token's position in the sequence and i is the dimension index. This creates a unique vector for each position, and the sinusoidal pattern ensures that the model can generalize to sequence lengths longer than those seen during training — a useful property for tasks involving very long documents.
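The encoding is easy to compute directly. Here is a numpy sketch following the formula above:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# Position 0 is [0, 1, 0, 1, ...]; every other position gets a distinct vector
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```

The resulting (seq_len, d_model) matrix is simply added to the token embeddings before the first layer.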
Modern models have largely moved to Rotary Position Embedding (RoPE), introduced in the RoFormer paper (2021) and adopted by LLaMA, Mistral, Qwen, and most frontier models. Instead of adding a fixed vector to the embedding, RoPE rotates the Query and Key vectors in attention by an angle proportional to their position. This encodes relative position directly into the attention scores: the attention between token at position 5 and position 10 naturally captures that they are 5 positions apart, regardless of where in the sequence they appear.
RoPE has better extrapolation properties than absolute positional encodings, which is one reason modern models can be extended to very long contexts (32K, 128K, even 1M tokens) with appropriate fine-tuning. ALiBi (Attention with Linear Biases) is another alternative that adds a linear bias to attention scores based on distance, requiring no position embeddings at all.
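The rotation at the heart of RoPE can be sketched for a single vector. This toy version (a simplification of the RoFormer formulation) rotates consecutive component pairs of a Query or Key by position-dependent angles; the key property is that the dot product of two rotated vectors depends only on their relative offset.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) component pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per component pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Attention score depends only on relative distance: (5, 10) matches (105, 110)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 10)
s2 = rope_rotate(q, 105) @ rope_rotate(k, 110)
assert np.isclose(s1, s2)
```

That invariance is exactly what the prose above means by "relative position directly in the attention scores."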
Self-Attention
Self-attention is the heart of the transformer. It is the mechanism that allows every token in a sequence to directly attend to every other token — capturing long-range dependencies that were impossible or very difficult for prior architectures like RNNs and LSTMs.
The intuition: for each token, we want to compute a new representation that is a weighted blend of all other tokens in the sequence, where the weights reflect how relevant each other token is to the current one. For the sentence "The animal didn't cross the street because it was too tired," the word "it" should attend strongly to "animal" — the model needs to associate "it" with the entity it refers to. Self-attention learns these associations from data.
The Q, K, V Mechanism
For each token embedding, we compute three vectors via learned linear projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain / what can I offer?"
- Value (V): "What information do I actually provide?"
Each is computed by multiplying the token embedding matrix X by learned weight matrices W_Q, W_K, W_V:
Q = X · W_Q
K = X · W_K
V = X · W_V
Where X has shape (T, d_model) and each weight matrix has shape (d_model, d_k). For the original transformer, d_k = 64.
The attention score between token i and token j is computed as the dot product of their Query and Key vectors: Q_i · K_j. Dot product is high when two vectors point in similar directions — meaning token i's "what I'm looking for" aligns with token j's "what I contain."
We compute all pairwise dot products at once using matrix multiplication, then divide by √d_k to prevent extremely large values that would make the softmax saturate (producing near-one-hot distributions that cause vanishing gradients):
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
The softmax turns the raw scores into a probability distribution that sums to 1. The result is a weighted sum of the Value vectors — each token's new representation is a blend of all other tokens' values, weighted by how much attention was paid to them.
Back to our "animal/it" example: when computing the new representation for "it," the attention weights will be high for "animal" and lower for "street," "cross," and other tokens. The resulting Value-weighted sum encodes the fact that "it" refers to the animal — without any explicit coreference logic, just learned from data.
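Putting the pieces together, scaled dot-product attention is only a few lines of numpy. The weight matrices here are random stand-ins for what training would learn:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (T, T) pairwise relevance scores
    weights = softmax(scores)          # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 32, 8
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
assert out.shape == (T, d_k)
assert np.allclose(weights.sum(axis=-1), 1.0)
```

Each row of `weights` tells you how much each token attended to every other token, which is exactly the quantity visualized in attention heatmaps.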
Multi-Head Attention
A single self-attention computation captures one "type" of relationship between tokens. But language is rich — we want the model to simultaneously capture syntactic relationships, semantic relationships, coreference, and long-range discourse structure. Multi-head attention runs multiple self-attention operations in parallel, each in a lower-dimensional subspace, to capture different kinds of relationships.
The original transformer uses h = 8 attention heads. With d_model = 512 and 8 heads, each head operates in a d_k = d_model / h = 64-dimensional subspace. This is key: rather than running one massive attention with 512 dimensions, we run 8 independent attentions with 64 dimensions each. The total computational cost is similar, but the model gains representational diversity.
Each head i has its own learned projection matrices W_Q^i, W_K^i, W_V^i that project the 512-dim embeddings down to 64-dim subspaces. The heads learn different projections, so they end up attending to different kinds of relationships:
- Head 1 might specialize in subject-verb agreement
- Head 2 might track coreference (which pronouns refer to which nouns)
- Head 3 might capture positional proximity
- Heads 4-8 might learn more abstract semantic patterns
After all 8 heads compute their outputs (each of shape (T, d_k)), the results are concatenated along the feature dimension to get a tensor of shape (T, h·d_k) = (T, 512). This concatenated output is then projected back to d_model using a final learned weight matrix W_O of shape (512, 512):
MultiHead(Q, K, V) = Concat(head₁, …, head₈) · W_O
where headᵢ = Attention(Q·W_Q^i, K·W_K^i, V·W_V^i)
This design is elegant: by forcing each head to work in a 64-dimensional subspace instead of the full 512, we prevent any single head from "monopolizing" the attention mechanism. The projection W_O at the end allows the model to combine the different relationship types discovered by each head into a single coherent representation.
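The concat-and-project pattern can be sketched directly, again with random stand-in weights (the small 0.05 scale just keeps the toy activations tame):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Run h independent attention heads in d_k-dim subspaces, concat, project."""
    h = W_Q.shape[0]
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)                    # each head output: (T, d_k)
    return np.concatenate(heads, axis=-1) @ W_O      # (T, h*d_k) -> (T, d_model)

rng = np.random.default_rng(0)
T, d_model, h = 6, 512, 8
d_k = d_model // h                                   # 64, as in the original paper
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) * 0.05 for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) * 0.05
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
assert out.shape == (T, d_model)
```

Production implementations compute all heads in one batched matrix multiply rather than a Python loop, but the math is identical.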
Encoder-Decoder Architecture
The original "Attention Is All You Need" transformer was designed for sequence-to-sequence tasks like machine translation — taking a sentence in French and producing the equivalent in English. This requires an encoder (to understand the input) and a decoder (to generate the output), each consisting of a stack of 6 identical layers.
The Encoder
The encoder processes the entire input sequence simultaneously (bidirectionally) — every token can attend to every other token. This is ideal for understanding tasks because context flows freely in both directions. A token at position 10 can attend to tokens at positions 1, 5, 15, and 20 without restriction.
Each of the 6 encoder layers contains two sub-layers:
- Multi-head self-attention — letting tokens attend to each other
- Position-wise FFN — a feedforward network applied independently to each token position
Each sub-layer is wrapped with a residual connection and layer normalization (more on that in the next section).
The Decoder
The decoder generates the output sequence one token at a time. It has three sub-layers per layer:
- Masked multi-head self-attention — tokens can only attend to previously generated tokens (causal masking prevents the model from "cheating" by looking at future output tokens)
- Cross-attention — each decoder token attends to the encoder's output, allowing the decoder to "look at" the full input while generating each output token
- Position-wise FFN
The cross-attention is where the magic of seq2seq happens: the decoder's Query vectors come from the decoder's own representations, but the Key and Value vectors come from the encoder's output. So when translating "Le chat mange du poisson," and the decoder is generating "fish," it can attend strongly to the encoder's representation of "poisson" — directly linking source and target language tokens.
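Causal masking, used in the decoder's first sub-layer (and in every decoder-only model), is implemented by setting the scores for future positions to −∞ before the softmax, so those positions receive exactly zero weight. A minimal sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))                   # stand-in for Q.K^T / sqrt(d_k)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = future
scores = np.where(mask, -np.inf, scores)
weights = softmax(scores)

# Token 0 can only attend to itself; no token attends to any future position
assert np.allclose(weights[0], [1, 0, 0, 0, 0])
assert np.allclose(np.triu(weights, k=1), 0.0)
```

Because exp(−∞) = 0, the masked entries vanish after the softmax and each row renormalizes over the allowed (past and present) positions only.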
FFN, LayerNorm, and Residuals
After multi-head attention, each token's representation passes through a position-wise feed-forward network (FFN). This is a small MLP applied independently and identically to each token position — the same weight matrices are used at every position, but the computation at position 3 doesn't interact with position 7.
The original FFN design is a simple two-layer network with a ReLU activation:
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
Critically, the inner dimension is expanded: W₁ has shape (512, 2048) and W₂ has shape (2048, 512). The 4× expansion (512 → 2048 → 512) gives the FFN enough capacity to act as a "memory" that stores factual associations. Research has shown that much of a model's factual knowledge lives in the FFN weights — the attention layers find relevant context, and the FFN retrieves associated facts.
Modern models often use SwiGLU or GeGLU activations instead of ReLU, which provide smoother gradients and better empirical performance. The expansion ratio also varies — LLaMA-3 uses approximately 8/3× instead of 4× when using SwiGLU due to the gating mechanism adding an extra matrix.
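The original ReLU variant with the 4× expansion is a one-liner in numpy; the weights here are random placeholders for what training would learn:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
T, d_model, d_ff = 6, 512, 2048                       # 4x expansion as in the paper
x = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
out = ffn(x, W1, b1, W2, b2)
assert out.shape == (T, d_model)   # same shape in and out; positions never mix
```

Because the same W₁ and W₂ are applied at every position, the FFN adds per-token capacity without any cross-token communication; that job belongs to attention.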
Residual Connections
Both the attention sub-layer and the FFN sub-layer are wrapped with residual connections (also called skip connections). Instead of passing the output directly to the next layer, the original input is added back:
output = LayerNorm(x + Sublayer(x))
Residual connections solve a deep learning training problem: as networks get deeper, gradients become very small by the time they propagate back to early layers (the vanishing gradient problem). By adding x to the sublayer output, there's always a direct gradient path back through the residual stream. This is why 6-layer and even 96-layer transformer stacks can be trained effectively.
Layer Normalization
Layer normalization normalizes the activations across the feature dimension (not the batch dimension like batch norm). For a vector x of dimension d_model, LayerNorm computes the mean and variance of all d_model values in that vector, normalizes them to zero mean and unit variance, then applies learned scale (γ) and shift (β) parameters. This stabilizes training by preventing activation values from growing too large or small as they pass through many layers.
Modern models (LLaMA, Mistral) apply layer norm before the sublayer (Pre-LN) rather than after (Post-LN as in the original paper). Pre-LN training is more stable and allows for higher learning rates.
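LayerNorm and the Pre-LN residual pattern can be sketched together. The sublayer here is a random linear map standing in for attention or the FFN:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector across the feature dimension, then scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d)) * 3 + 5        # deliberately off-center activations
gamma, beta = np.ones(d), np.zeros(d)

normed = layer_norm(x, gamma, beta)
assert np.allclose(normed.mean(axis=-1), 0.0, atol=1e-6)  # zero mean per token
assert np.allclose(normed.var(axis=-1), 1.0, atol=1e-3)   # ~unit variance per token

# Pre-LN residual pattern: x + Sublayer(LayerNorm(x))
W_sub = rng.normal(size=(d, d)) * 0.1      # stand-in for an attention/FFN sublayer
out = x + layer_norm(x, gamma, beta) @ W_sub
assert out.shape == x.shape
```

Note the identity path: `x` passes through the addition untouched, which is the direct gradient route that makes very deep stacks trainable.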
Why Transformers Scale
One of the most remarkable properties of the transformer architecture is how well it scales: more parameters + more data + more compute = better performance, with no obvious ceiling in sight. This is not obvious — prior architectures like RNNs and LSTMs did not exhibit clean scaling behavior.
Three architectural properties drive transformer scalability:
1. Parallel Compute
Unlike recurrent networks, which process tokens sequentially (leaving GPUs' parallelism across the sequence largely unused during training), transformers process all tokens in a sequence simultaneously during training. A sequence of 2,048 tokens can be processed in a single forward pass with all attention computations running in parallel on GPU tensor cores. This means training efficiency scales with hardware — doubling the number of GPUs roughly doubles training throughput.
2. Long-Range Dependencies
In an RNN, information from the beginning of a long sequence must pass through every intermediate hidden state to reach the end. By step 1,000, information from step 1 is heavily diluted. In a transformer, every token has a direct attention connection to every other token — there is no information bottleneck from distance. A token at position 1,000 can attend directly to position 1 with full fidelity.
3. Scaling Laws
Kaplan et al. (2020) and Hoffmann et al. (2022, the "Chinchilla" paper) characterized how transformer performance scales with model size (N parameters), dataset size (D tokens), and compute budget (C FLOPs). Key findings:
- Loss decreases as a smooth power law with each of N, D, and C
- For a fixed compute budget, there is an optimal allocation between model size and data — you should scale both equally
- The Chinchilla paper showed that most large models (including the original GPT-3) were significantly undertrained — a 70B model trained on 1.4T tokens outperforms a 280B model trained on the same compute budget with fewer tokens
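A widely used rule of thumb from the scaling-law literature is that training cost is roughly C ≈ 6·N·D FLOPs. Under that approximation, the Chinchilla comparison above works out as follows:

```python
# Rule-of-thumb training compute: C ~ 6 * N (parameters) * D (tokens)
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

chinchilla = train_flops(70e9, 1.4e12)        # 70B params on 1.4T tokens
# Spending the same budget on a 280B model leaves far fewer tokens:
tokens_280b = chinchilla / (6 * 280e9)
assert abs(chinchilla - 5.88e23) / 5.88e23 < 1e-9   # ~5.9e23 FLOPs
assert abs(tokens_280b - 3.5e11) / 3.5e11 < 1e-9    # only ~350B tokens
```

Same compute, a quarter of the training data per parameter: that data starvation is what the Chinchilla paper identified as "undertraining."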
Emergent Abilities
One of the most surprising findings from scaling language models is the existence of emergent abilities — capabilities that appear suddenly at certain scale thresholds, rather than improving gradually with scale. A model at 10B parameters might score near-chance on a benchmark, while a 100B model scores 80%+ on the same benchmark — the capability seemingly "jumps into existence."
The term was popularized by Wei et al. (2022) in "Emergent Abilities of Large Language Models," which documented dozens of capabilities — arithmetic, analogical reasoning, chain-of-thought, multi-step code generation — that showed this discontinuous scaling behavior. GPT-3's 175B parameter scale appeared to be a critical threshold for many of these jumps.
Why Does This Happen?
The leading explanation is that many tasks require multiple sub-capabilities that all need to be present simultaneously. For example, multi-step arithmetic requires: understanding the problem statement, decomposing it into steps, performing each calculation, tracking intermediate results, and combining them. Each of these sub-capabilities might improve gradually with scale, but the final task score is near-zero until all of them are present — creating the appearance of a sudden jump.
Another factor is that standard benchmarks measure discrete accuracy (right or wrong). A model that is "almost right" scores the same as one that is completely wrong. So gradual improvement in the underlying capability shows up as a sudden jump in accuracy once the model crosses the threshold for getting the final step right consistently.
Chain-of-Thought as an Example
Perhaps the most influential emergent ability is chain-of-thought reasoning: when prompted to "think step by step," large models (≥100B parameters) dramatically improve on math and reasoning benchmarks by generating intermediate reasoning steps before the final answer. This behavior is essentially absent in models below a critical size threshold, regardless of prompting strategy.
Architectural Variants
The original transformer was a full encoder-decoder. Since 2017, three major architectural variants have emerged, each optimized for different tasks:
| Variant | Examples | Attention Type | Best For |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DeBERTa | Bidirectional (sees all tokens) | Classification, NER, embeddings, retrieval |
| Decoder-only | GPT, LLaMA, Mistral, Claude | Causal (left-to-right only) | Text generation, chat, code, reasoning |
| Encoder-Decoder | T5, BART, Flan-T5 | Bidirectional encoder + causal decoder | Translation, summarization, Q&A |
Encoder-Only (BERT)
BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder stack with full bidirectional attention. During pretraining, BERT uses masked language modeling (MLM) — randomly masking ~15% of tokens and predicting them from context. Because BERT sees the full context in both directions, it produces richer contextual representations than decoder-only models for classification and retrieval tasks. BERT-style models dominate tasks like sentence similarity, named entity recognition, and the embedding models used in RAG systems.
Decoder-Only (GPT)
GPT and its successors use only the decoder with causal masking — each token can only attend to preceding tokens. This makes them natural autoregressive generators. The GPT training objective is simply next-token prediction, which scales remarkably well. The vast majority of frontier models today (GPT-4, Claude, LLaMA, Gemini) are decoder-only architectures. The causal masking means the full input context is processed in a single forward pass, but generation happens one token at a time.
Encoder-Decoder (T5)
T5 (Text-to-Text Transfer Transformer) frames every NLP task as a text-to-text transformation: "Translate English to French: The cat sat on the mat" → "Le chat s'est assis sur le tapis." The encoder processes the input and produces context representations; the decoder generates the output token by token while attending to the encoder's output via cross-attention. T5 excels at tasks requiring full understanding of a fixed input (like summarization or translation) followed by generation.
Real Limitations
The transformer's dominance doesn't mean it's perfect. Several hard limitations constrain how transformers can be deployed, especially at scale:
O(n²) Attention Complexity
The core problem: computing all-pairs attention scores requires comparing every token to every other token. For a sequence of n tokens, that's n² comparisons. Per layer, the materialized attention matrix takes O(n²) memory per head, and computing the scores costs O(n²·d_model). At n=1,000, that's 1 million attention scores. At n=100,000 (a long document), that's 10 billion per head — consuming hundreds of gigabytes of GPU memory across heads and layers just for the attention matrices.
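The quadratic growth is easy to see with back-of-envelope arithmetic. Assuming fp16 scores (2 bytes each) and a naive implementation that materializes the full n×n matrix:

```python
def attn_matrix_bytes(n_tokens, n_heads=1, bytes_per_score=2):
    """Memory to materialize the n x n attention score matrix, per layer (fp16)."""
    return n_tokens ** 2 * n_heads * bytes_per_score

# One head, one layer:
assert attn_matrix_bytes(1_000) == 2_000_000                  # 2 MB: trivial
assert attn_matrix_bytes(100_000) == 20_000_000_000           # 20 GB for ONE head
# With 32 heads, 640 GB per layer -- hence FlashAttention's tiling trick
assert attn_matrix_bytes(100_000, n_heads=32) == 640_000_000_000
```

A 100× longer sequence costs 10,000× the attention memory, which is why naive long-context attention is a non-starter.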
This is why "supporting 128K context" is an engineering achievement, not just a hyperparameter choice. Techniques like FlashAttention (fused CUDA kernels that tile the attention computation to avoid materializing the full n×n matrix in HBM) are critical for making long contexts practical.
KV Cache Growth
During inference, a decoder-only model generates one token at a time. For each new token, it needs the Key and Value vectors for all previously generated tokens. Rather than recomputing these from scratch at every step, serving systems cache them — the KV cache. But this cache grows linearly with sequence length and is proportional to n_layers × n_heads × d_head × n_tokens. For a 70B model with a 32K context window, the KV cache alone can consume 8-16 GB of GPU memory — often more than the model weights for large batches.
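The cache size follows directly from the proportionality above. The shapes below are assumed but representative of a 70B-class model: 80 layers, grouped-query attention with 8 KV heads of dimension 128, fp16 values, and a factor of 2 for storing both K and V:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_val=2):
    """2 tensors (K and V) per layer, each of shape (n_kv_heads, n_tokens, d_head)."""
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_val

# Assumed 70B-class shapes: 80 layers, GQA with 8 KV heads of dim 128, fp16
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, n_tokens=32_768)
assert cache == 10_737_418_240   # ~10.7 GB for a single 32K-token sequence
```

Multiply by the batch size and the cache, not the weights, quickly becomes the binding memory constraint in high-throughput serving.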
Hallucinations
Transformers are statistical next-token predictors. They do not "know" facts in the way a database knows facts — they approximate the distribution of text in their training data. When asked about a fact not well-represented in training data, or asked to be more specific than their training allows, they will generate plausible-sounding but incorrect text. This is not a bug in the training process but a fundamental property of the objective: maximizing likelihood of observed text doesn't require ground truth. Retrieval augmentation (RAG), fine-tuning, and tool use are engineering mitigations, but not solutions to the underlying issue.
What Comes Next
The O(n²) attention bottleneck has motivated significant research into architectures that maintain transformer-level quality while reducing complexity:
Mamba / State Space Models
Mamba (Gu & Dao, 2023) is a selective state space model that processes sequences in O(n) time and O(1) memory (per step), using a recurrent computation that is hardware-efficient on modern GPUs. Unlike prior SSMs, Mamba's "selective" mechanism allows it to focus on relevant tokens, giving it some of the discriminative power of attention. Mamba-2 and hybrid models (combining Mamba layers with sparse attention) like NVIDIA's Nemotron-Nano 4B show that the two approaches are complementary — Mamba for long-range recall, attention for precise context lookup.
Linear Attention
Linear attention variants (Performers, RetNet, RWKV, GLA) reformulate attention to avoid the O(n²) computation by using kernel approximations or alternative formulations. The tradeoff is typically a reduction in in-context learning ability — linear attention models struggle with tasks that require precise retrieval of specific earlier tokens from long contexts. Hybrid approaches that mix linear attention with periodic full-attention layers are an active research area.
MIT PaTH Attention
MIT's Paged Transient Hierarchical (PaTH) Attention (2025) proposes a hierarchical approach: compress older context into coarser summaries while maintaining high resolution for recent tokens. This mirrors how human working memory operates — detailed recall for recent events, compressed gist for older ones. Early results suggest PaTH can match full attention quality at a fraction of the compute for very long documents.
🔮 The Architecture Horizon
The transformer is not going away. Its properties — parallel training, strong scaling laws, long-range attention — are too valuable to abandon. But the next generation of architectures will likely be hybrids: transformer-style attention for precise context retrieval, combined with state space models or linear attention for efficient long-range processing. The question is not "what replaces the transformer" but "what do you add to it?"
The most exciting frontier: architectures that can process million-token contexts efficiently, enabling true "read the whole codebase / read the whole document" reasoning without the engineering heroics that today's systems require.
References
- Vaswani, A., et al. (2017). Attention Is All You Need. arxiv.org/abs/1706.03762
- Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv.org/abs/1810.04805
- Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). arxiv.org/abs/2005.14165
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arxiv.org/abs/2001.08361
- Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arxiv.org/abs/2203.15556
- Wei, J., et al. (2022). Emergent Abilities of Large Language Models. arxiv.org/abs/2206.07682
- Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arxiv.org/abs/2104.09864
- Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arxiv.org/abs/2205.14135
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arxiv.org/abs/2312.00752
- Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). arxiv.org/abs/1910.10683
Published March 21, 2026. Part of the ThinkSmart.Life deep-dive series on AI fundamentals.