The paper that introduced the Transformer — the architecture behind GPT, BERT, and virtually every modern AI system. No recurrence, no convolution — just attention.
Before the Transformer, language models relied on recurrent neural networks (RNNs) that processed text one word at a time — like reading a book strictly left to right, never able to skip ahead.
RNNs must process tokens one after another. You can't parallelize them — each step depends on the previous one. This makes training painfully slow.
RNNs struggle with long-range dependencies. By the time they reach the end of a long sentence, they've often "forgotten" the beginning.
The Transformer processes all words simultaneously using "attention" — letting every word directly look at every other word, no matter how far apart.
Imagine reading a sentence. When you see the word "it," your brain instantly looks back to figure out what "it" refers to. That's attention — letting each word "look at" other words to understand context.
Self-Attention means each word in a sentence computes how relevant every other word is to it. The word "bank" might attend strongly to "river" (riverbank) or "money" (financial bank) — the attention mechanism figures out the right context automatically.
Click any word to see what it "attends to" — thicker lines mean stronger attention.
The core computation is elegant and simple. For each word, we compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I give out?).
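The Query/Key/Value computation above can be sketched in a few lines. This is a minimal NumPy illustration of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, with random stand-ins for the learned projections — a sketch, not the paper's full implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how well does each query match each key?
    # softmax over the keys: each row becomes a probability distribution
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of values

# 3 toy "words", each with d_k = 4 dimensions
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

The √d_k scaling keeps the dot products from growing too large, which would push the softmax into regions with vanishingly small gradients.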
Instead of computing attention once, the Transformer does it 8 times in parallel — each "head" can learn to focus on different types of relationships (grammar, meaning, position, etc.).
Each head learns a different attention pattern. Click to see simulated patterns.
With 8 heads and dmodel = 512, each head works on 64 dimensions. The results are concatenated and projected back to 512 dimensions. The total computation is the same as single-head attention with full dimensionality — but the model learns richer representations!
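The split-concatenate-project flow can be sketched directly. Here each head projects the 512-dim input down to its own 64-dim subspace, attends there, and the 8 outputs are concatenated back to 512 dims and passed through a final projection. The random matrices stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h                     # 512 / 8 = 64 dims per head
x = rng.normal(size=(10, d_model))     # 10 tokens

def softmax(z):
    z = np.exp(z - z.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(h):
    # random stand-ins for the learned projections W_Q, W_K, W_V
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv   # each head sees its own 64-dim subspace
    head_outputs.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

concat = np.concatenate(head_outputs, axis=-1)  # back to (10, 512)
Wo = rng.normal(size=(d_model, d_model))        # final output projection W_O
out = concat @ Wo
print(out.shape)  # (10, 512)
```

Because each head runs at d_model/h dimensions, 8 heads at 64 dims cost roughly the same as one head at 512 dims.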
The Transformer follows an encoder-decoder pattern. The encoder reads the input, and the decoder generates the output one token at a time.
Each encoder layer has two sub-layers: multi-head self-attention and a position-wise feed-forward network.
Each sub-layer uses a residual connection (x + sublayer(x)) followed by layer normalization.
Each decoder layer has three sub-layers: masked multi-head self-attention (so a position cannot peek at future tokens), encoder-decoder attention over the encoder's output, and a position-wise feed-forward network.
dmodel = 512 everywhere · 8 attention heads · Feed-forward inner dimension = 2048
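The "Add & Norm" pattern and the feed-forward sub-layer can be sketched with the dimensions above. This is a simplified illustration: the learned gain/shift of layer normalization and all bias terms are omitted for brevity, and the weights are random stand-ins:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each token's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    # residual connection followed by layer norm: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                   # paper's dimensions
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
W2 = rng.normal(size=(d_ff, d_model)) * 0.02

def feed_forward(x):
    # position-wise FFN: the same two linear layers applied at every position
    return np.maximum(0, x @ W1) @ W2       # ReLU in between

x = rng.normal(size=(10, d_model))
y = add_and_norm(x, feed_forward)
print(y.shape)  # (10, 512)
```

The residual path means each sub-layer only has to learn a *correction* to its input, which makes deep stacks of layers much easier to train.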
Since the Transformer has no recurrence (no "time steps"), it has no idea what order the words are in! The solution: add sinusoidal position signals to the input embeddings.
Visualization of positional encodings — each row is a position, each column a dimension. Colors represent sine/cosine values (hover for details).
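The sinusoidal signals in that visualization follow the paper's formulas — PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) — which can be computed directly:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dims get sine
    pe[:, 1::2] = np.cos(angles)                # odd dims get cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)     # (50, 512)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 → [0. 1. 0. 1.]
```

Each dimension oscillates at a different wavelength (from 2π up to 10000·2π), so every position gets a unique fingerprint, and nearby positions get similar ones.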
Self-attention connects any two positions in O(1) operations — recurrence needs O(n) sequential steps, and a stack of dilated convolutions needs O(log_k n) layers.
| Layer Type | Complexity / Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| ⭐ Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent (RNN) | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
Key insight: The maximum path length for self-attention is O(1) — meaning any two words can directly communicate in a single layer. For RNNs, a signal must pass through O(n) steps, making long-range dependencies much harder to learn.
The paper introduced a distinctive learning rate schedule: linearly increase for the first 4,000 steps, then decay proportional to the inverse square root of the step number.
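That schedule is a one-line formula: lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)). A small sketch, using the paper's d_model = 512 and 4,000 warmup steps:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# linear warmup up to step 4000, then inverse-square-root decay
print(transformer_lr(1))       # near zero at the start
print(transformer_lr(4000))    # the peak, ~0.0007
print(transformer_lr(100000))  # decayed well below the peak
```

Both branches of the `min` agree exactly at step 4,000, so the warmup ramp hands off smoothly to the decay curve.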
The Transformer demolished existing benchmarks on machine translation — achieving higher quality while being dramatically faster and cheaper to train.
English→German score, beating the best previous results — including ensembles — by over 2 BLEU
English→French score, new single-model state-of-the-art
Training time on 8 GPUs — a fraction of the cost of competing models
The authors systematically varied components to understand what matters. Here's what they found:
Single-head attention drops 0.9 BLEU, and too many heads (32) also hurt quality. 8 heads strikes the right balance.
Increasing dmodel from 512→1024 and dff from 2048→4096 improved results, at the cost of more parameters.
Removing dropout hurts significantly. The big model uses Pdrop = 0.3 and label smoothing ε = 0.1.
Sinusoidal positional encodings perform nearly identically to learned ones — and can generalize to longer sequences.
Published in 2017, "Attention Is All You Need" didn't just improve translation — it laid the foundation for the entire modern AI revolution.
OpenAI's GPT-1 through GPT-4 are all based on the Transformer decoder. ChatGPT? Pure Transformer.
Google's BERT uses the Transformer encoder for understanding language — powering search and NLP worldwide.
Vision Transformers (ViT), DALL-E, AlphaFold — the architecture spread to images, proteins, music, and more.