Eight researchers at Google asked a radical question: what if we threw out the entire playbook for how machines understand language — and replaced it with a single, beautifully simple idea?
ChatGPT, Google Translate, Siri, email autocomplete — all of these are built on a foundation called the Transformer. This is the paper that invented it. It is, by citation count, arguably the most influential AI paper ever written.
But to understand why the Transformer was revolutionary, you first need to understand what it replaced — and why the old approach was cracking under pressure.
Before the Transformer, the best language models were built on recurrent neural networks — systems like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units). These process text sequentially: word 1, then word 2, then word 3, and so on. Each word updates a hidden state — that "sticky note" — which is the model's running summary of everything so far.
This approach had two devastating problems:
1. No parallelism. Because you process word by word, you can't parallelize. Word 5 has to wait for Word 4. Word 100 has to wait for all 99 before it. On modern GPUs that thrive on parallel work, this was like using a 16-lane highway with only one car on it.
2. Information decay. By the time the model reaches word 50, the information from word 1 has been squeezed through 49 processing steps. Long-range relationships — like a pronoun referring to a noun from a paragraph ago — get lost. The "sticky note" can only hold so much.
The Transformer's radical move was simple: throw out the conveyor belt entirely. Instead of processing words one at a time, let every word look at every other word, all at once, and figure out which words are most important for understanding each other.
This mechanism is called self-attention (sometimes called "intra-attention"). It's the only core mechanism in the Transformer. No recurrence. No convolutions. Just attention.
The result? Three transformative advantages:
1. Speed: every word is processed simultaneously, so training time drops from weeks to days.
2. Direct connections: word 1 and word 100 are connected in a single step. No more information decay.
3. Quality: it set new records on translation benchmarks, beating even ensembles of older models.
For each word in a sentence, the Transformer creates three vectors:
1. A Query — what this word is looking for in other words.
2. A Key — what this word offers for other words to match against.
3. A Value — the actual information this word contributes once it's attended to.
The attention formula computes how much each word should attend to every other word:

\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]
In plain English:
1. Compare every query with every key (the \(QK^T\) part) — this produces a score showing how relevant each word is to every other word.
2. Scale down by dividing by \(\sqrt{d_k}\) — without this, the scores would get too extreme for large dimensions, making the model focus on just one word and ignore everything else.
3. Convert to percentages (softmax) — turn the raw scores into weights that add up to 1, like a probability distribution.
4. Weighted blend of values — each word's output is a custom cocktail of information from all the words it decided to pay attention to.
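The four steps above can be sketched in a few lines of NumPy — a minimal illustration of the formula, not the paper's actual implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of scaled dot-product attention.

    Q, K: (seq_len, d_k) arrays; V: (seq_len, d_v) array.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # steps 1-2: compare queries with keys, then scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # step 3: softmax — each row sums to 1
    return weights @ V, weights                           # step 4: weighted blend of values

# Three toy "words" with d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (3, 4)
assert np.allclose(w.sum(axis=-1), 1.0)                   # rows are probability distributions
```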
Visualized as highlights over an example sentence, each word attends more strongly to certain other words — the stronger the attention, the brighter the highlight. This is how self-attention connects distant, related words.
Instead of performing one big attention operation, the Transformer splits queries, keys, and values into h = 8 parallel "heads". Each head uses smaller dimensions (\(d_k = d_v = 64\) instead of the full \(d_{model} = 512\)), so the total computational cost stays roughly the same.
After all 8 heads compute their attention independently, their outputs are concatenated and projected back to the full dimension.
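The split-attend-concatenate pattern can be sketched as follows; the projection matrices here are random placeholders standing in for learned weights:

```python
import numpy as np

def multi_head_attention(X, h=8, d_model=512):
    """Sketch of multi-head attention: project into h heads of size
    d_model // h, attend in each head independently, then concatenate
    and project back to d_model."""
    d_k = d_model // h                      # d_k = d_v = 64 when h = 8
    rng = np.random.default_rng(0)
    # Per-head projections (random here, just to show the shapes)
    W_Q = rng.normal(size=(h, d_model, d_k))
    W_K = rng.normal(size=(h, d_model, d_k))
    W_V = rng.normal(size=(h, d_model, d_k))
    W_O = rng.normal(size=(h * d_k, d_model))

    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)       # softmax within each head
        heads.append(w @ V)                 # (seq_len, d_k) per head
    return np.concatenate(heads, axis=-1) @ W_O   # back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(10, 512))
out = multi_head_attention(X)
assert out.shape == (10, 512)
```

Because each head works in a 64-dimensional subspace rather than the full 512, the eight heads together cost roughly the same as one full-size attention.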
What each head learns to do:
The paper's attention visualizations (Figures 3–5 in the appendix) beautifully demonstrate this: different heads clearly specialize in different linguistic tasks — some tracking long-distance dependencies, others resolving pronouns, others following sentence structure.
Each word (technically, each token) is converted into a vector of 512 numbers — its embedding. These aren't random; they're learned during training so that similar words get similar vectors. The words "king" and "queen" would end up close to each other in this 512-dimensional space.
Key detail: The embedding weights are multiplied by \(\sqrt{d_{model}} = \sqrt{512} \approx 22.6\) to scale them up, and the same weight matrix is shared between the input embedding, output embedding, and the final prediction layer.
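A tiny NumPy sketch of this detail — the initialization scale here is an illustrative assumption, not the paper's:

```python
import numpy as np

d_model, vocab = 512, 37000
rng = np.random.default_rng(0)
# One weight matrix, shared between input embedding, output embedding,
# and the final pre-softmax projection
E = rng.normal(scale=d_model ** -0.5, size=(vocab, d_model))

token_ids = np.array([17, 4096, 2])            # a toy 3-token input
x = E[token_ids] * np.sqrt(d_model)            # scale up by sqrt(512) ≈ 22.6

logits = x @ E.T                               # same matrix reused for prediction
assert x.shape == (3, 512)
assert logits.shape == (3, 37000)
```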
Since the Transformer processes all words simultaneously (no sequential order), it has no built-in sense of word order. "Dog bites man" and "Man bites dog" would look the same!
The fix: add a unique positional encoding to each word's embedding. The authors use sine and cosine waves of different frequencies:

\[
PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)
\]
Each position gets a unique pattern of values — like a fingerprint. The clever use of sinusoids means the model can learn to figure out relative distances between words (e.g., "3 positions apart"), and potentially generalize to sequences longer than those seen during training.
Fun fact: The authors also tried learned positional embeddings and found nearly identical results (Table 3, row E). They chose sinusoids for the extrapolation benefit.
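The sinusoidal scheme is cheap to compute directly; here is a minimal NumPy version:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))"""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even columns: sine
    pe[:, 1::2] = np.cos(angles)                 # odd columns: cosine
    return pe

pe = positional_encoding(100)
assert pe.shape == (100, 512)
# Position 0: all sines are 0, all cosines are 1
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```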
The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers:
1. A multi-head self-attention mechanism.
2. A position-wise feed-forward network — two linear transformations with a ReLU in between, applied identically at every position.
Each sub-layer has a residual connection (a shortcut that adds the input to the output) followed by layer normalization. In formula: \(\text{LayerNorm}(x + \text{Sublayer}(x))\). Residual connections are like emergency exits — they let information flow directly through without being corrupted, making deep networks trainable.
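In code, the pattern looks like this — a bare-bones version that omits the learned gain and bias parameters a full layer norm would have:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_block(x, sublayer):
    """The paper's pattern: LayerNorm(x + Sublayer(x)).
    The `x +` is the residual connection — the "emergency exit"."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(10, 512))
out = sublayer_block(x, lambda h: h * 0.5)   # stand-in for attention/FFN
assert out.shape == x.shape
assert np.allclose(out.mean(-1), 0.0, atol=1e-5)   # normalized per position
```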
The decoder is similar but has three sub-layers per layer:
1. Masked multi-head self-attention — masked so each position can only attend to earlier positions (no peeking at words that haven't been generated yet).
2. Encoder-decoder attention — queries come from the decoder, while keys and values come from the encoder's output.
3. A position-wise feed-forward network.
Every sub-layer again uses residual connections + layer normalization.
The decoder's final output goes through a linear layer (projecting to vocabulary size) and then a softmax to produce probabilities over all possible next words. The word with the highest probability is chosen (or, during search, the top candidates are explored).
At inference time: the model uses beam search with a beam size of 4 and a length penalty α = 0.6. It generates words one at a time (auto-regressively), feeding each generated word back as input for the next step.
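A simplified version of that loop, using greedy decoding (beam size 1) rather than the paper's beam search, with a hypothetical `step_fn` standing in for the trained model:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=50):
    """Auto-regressive decoding sketch. Greedy here for simplicity;
    the paper uses beam search with beam size 4 and length penalty 0.6.

    step_fn(tokens) -> probabilities over the next token.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step_fn(tokens)
        next_id = int(np.argmax(probs))   # pick the most probable word
        tokens.append(next_id)            # feed it back in for the next step
        if next_id == eos_id:
            break
    return tokens

# Toy "model" over a 5-word vocabulary that always prefers token 4 (our EOS)
toy = lambda toks: np.array([0.1, 0.1, 0.1, 0.1, 0.6])
assert greedy_decode(toy, bos_id=0, eos_id=4) == [0, 4]
```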
The paper makes a careful comparison across three criteria that matter for practical language processing. Here's how the layer types stack up:
Maximum path length = the farthest two words in a sentence have to "travel" through the network to communicate. Shorter is better — shorter paths make long-range relationships easier to learn.
Minimum sequential operations = how many steps must happen one after another (can't be parallelized). Lower means more parallelization on modern GPUs.
Computational complexity per layer — how much total work is done. Self-attention's cost is O(n²·d), which is fast when n (sequence length) is smaller than d (dimension = 512), as is typical for sentences.
Note: For very long sequences where n > d, the paper suggests restricted self-attention — only attending to a local neighborhood of size r — which reduces complexity to O(r·n·d) but increases path length to O(n/r).
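A quick back-of-envelope comparison for a typical sentence makes the trade-off concrete:

```python
# Operation counts per layer (constants dropped), for a typical
# sentence of n = 50 tokens, d = 512 dimensions, local window r = 10
n, d, r = 50, 512, 10

self_attention  = n * n * d    # O(n^2 · d) = 1,280,000
recurrent       = n * d * d    # O(n · d^2) = 13,107,200
restricted_attn = r * n * d    # O(r · n · d) = 256,000

# Since n < d for ordinary sentences, full self-attention does
# roughly 10x less work per layer than a recurrent layer
assert self_attention < recurrent
```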
The details of training matter because they show just how efficient the Transformer was compared to what came before.
English→German: WMT 2014 dataset, ~4.5 million sentence pairs. Shared vocabulary of ~37,000 tokens (byte-pair encoding).
English→French: WMT 2014, ~36 million sentence pairs. 32,000 word-piece vocabulary.
Sentences batched by approximate length, ~25,000 source + 25,000 target tokens per batch.
Machine: 1 machine, 8 NVIDIA P100 GPUs.
Base model: 100K steps, ~0.4 seconds/step = 12 hours total.
Big model: 300K steps, ~1.0 seconds/step = 3.5 days total.
Compare this to competitors that trained for weeks on more GPUs!
The Transformer didn't just match existing models — it dominated them, while being dramatically cheaper to train. Here are the results on two major machine translation benchmarks.
WMT 2014 English-to-German (newstest2014)
BLEU score measures translation quality (higher = better). A 1-point improvement is significant; 2+ points is huge.
The Transformer (big) scores 28.4 BLEU — over 2 points above the previous best (including ensembles of multiple models). The base model alone (27.3) already beats every prior single model and every ensemble.
WMT 2014 English-to-French (newstest2014)
The Transformer (big) achieves 41.8 BLEU, a new single-model state-of-the-art, surpassing even the ConvS2S Ensemble (41.29). And it trained in just 3.5 days — a quarter of the cost of previous state-of-the-art.
Training Cost Comparison (FLOPs = floating point operations)
Lower is cheaper. Note that the costs span roughly two orders of magnitude.
| Model | BLEU (EN-DE) | Training FLOPs (EN-DE) |
|---|---|---|
| GNMT + RL | 24.6 | 2.3 × 10¹⁹ |
| ConvS2S | 25.16 | 9.6 × 10¹⁸ |
| MoE | 26.03 | 2.0 × 10¹⁹ |
| GNMT + RL Ensemble | 26.30 | 1.8 × 10²⁰ |
| ConvS2S Ensemble | 26.36 | 7.7 × 10¹⁹ |
| Transformer (base) | 27.3 | 3.3 × 10¹⁸ ⚡ |
| Transformer (big) | 28.4 | 2.3 × 10¹⁹ |
The base Transformer costs 3.3 × 10¹⁸ FLOPs — roughly 3× cheaper than ConvS2S and 7× cheaper than GNMT+RL — while producing better translations than either. The big model matches the cost of GNMT+RL but scores 3.8 BLEU points higher.
Great research doesn't just show that something works — it figures out which parts make it work. The authors systematically changed one component at a time and measured the effect. All results below are on the English-to-German dev set (newstest2013).
Keeping total computation constant (by adjusting dk and dv proportionally), they varied the number of heads:
Sweet spot: 8 heads (the default). Single-head attention (h=1) loses 0.9 BLEU — a meaningful drop. But 32 heads (dk=16 each) also degrades slightly. Too few heads means not enough perspectives; too many heads means each perspective is too narrow.
Varying the model dimensions and number of layers:
| Configuration | N | dmodel | dff | BLEU | Params (M) |
|---|---|---|---|---|---|
| Small | 2 | 512 | 2048 | 23.7 | 36 |
| Medium-small | 4 | 512 | 2048 | 25.3 | 50 |
| Base | 6 | 512 | 2048 | 25.8 | 65 |
| More layers | 8 | 512 | 2048 | 25.5 | 80 |
| Wider (dmodel=1024) | 6 | 1024 | 4096 | 26.0 | 168 |
| Big | 6 | 1024 | 4096 | 26.4 | 213 |
Yes, bigger is generally better. Cutting the layers from 6 down to 2 (N=2) costs 2.1 BLEU. Doubling the model width to dmodel=1024 helps, especially in the "big" configuration with higher dropout (0.3) and longer training (300K steps).
Interestingly, shrinking dmodel to 256 (with proportionally smaller dk=32) drops to 24.5 BLEU with only 28M parameters. The dff dimension also matters — changing it from 2048 to 1024 gives 25.4, while 4096 gives 26.2.
| Residual Dropout | Label Smoothing (εls) | PPL (dev) | BLEU (dev) |
|---|---|---|---|
| 0.0 | 0.1 | 5.77 | 24.6 |
| 0.2 | 0.1 | 4.95 | 25.5 |
| 0.1 (base) | 0.1 | 4.92 | 25.8 |
| 0.1 | 0.0 | 4.67 | 25.3 |
| 0.1 | 0.2 | 5.47 | 25.7 |
Removing dropout entirely (Pdrop=0.0) costs 1.2 BLEU points — the model overfits significantly. Label smoothing also helps BLEU despite hurting perplexity (the model becomes less confident on individual predictions but more accurate overall).
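Label smoothing is simple to implement; here is one common formulation (spreading the smoothing mass uniformly over the non-target words — variants differ slightly in how the mass is distributed):

```python
import numpy as np

def smooth_labels(target_id, vocab_size, eps=0.1):
    """Label smoothing with eps_ls = 0.1: keep 1 - eps on the true word
    and spread eps evenly across the rest of the vocabulary, so the
    model is never pushed toward full certainty."""
    t = np.full(vocab_size, eps / (vocab_size - 1))
    t[target_id] = 1.0 - eps
    return t

t = smooth_labels(target_id=2, vocab_size=5)
assert np.isclose(t.sum(), 1.0)       # still a valid distribution
assert np.isclose(t[2], 0.9)          # true class keeps 90% of the mass
```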
Reducing dk (attention key dimension): When dk is reduced from 64 to 16 (row B), BLEU drops from 25.8 to 25.1. This suggests that computing word-to-word compatibility is a hard problem that benefits from higher-dimensional comparisons.
Learned vs. sinusoidal positional encoding (row E): Replacing the sinusoidal positional encoding with learned positional embeddings produces BLEU of 25.7 vs. 25.8 — virtually identical. The authors chose sinusoids because they might generalize to longer sequences.
To test whether the Transformer was a one-trick pony or a genuinely versatile architecture, the authors applied it to English constituency parsing — with minimal task-specific tuning. They used a 4-layer Transformer with dmodel=1024, trained in two settings: WSJ-only, on the Wall Street Journal portion of the Penn Treebank (~40K training sentences), and semi-supervised, adding larger corpora of roughly 17 million sentences.
| Parser | Setting | F1 Score (WSJ §23) |
|---|---|---|
| Vinyals & Kaiser (2014) | WSJ only | 88.3 |
| Petrov et al. (2006) | WSJ only | 90.4 |
| Zhu et al. (2013) | WSJ only | 90.4 |
| Dyer et al. (2016) | WSJ only | 91.7 |
| Transformer (4 layers) | WSJ only | 91.3 |
| Zhu et al. (2013) | Semi-supervised | 91.3 |
| McClosky et al. (2006) | Semi-supervised | 92.1 |
| Vinyals & Kaiser (2014) | Semi-supervised | 92.1 |
| Transformer (4 layers) | Semi-supervised | 92.7 |
| Luong et al. (2015) | Multi-task | 93.0 |
| Dyer et al. (2016) | Generative | 93.3 |
With just 40K training sentences and no task-specific tuning, the Transformer (91.3 F1) nearly matches the best discriminative parser (91.7). In the semi-supervised setting, it achieves 92.7 — the best result among comparable methods, surpassing all previously reported semi-supervised parsers.
This was a powerful early signal that the Transformer wasn't just good at translation — it was a general-purpose architecture for sequence problems.
The positional encodings are sine and cosine waves at different frequencies. Picture the full encoding matrix as a heatmap: each row is a position, each column is a dimension, with warm colors (gold/red) for positive values and cool colors (blue/purple) for negative values. The low-frequency waves change slowly across positions, while the high-frequency waves oscillate rapidly. This gives each position a unique "fingerprint."
The Transformer uses an unusual learning rate strategy: ramp up quickly, then slowly decay. The length of the warmup period determines where the peak falls.
In formula: \(lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\; step \cdot warmup^{-1.5})\). The peak learning rate occurs exactly at the warmup step. After that, the rate decays as \(1/\sqrt{step}\). With \(d_{model} = 512\) and the paper's 4,000 warmup steps, the peak rate is about 0.0007.
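The schedule is a one-liner:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """The paper's schedule:
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Linear warmup up to `warmup` steps, then 1/sqrt(step) decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = transformer_lr(4000)
assert abs(peak - 7.0e-4) < 1e-5       # peak ≈ 0.0007 at the warmup step
assert transformer_lr(1000) < peak     # still warming up
assert transformer_lr(16000) < peak    # decaying after warmup
```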
| Parameter | Base model |
|---|---|
| Layers (N) | 6 |
| dmodel | 512 |
| dff | 2048 |
| Heads (h) | 8 |
| dk = dv | 64 |
| Dropout | 0.1 |
| Parameters | 65M |
| Training | 100K steps / 12h |
| Parameter | Big model |
|---|---|
| Layers (N) | 6 |
| dmodel | 1024 |
| dff | 4096 |
| Heads (h) | 16 |
| dk = dv | 64 |
| Dropout | 0.3 |
| Parameters | 213M |
| Training | 300K steps / 3.5d |
In their conclusion, the authors wrote: "We are excited about the future of attention-based models and plan to apply them to other tasks." That turned out to be perhaps the greatest understatement in AI history.
Here's what you could confidently explain to someone after reading this:
The Transformer replaced sequential processing (RNNs) with self-attention — a mechanism that lets every word in a sentence directly interact with every other word, all at once. This is faster to train, better at capturing long-range relationships, and produces superior results.
Running 8 parallel attention mechanisms (heads) lets the model simultaneously track different types of linguistic relationships — syntax, semantics, pronoun references, and more.
28.4 BLEU on English→German (+2 over prior best), 41.8 on English→French (new single-model SOTA). The base model trained in just 12 hours on 8 GPUs — a fraction of the cost of competitors.
The Transformer isn't just good at translation. With minimal modification, it matched or beat specialized parsers on English constituency parsing — hinting at its potential as a universal sequence-processing architecture.
This paper is the foundation of GPT, BERT, ChatGPT, Gemini, and essentially all modern large language models. The Transformer architecture became the backbone of the AI revolution that followed.