NIPS 2017 · One of the Most Cited AI Papers of the Decade

What if a computer could read an entire sentence at once?

In 2017, eight researchers at Google introduced the Transformer — an architecture so powerful it became the backbone of ChatGPT, Google Translate, and virtually every modern AI system. Here's how it works, and why it changed everything.

By Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin · Published 2017

The World Before the Transformer

To understand why this paper mattered, you need to understand the frustrating bottleneck it solved.

Imagine reading a book — but you can only see one word at a time, and you must remember everything you've read so far in a single note you keep rewriting. By the time you reach the end of a long sentence, the details from the beginning have faded. That's essentially how Recurrent Neural Networks (RNNs) worked — the dominant AI architecture before the Transformer.

RNNs processed language one word at a time, like a conveyor belt. Each word got processed only after the previous one was done. This created two massive problems:

🐌 Painfully Slow

Because words were processed sequentially (one after another), you couldn't use modern GPUs to process multiple words simultaneously. Training took forever — weeks or months.

🧠 Forgetful

Information from the start of a sentence had to pass through every subsequent word to reach the end. Long-range dependencies — like a pronoun referring to a noun 20 words back — got lost.

Improvements like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks helped with the forgetting problem, but they couldn't fix the speed issue. The sequential bottleneck was fundamental.

Other researchers tried using Convolutional Neural Networks (CNNs), the technology behind image recognition, to process language. Models like ByteNet and ConvS2S could process words in parallel (fast!), but they had their own limitation: they could only "see" nearby words at each layer. To connect a word at position 1 with a word at position 100, you needed many stacked layers: the path length grew logarithmically with distance for ByteNet and linearly for ConvS2S.

The Transformer's key insight was that self-attention connects every word to every other word in a single step — a constant path length of just \(O(1)\), regardless of distance.

So the field needed something fundamentally different — a way for every word to talk to every other word, all at once. Enter: attention.

The Core Idea: Attention

Attention wasn't invented by this paper — it already existed as an add-on to RNNs. But this paper asked a radical question: what if attention was the only thing you needed?

Think of attention like a spotlight at a party. You're trying to understand what someone just said. Your brain doesn't replay the entire conversation from the start — it selectively focuses on the most relevant things said earlier. "Wait, she mentioned Paris earlier — that's what 'there' refers to!" That selective spotlight is attention.

In technical terms, attention works with three things — imagine them as three roles people play at a reference desk:

🔍 Query (Q) — "What am I looking for?"

The current word's question. When translating "chat" in a French sentence, the Query asks: "Which English words are most relevant to me right now?"

🏷️ Key (K) — "Here's my label"

Each word advertises what kind of information it contains. Like name tags at the party — they help the Query figure out who to pay attention to.

📦 Value (V) — "Here's my actual content"

Once you know who to pay attention to (via Query-Key matching), you pull their actual information — their Value — in proportion to how relevant they are.

Scaled Dot-Product Attention

The paper's specific attention formula is elegantly simple. In plain English: compare the Query with every Key (via a dot product), scale the result down so numbers don't explode, convert to percentages using softmax, then use those percentages to create a weighted blend of all Values.

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

QK^T → "How similar is my question to each word's label?" (dot product)

÷ √d_k → Scale down so large numbers don't push softmax into regions where learning stalls (d_k = 64 in this paper)

softmax → Convert raw scores into percentages that add up to 100%

× V → Blend the Values according to those percentages
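
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the helper names `softmax` and `attention` and the toy vectors are assumptions chosen for readability, and a real model would use batched matrix operations on a GPU.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (illustrative only)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # QK^T: one similarity score per key, then divide by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # x V: blend the value vectors according to the weights
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
    return outputs

# Toy example: 2 positions with d_k = d_v = 2 (made-up numbers)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each query matches its own key best, so each output row leans toward the corresponding value vector while still mixing in some of the other.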

When the key dimension \(d_k\) is large, the dot products between queries and keys get very big. Imagine adding up 64 random numbers: the sum has much more variance than a single number. Specifically, if the components of the query and key are independent with mean 0 and variance 1, the dot product has variance \(d_k\).

Large values push the softmax function into extreme territory where almost all the weight goes to one item and gradients (the signals used for learning) become tiny. Dividing by \(\sqrt{d_k}\) brings the variance back to 1, keeping softmax in its "sweet spot" where learning flows smoothly.
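
The variance claim can be checked with a quick simulation. The sketch below assumes query and key components drawn independently from a standard normal distribution (mean 0, variance 1); the function name, trial count, and seed are arbitrary choices for illustration.

```python
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate the variance of q . k for random q, k with unit-variance components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

raw_variance = dot_product_variance(64)
# Dividing every dot product by sqrt(64) divides the variance by 64
scaled_variance = raw_variance / 64
```

The raw estimate should land near 64 and the scaled one near 1, which is the effect the \(\sqrt{d_k}\) divisor is meant to have.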

Multi-Head Attention: Eight Spotlights at Once

A single attention mechanism is powerful, but it has a limitation — it can only focus on one type of relationship at a time. The paper's breakthrough was to run multiple attention mechanisms in parallel.

Imagine watching a movie with eight different critics. One notices the cinematography. Another tracks the plot. A third focuses on character arcs. A fourth catches foreshadowing. Each sees the same movie but notices different things. That's Multi-Head Attention — 8 parallel "heads," each learning to focus on different linguistic relationships.

Concretely, the model splits its 512-dimensional representations into 8 heads of 64 dimensions each (\(d_k = d_v = 512/8 = 64\)). Each head independently runs the full attention calculation, then their outputs are concatenated and projected back to 512 dimensions. This keeps computational cost roughly equal to a single full-sized attention head.

What different heads learn

The paper's appendix reveals fascinating behavior: some heads learn syntactic patterns (connecting verbs to their objects), others handle anaphora resolution (figuring out that "its" refers to "The Law"), and some track long-distance dependencies (linking "making" to "more difficult" across many intervening words).

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\] \[\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]

Each head has its own learned projection matrices (\(W_i^Q, W_i^K, W_i^V\)) that transform the shared input into that head's unique "perspective." The final output projection \(W^O\) combines all heads' findings into a unified representation. With \(h = 8\) heads, each operating on 64 dimensions, the total work is similar to one head with 512 dimensions.
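
The "similar total work" point can be checked with simple arithmetic on the projection matrices. The sketch below counts weights only and ignores biases; the variable names are illustrative, and the "single full-width head" is a hypothetical comparison rather than a configuration the paper describes building.

```python
d_model = 512
num_heads = 8
d_k = d_v = d_model // num_heads  # 64 dimensions per head

# Per-head projections W_i^Q, W_i^K, W_i^V each map d_model -> 64
per_head_params = d_model * d_k + d_model * d_k + d_model * d_v
all_heads_params = num_heads * per_head_params
output_projection_params = (num_heads * d_v) * d_model  # W^O

multi_head_total = all_heads_params + output_projection_params

# Hypothetical single head working at the full 512 dimensions,
# with the same four projection matrices
single_head_total = 4 * d_model * d_model

print(multi_head_total, single_head_total)  # 1048576 1048576
```

Splitting into heads changes how the 512 dimensions are grouped, not how many weights are involved.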

The Full Transformer Architecture

Now that you understand attention, let's see how the complete system is assembled.

The Transformer has two halves — an Encoder (reads the input) and a Decoder (generates the output) — each built from stacked identical layers.

The encoder reads the input sentence and builds a rich understanding of it. It has 6 identical layers, each containing two sub-layers:

Multi-Head Self-Attention

Every word looks at every other word in the input to understand context. "Bank" attends to "river" to know it means riverbank, not a financial institution.

Feed-Forward Network

After attention gathers context, this two-layer neural network (512 → 2048 → 512 dimensions) processes each word's enriched representation independently.

Each sub-layer also includes a residual connection (a shortcut that adds the input back to the output, preventing information loss) and layer normalization (keeping numbers in a healthy range). The formula: LayerNorm(x + Sublayer(x)).
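
Here is a rough sketch of the feed-forward sub-layer wrapped in LayerNorm(x + Sublayer(x)), shrunk to toy sizes (2 → 4 → 2 instead of 512 → 2048 → 512). The weights are made up, the ReLU between the two linear maps follows the paper's feed-forward definition, and the learned gain and bias that layer normalization normally carries are left out for brevity; all function names are illustrative.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU in between, applied to a single position."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

def sublayer_block(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual shortcut, then normalization."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Toy weights: one row per output unit (made-up values)
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
b2 = [0.0, 0.0]

x = [1.0, 3.0]
out = sublayer_block(x, lambda v: feed_forward(v, w1, b1, w2, b2))
```

The same `sublayer_block` wrapper would apply to an attention sub-layer; only the function passed in changes.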

The decoder generates the output one token at a time. It also has 6 identical layers, but with three sub-layers:

Masked Multi-Head Self-Attention

Like the encoder's self-attention, but with a crucial constraint: each position can only attend to itself and the positions before it. No peeking ahead! This "mask" ensures each word is predicted using only the words already generated.

Encoder-Decoder Attention

The decoder looks back at the encoder's output. Queries come from the decoder ("What French word should I generate next?"), while Keys and Values come from the encoder ("Here's the full understanding of the English input").

Feed-Forward Network

Same as the encoder — processes each position independently with two linear transformations.
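
The "no peeking ahead" rule in the decoder's masked self-attention can be sketched as follows. The sketch assumes, as the paper describes, that a position may attend to itself and to earlier positions, and that blocked scores are set to negative infinity before the softmax. The function names and example scores are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; blocked positions end up with weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, because a position can always see itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Attention weights for the second position: later positions are blocked
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[1])
```

Even though the last two raw scores are the largest, they receive zero weight because they belong to future positions.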

Positional Encoding — Teaching Word Order

Since attention processes all words simultaneously (no conveyor belt!), it has no built-in sense of word order. "Dog bites man" and "Man bites dog" would look the same without help.

It's like stamping each word with its seat number in a theater. The model adds a unique "position signal" to each word's representation — using sine and cosine waves at different frequencies. Position 1 gets one pattern, position 2 gets a different pattern, and so on.

Why sine waves? Because for any fixed offset \(k\), the encoding at position \(pos + k\) can be expressed as a linear function of the encoding at position \(pos\). This makes it easy for the model to learn relative positions — "this word is 3 positions after that one."
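
A small sketch makes the "linear function" property concrete. It uses the paper's definition, \(PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})\) and \(PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})\), with a toy dimension of 8 instead of 512. The `shift` helper is a hypothetical name for the fixed rotation that moves any encoding forward by \(k\) positions.

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def shift(pe, k, d_model=8):
    """Move an encoding k positions forward using only a fixed linear map.

    Each (sin, cos) pair is rotated by an angle that depends on k and the
    pair's frequency, but not on the original position.
    """
    out = []
    for i in range(0, d_model, 2):
        theta = k / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(theta) + c * math.sin(theta))
        out.append(c * math.cos(theta) - s * math.sin(theta))
    return out

shifted = shift(positional_encoding(5), 3)
direct = positional_encoding(8)
```

`shifted` and `direct` agree up to floating-point error, which is the sense in which position \(pos + k\) is a linear function of position \(pos\).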

Embeddings & Weight Sharing

Words are converted to 512-dimensional vectors via learned embeddings. The paper uses a clever trick: the same weight matrix is shared between the input embedding, the output embedding, and the final prediction layer — reducing parameters while tying related representations together. Embedding weights are multiplied by \(\sqrt{d_{\text{model}}} = \sqrt{512} \approx 22.6\) to scale them appropriately.

Why Self-Attention Wins

The paper rigorously compared self-attention against the alternatives on three criteria. Here's the scorecard:

🎮 Compare Layer Types

\(n\) = sequence length, \(d\) = model dimension (512), \(k\) = convolution kernel width, \(r\) = neighborhood size for restricted self-attention.

Layer Type	Complexity / Layer	Sequential Steps	Max Path Length
Self-Attention	\(O(n^2 \cdot d)\)	\(O(1)\) ✅	\(O(1)\) ✅
Recurrent (RNN)	\(O(n \cdot d^2)\)	\(O(n)\) ❌	\(O(n)\) ❌
Convolutional	\(O(k \cdot n \cdot d^2)\)	\(O(1)\) ✅	\(O(\log_k n)\) ⚠️
Restricted Self-Attn	\(O(r \cdot n \cdot d)\)	\(O(1)\) ✅	\(O(n/r)\) ⚠️

The key insight: for typical NLP tasks where sentence length \(n\) is smaller than the model dimension \(d\) (and it usually is — sentences of ~50 words vs. \(d=512\)), self-attention is both faster per layer and has the shortest path for learning long-range dependencies. A word can "talk to" any other word in just one step.
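
Plugging numbers into the table's complexity formulas shows where the crossover sits. These are big-O growth terms with constants dropped, so the sketch compares orders of magnitude rather than measured runtimes; the kernel width `k=3` and neighborhood size `r=10` are arbitrary illustrative values.

```python
def per_layer_cost(n, d, k=3, r=10):
    """Growth terms from the comparison table (constant factors ignored)."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
        "restricted_self_attention": r * n * d,
    }

costs = per_layer_cost(n=50, d=512)
print(costs["self_attention"])  # 1280000
print(costs["recurrent"])       # 13107200

# The two terms are equal exactly when n == d
crossover = per_layer_cost(n=512, d=512)
```

For a 50-word sentence the self-attention term is about ten times smaller than the recurrent one; the advantage disappears once sequences grow past the model dimension.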

As a bonus, the paper notes that attention patterns are interpretable — you can actually visualize what each head has learned, something quite rare in deep learning.

The architecture was elegant. But did it actually work? The researchers put it to the test on the most competitive machine translation benchmarks in the field.

The Results: Faster, Better, Cheaper

The Transformer was tested on two major machine translation benchmarks — English→German and English→French — using the WMT 2014 dataset, the gold standard for translation research.

28.4 BLEU on EN→DE (new record)

41.8 BLEU on EN→FR (new record)

3.5 days of training (8 GPUs)

+2.0 BLEU improvement over the best ensemble (EN→DE)

What's a BLEU score? It measures how closely a machine translation matches human translations, from 0 (gibberish) to ~100 (perfect). In practice, scores above 30 for English→German are very good. An improvement of 2+ BLEU points is considered a major leap — typical papers claimed fractions of a point.

How the Transformer Compared to Everything Else

The table below lists BLEU scores (translation quality) alongside training cost in FLOPs (computational expense). The Transformer hits the sweet spot: better AND cheaper.

Model	EN-DE BLEU	EN-FR BLEU	EN-DE FLOPs	EN-FR FLOPs
ByteNet	23.75	—	—	—
Deep-Att + PosUnk	—	39.2	—	1.0 × 10²⁰
GNMT + RL	24.6	39.92	2.3 × 10¹⁹	1.4 × 10²⁰
ConvS2S	25.16	40.46	9.6 × 10¹⁸	1.5 × 10²⁰
MoE	26.03	40.56	2.0 × 10¹⁹	1.2 × 10²⁰
GNMT + RL Ensemble	26.30	41.16	1.8 × 10²⁰	1.1 × 10²¹
ConvS2S Ensemble	26.36	41.29	7.7 × 10¹⁹	1.2 × 10²¹
Transformer (base)	27.3	38.1	3.3 × 10¹⁸	—
Transformer (big)	28.4	41.8	2.3 × 10¹⁹	—

Note: The base Transformer achieved 27.3 BLEU on EN→DE with only 3.3 × 10¹⁸ FLOPs. That's roughly 3× cheaper than ConvS2S, 7× cheaper than GNMT + RL, and 55× cheaper than the GNMT + RL Ensemble, while producing better results. The big model used 2.3 × 10¹⁹ FLOPs: on par with the single-model competitors and still far cheaper than the ensembles.
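
The cost ratios can be checked directly against the EN-DE FLOPs column of the table. The dictionary below simply copies those figures; the variable names are illustrative.

```python
# EN-DE training cost in FLOPs, copied from the table above
en_de_flops = {
    "ConvS2S": 9.6e18,
    "MoE": 2.0e19,
    "GNMT + RL": 2.3e19,
    "ConvS2S Ensemble": 7.7e19,
    "GNMT + RL Ensemble": 1.8e20,
}
transformer_base = 3.3e18

ratios = {name: flops / transformer_base for name, flops in en_de_flops.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the base Transformer's training cost")
```

The ratios run from about 2.9× (ConvS2S) through 7.0× (GNMT + RL) to 54.5× (GNMT + RL Ensemble).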

How It Was Trained

Great architectures need great training recipes. Here are the details that made the Transformer work:

📊 Data

EN→DE: 4.5M sentence pairs, ~37K shared BPE tokens
EN→FR: 36M sentence pairs, 32K word-piece tokens
Batches of ~25K source + 25K target tokens

⚡ Hardware

8 NVIDIA P100 GPUs (single machine)
Base model: 12 hours (100K steps, 0.4s/step)
Big model: 3.5 days (300K steps, 1.0s/step)

The "Warm-Up" Learning Rate

One subtle but important detail: the learning rate (how aggressively the model updates its weights) wasn't constant. It increased linearly for the first 4,000 steps, then decayed in proportion to the inverse square root of the step number.

Think of it like driving on an unfamiliar road. At first, you accelerate slowly because you don't know the terrain. Once you've got a feel for it, you speed up. Then you gradually ease off as you approach your destination to park precisely. The "warm-up" prevents the model from making wild, destructive updates before it has a stable sense of direction.
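
The schedule is short enough to write out. The paper gives it as \(lrate = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})\); the function name below is an illustrative choice, and `step` is assumed to start at 1.

```python
def learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warm-up followed by inverse-square-root decay (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (100, 2000, 4000, 8000, 100000):
    print(step, learning_rate(step))
```

The rate climbs in a straight line until step 4,000, where both branches of the `min` meet, and shrinks from there. Changing `warmup_steps` moves the peak: a shorter warm-up peaks earlier and higher.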

Regularization

Residual Dropout (P_drop = 0.1)

Randomly zeroes 10% of each sub-layer's output values during training (and of the summed embeddings and positional encodings), like a team that practices with random players sitting out. It forces every part of the network to be useful, preventing over-reliance on any single pathway.

Label Smoothing (ε_ls = 0.1)

Instead of telling the model "this word is 100% correct," it says "this word is 90% likely the right answer." This makes the model slightly less confident but more accurate overall, improving BLEU score at the cost of perplexity.
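
Label smoothing amounts to softening the training target. The sketch below uses one common variant, assumed here for illustration: the correct word keeps \(1 - \epsilon\) of the probability and the remaining \(\epsilon\) is spread evenly over the other vocabulary entries. The function name and the five-word vocabulary are made up.

```python
def smooth_labels(correct_index, vocab_size, epsilon=0.1):
    """Build a smoothed target distribution for one prediction."""
    off_value = epsilon / (vocab_size - 1)  # share for each wrong word
    target = [off_value] * vocab_size
    target[correct_index] = 1.0 - epsilon   # the "90% likely" right answer
    return target

target = smooth_labels(correct_index=2, vocab_size=5)
```

With a five-word toy vocabulary, the correct word gets 0.9 and each of the other four gets 0.025.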

What Matters Most? — The Ablation Study

The researchers systematically tweaked one thing at a time to understand which parts of the architecture actually matter. This is the scientific rigor that makes the paper trustworthy.

Finding: Not too few, not too many. 8 or 16 heads worked best.

A single attention head scored 24.9 BLEU (−0.9 from base). 16 heads matched the base at 25.8. But 32 heads dropped to 25.4. More heads means each head has fewer dimensions to work with (512/32 = 16), eventually hurting quality. The paper also found that reducing key dimension \(d_k\) independently (rows B) hurts quality, suggesting that the compatibility function needs enough capacity.
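
The trade-off is easy to tabulate: with the model width fixed at 512, every extra head shrinks the slice each head works with. The BLEU figures below are the dev-set numbers quoted in the paragraph above; the variable names are illustrative.

```python
d_model = 512
# Dev-set BLEU by number of heads, as quoted above
bleu_by_heads = {1: 24.9, 8: 25.8, 16: 25.8, 32: 25.4}

dims_per_head = {heads: d_model // heads for heads in bleu_by_heads}
for heads, bleu in bleu_by_heads.items():
    print(f"{heads:>2} heads -> {dims_per_head[heads]:>3} dims per head, BLEU {bleu}")
```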

Finding: Bigger models are better — but with diminishing returns.

Models ranged from 28M to 213M parameters. Going from 2 layers (36M params, 23.7 BLEU) to 8 layers (80M params, 25.5 BLEU) was a big jump. The "big" model with \(d_{\text{model}}=1024\), \(d_{ff}=4096\), 16 heads, and 213M parameters achieved the best score of 26.4 BLEU on the dev set. But even smaller models were competitive — the base model (65M params) hit 25.8 BLEU.

Finding: Dropout is essential. Without it, the model overfits and performance drops.

Configuration	BLEU (dev)
No dropout (P_drop=0.0)	24.6
Base (P_drop=0.1)	25.8
High dropout (P_drop=0.2)	25.5

Removing dropout entirely cost 1.2 BLEU points. Removing label smoothing while keeping dropout scored 25.3, still below the base. The big model used even higher dropout (0.3) for English→German, suggesting larger models need more regularization; the big English→French model used 0.1.

Finding: Sinusoidal and learned positional encodings perform nearly identically.

Positional Encoding	BLEU (dev)
Sinusoidal (chosen)	25.8
Learned embeddings	25.7

Almost no difference! The authors chose sinusoidal because it might generalize to longer sequences than seen during training — the wave patterns naturally extend, while learned embeddings stop at the maximum training length.

Beyond Translation: English Constituency Parsing

A great architecture should work on more than one task. To test generality, the researchers applied the Transformer to English constituency parsing — breaking sentences into their grammatical tree structure.

Imagine diagramming a sentence in grammar class: "The cat sat on the mat" → [S [NP The cat] [VP sat [PP on [NP the mat]]]]. This is constituency parsing — a structured task quite different from translation, with strong grammatical constraints.
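
To see that the bracket notation really encodes a tree, here is a small reader-side sketch that turns the example string into nested (label, children) tuples. It only illustrates the notation and is not the paper's parsing model; the function name and tuple layout are assumptions.

```python
def parse_brackets(text):
    """Parse '[LABEL child child ...]' into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    position = 0

    def parse_node():
        nonlocal position
        position += 1  # skip the opening "["
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != "]":
            if tokens[position] == "[":
                children.append(parse_node())  # nested phrase
            else:
                children.append(tokens[position])  # plain word
                position += 1
        position += 1  # skip the closing "]"
        return (label, children)

    return parse_node()

tree = parse_brackets("[S [NP The cat] [VP sat [PP on [NP the mat]]]]")
label, children = tree
```

The top-level node is the sentence (S) with two children: the noun phrase "The cat" and a verb phrase that nests the prepositional phrase "on the mat".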

Using a 4-layer Transformer (smaller than the translation model) with minimal task-specific tuning, the results were impressive:

The Transformer hit 91.3 F1 in the WSJ-only setting — beating all previous models except the specialized Recurrent Neural Network Grammar (91.7). With semi-supervised training on 17M extra sentences, it reached 92.7 F1, outperforming all semi-supervised approaches. This proved the architecture wasn't just a translation trick — it was genuinely general.

What Changed Because of This Paper

"Attention Is All You Need" didn't just set a few records — it rewired the entire field of AI.

The end of RNNs as kings

Within 2 years of this paper, virtually every state-of-the-art NLP model switched from RNNs to Transformers. The sequential bottleneck was gone forever.

BERT, GPT, and the foundation model era

BERT (2018) used the Transformer's encoder. GPT (2018–present) used the decoder. Together, they launched the era of large language models — ChatGPT, Claude, Gemini, and thousands more are all Transformer descendants.

Far beyond text

The paper's authors predicted it: Transformers now power image generation (DALL-E, Stable Diffusion), protein structure prediction (AlphaFold), audio generation, video understanding, robotics, and more. The architecture proved as general as they hoped.

Democratized training

The massive parallelizability of the Transformer made it possible to train on modern GPU clusters efficiently. Without this, the scaling revolution — from millions to billions to trillions of parameters — simply wouldn't have been possible.

"We are excited about the future of attention-based models and plan to apply them to other tasks … involving input and output modalities other than text … such as images, audio and video."

— The authors, 2017. They were right about everything.

Key Takeaways

Here's what you could confidently explain to someone else after reading this:

1. Before the Transformer, AI processed language word-by-word (like reading through a mail slot). The Transformer lets the model see the entire sentence at once, using a mechanism called self-attention.

2. Self-attention works via Queries, Keys, and Values — each word asks "what should I pay attention to?" and gets a weighted blend of all other words' information. Multi-head attention runs 8 of these in parallel, each learning different relationships.

3. The architecture has an Encoder (reads input) and Decoder (generates output), each with 6 layers of attention + feed-forward networks. Positional encodings (sine waves) tell the model about word order.

4. Results were stunning: 28.4 BLEU on English→German (beating all models, including ensembles, by 2+ points) and 41.8 BLEU on English→French, trained in just 3.5 days on 8 GPUs. The base model was roughly 3–55× cheaper to train than alternatives.

5. This paper's architecture became the foundation for virtually all modern AI — GPT, BERT, ChatGPT, DALL-E, AlphaFold, and thousands more. Its title wasn't just clever — attention really was all you needed.

The Model at a Glance
6 encoder layers · 6 decoder layers · 8 attention heads · d_model=512 · d_ff=2048 · 65M parameters (base) · 213M parameters (big)
