NIPS 2017 — Landmark Paper

Attention Is All You Need

The paper that introduced the Transformer — the architecture behind GPT, BERT, and virtually every modern AI system. No recurrence, no convolution — just attention.

The Problem

Why Did We Need a New Architecture?

Before the Transformer, language models relied on recurrent neural networks (RNNs) that processed words one at a time, like reading a book one word at a time with no way to skip ahead.

🐌

Slow & Sequential

RNNs must process tokens one after another. You can't parallelize them — each step depends on the previous one. This makes training painfully slow.

🧠

Forgetful

RNNs struggle with long-range dependencies. By the time they reach the end of a long sentence, they've often "forgotten" the beginning.

The Transformer Fix

The Transformer processes all words simultaneously using "attention" — letting every word directly look at every other word, no matter how far apart.

The Core Idea

What is Attention?

Imagine reading a sentence. When you see the word "it," your brain instantly looks back to figure out what "it" refers to. That's attention — letting each word "look at" other words to understand context.

Self-Attention means each word in a sentence computes how relevant every other word is to it. The word "bank" might attend strongly to "river" (riverbank) or "money" (financial bank) — the attention mechanism figures out the right context automatically.

🔮 Interactive Attention Demo

(Interactive figure in the original page: clicking any word in the English source sentence or the German target sentence shows what it "attends to", with thicker lines for stronger attention.)
The Math

Scaled Dot-Product Attention

The core computation is elegant and simple. For each word, we compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I give out?).

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

  1. Q · Kᵀ: compare each query to all keys using dot products
  2. Scale: divide by √d_k to keep the dot products from growing too large
  3. Mask (optional): block future positions in the decoder
  4. Softmax: convert scores to probabilities (0–1)
  5. × V: take the weighted sum of values as the output
Multiple Perspectives

Multi-Head Attention

Instead of computing attention once, the Transformer does it 8 times in parallel — each "head" can learn to focus on different types of relationships (grammar, meaning, position, etc.).

Each head learns a different attention pattern.

With 8 heads and d_model = 512, each head works on 64 dimensions (d_k = d_v = 64). The results are concatenated and projected back to 512 dimensions. The total computational cost is similar to single-head attention with full dimensionality, but the model learns richer representations.
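The head-splitting arithmetic can be sketched as follows. The projection matrices W_q, W_k, W_v, W_o are random placeholders here, purely for illustration:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=8):
    """Split d_model across n_heads, attend per head, concat, project back."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads            # 512 / 8 = 64 in the paper
    Q, K, V = X @ W_q, X @ W_k, X @ W_v    # project the inputs
    outputs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        # scaled dot-product attention within this head's 64 dimensions
        scores = q @ k.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        outputs.append(w @ v)              # (seq_len, d_head) per head
    # concatenate the 8 heads back to d_model, then apply the output projection
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(1)
seq_len, d_model = 5, 512
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 512): same shape in, same shape out
```

Because each head sees only a 64-dimensional slice, the total work is roughly that of one full-width attention, while the 8 heads are free to specialize.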

Architecture

The Full Transformer

The Transformer follows an encoder-decoder pattern. The encoder reads the input, and the decoder generates the output one token at a time.

🔵 Encoder (×6 layers)

Each layer has two sub-layers:

  1. Multi-Head Self-Attention — every input word looks at every other input word
  2. Feed-Forward Network — two linear layers with ReLU, applied to each position independently

Each sub-layer uses a residual connection (x + sublayer(x)) followed by layer normalization.
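A minimal sketch of that sub-layer wrapper and the position-wise feed-forward network, with random weights standing in for trained ones (the paper also applies dropout to the sub-layer output before the residual add, omitted here for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_connection(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear layers with a ReLU in between."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff = 512, 2048                    # paper's dimensions
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
y = sublayer_connection(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(y.shape)  # (5, 512)
```

The residual path means each sub-layer only has to learn a correction to its input, which makes deep 6-layer stacks much easier to train.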

🔴 Decoder (×6 layers)

Each layer has three sub-layers:

  1. Masked Multi-Head Self-Attention — output words attend only to earlier positions (no peeking!)
  2. Encoder-Decoder Attention — decoder queries attend to encoder outputs
  3. Feed-Forward Network — same as encoder
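The "no peeking" mask in sub-layer 1 is just a lower-triangular matrix applied to the attention scores before the softmax; a small NumPy illustration:

```python
import numpy as np

# Causal mask for a 4-token sequence: position i may attend only to j <= i.
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Disallowed positions get -inf before the softmax, so their weight is
# exactly zero and no information leaks from future tokens.
scores = np.zeros((seq_len, seq_len))      # uniform scores, for clarity
masked = np.where(mask, scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights[1])  # [0.5 0.5 0.  0. ]: position 1 sees only positions 0 and 1
```

At training time this lets the decoder process the whole target sentence in parallel while still behaving as if it generated left to right.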
Input → Embeddings + Positional Encoding → Encoder Stack (×6 layers) → Decoder Stack (×6 layers) → Linear + Softmax → Output

d_model = 512 everywhere · 8 attention heads · feed-forward inner dimension d_ff = 2048

Position Matters

Positional Encoding

Since the Transformer has no recurrence (no "time steps"), it has no idea what order the words are in! The solution: add sinusoidal position signals to the input embeddings.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Visualization of positional encodings: each row is a position, each column a dimension; colors represent the sine/cosine values.
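The two formulas can be computed directly; a short NumPy sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)    # (50, 512)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 → [0. 1. 0. 1.]
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique fingerprint, and relative offsets correspond to fixed linear transformations of the encoding.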

Why It Works

Self-Attention vs. The Competition

Self-attention connects any two positions in a constant number of sequential operations, while recurrence needs O(n) sequential steps and a stack of dilated convolutions needs O(log_k n) layers to connect distant positions.

Layer Type        | Complexity / Layer | Sequential Ops | Max Path Length
⭐ Self-Attention  | O(n² · d)          | O(1)           | O(1)
Recurrent (RNN)   | O(n · d²)          | O(n)           | O(n)
Convolutional     | O(k · n · d²)      | O(1)           | O(log_k(n))

Key insight: The maximum path length for self-attention is O(1) — meaning any two words can directly communicate in a single layer. For RNNs, a signal must pass through O(n) steps, making long-range dependencies much harder to learn.

Training Details

The Warmup Learning Rate

The paper introduced a distinctive learning rate schedule: linearly increase for the first 4,000 steps, then decay proportional to the inverse square root of the step number.
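The paper's full formula is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), which can be sketched in a few lines:

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then inverse-square-root decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule rises linearly to its peak at step 4000 (about 7e-4 for
# d_model = 512), then decays proportionally to 1/sqrt(step).
print(f"step 1:      {lrate(1):.2e}")
print(f"step 4000:   {lrate(4000):.2e}")
print(f"step 40000:  {lrate(40000):.2e}")
```

The min() picks the warmup branch before step 4000 and the decay branch after it, so the two pieces meet exactly at the peak.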

Results

State-of-the-Art Translation

The Transformer demolished existing benchmarks on machine translation — achieving higher quality while being dramatically faster and cheaper to train.


28.4 BLEU

English→German score, beating ensembles of other models by +2.0 BLEU

41.0 BLEU

English→French score, new single-model state-of-the-art

3.5 Days

Training time on 8 GPUs — a fraction of the cost of competing models

Ablations

What Matters Most?

The authors systematically varied components to understand what matters. Here's what they found:

🎯

8 Heads is Sweet Spot

Single-head attention drops 0.9 BLEU. Too many heads (32) also hurts. 8 heads strikes the right balance.

📏

Bigger is Better

Increasing d_model from 512→1024 and d_ff from 2048→4096 improved results, at the cost of more parameters.

🛡️

Dropout is Critical

Removing dropout hurts significantly. The big model uses P_drop = 0.3 and label smoothing ε = 0.1.

🌊

Sinusoidal ≈ Learned

Sinusoidal positional encodings perform nearly identically to learned ones — and can generalize to longer sequences.

Impact

Why This Paper Changed Everything

Published in 2017, "Attention Is All You Need" didn't just improve translation — it laid the foundation for the entire modern AI revolution.

🤖 GPT Series

OpenAI's GPT-1 through GPT-4 are all based on the Transformer decoder. ChatGPT? Pure Transformer.

📖 BERT

Google's BERT uses the Transformer encoder for understanding language — powering search and NLP worldwide.

🎨 Beyond Text

Vision Transformers (ViT), DALL-E, AlphaFold — the architecture spread to images, proteins, music, and more.