NIPS 2017 · Google Brain / Google Research

Attention Is All You Need

The paper that introduced the Transformer — a revolutionary neural network architecture that replaced recurrence with pure attention, forever changing AI.

Why did AI need a new architecture?

Before the Transformer, the best language models used Recurrent Neural Networks (RNNs). Think of RNNs like reading a book one word at a time, where you have to remember everything from previous words to understand the next one.

🐌

Painfully Sequential

RNNs process words one-by-one, like a single-lane highway. You can't start processing word 5 until you've finished words 1-4. This makes training very slow.

🧠

Forgetful Over Distance

By the time an RNN reaches word 50, it has often "forgotten" important details from word 1. Long-range dependencies are hard to learn.

⚡

Can't Use Modern GPUs Well

GPUs are built for massive parallelism — doing thousands of operations at once. Sequential processing wastes this power.

💡 The key insight: What if we could look at ALL words simultaneously, and let the model learn which words are important to each other? That's exactly what attention does.

What is Attention?

Attention is like a spotlight. When you're translating or understanding a sentence, not every word matters equally to every other word. Attention lets the model decide "which words should I focus on right now?"

Interactive: Click a word to see what it "attends" to

In the example above, when the model processes a word, it assigns attention weights to every other word. Higher weights mean "this word is more relevant to understanding me." This happens in parallel for all words — no sequential bottleneck!

Queries, Keys & Values

Attention works like a search engine inside the neural network. For every word, the model creates three vectors:

Q

Query — "What am I looking for?"

Each word generates a Query vector that represents what information it needs from other words.

K

Key — "What information do I contain?"

Each word also generates a Key vector that advertises what it has to offer.

V

Value — "Here's my actual content"

The Value vector contains the actual information that gets passed along when attention selects this word.

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

The Query asks a question, the Key-Query dot product measures relevance, and the Values provide the answer — weighted by how relevant each word is.
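The formula above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the 3-word, 4-dimensional random inputs are made up for the example.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how relevant is each key to each query?
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                  # weighted mix of values

# Toy example: 3 words, each with a 4-dimensional Q, K, and V vector
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
```

Each row of `w` is one word's attention distribution over all words, and `out` is the corresponding relevance-weighted blend of Value vectors.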

Eight Perspectives Are Better Than One

Instead of computing attention once, the Transformer does it 8 times in parallel with different learned projections. Each "head" can focus on a different type of relationship.


💡 Why multi-head? One head might learn grammatical relationships (subject-verb), another might learn coreference (pronouns to nouns), and another might learn positional proximity. Together, they capture rich, nuanced understanding.
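The multi-head idea can be sketched as follows. This is a minimal NumPy illustration under assumed toy dimensions (d_model = 16 rather than the paper's 512, so each of the 8 heads is only 2-dimensional), with hypothetical helper names.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=8):
    """Project X, attend separately in n_heads subspaces, concat, re-project."""
    n, d_model = X.shape
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_k, (h + 1) * d_k)     # this head's slice of the projections
        w = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))
        heads.append(w @ V[:, s])             # each head mixes values its own way
    return np.concatenate(heads, axis=-1) @ W_o  # merge the 8 perspectives

# Toy example: 5 tokens, d_model = 16, 8 heads of size 2
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
```

Because each head sees a different learned projection of the input, its softmax can settle on a different pattern of relationships.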

Inside the Transformer

The Transformer uses an encoder-decoder structure. The encoder reads the input, and the decoder generates the output. Both are stacks of 6 identical layers.

Architecture diagram. Encoder (left): input embedding + positional encoding, then ×6 layers of multi-head self-attention and feed-forward sublayers, each followed by Add & Norm, producing the encoder output. Decoder (right): output embedding + positional encoding, then ×6 layers of masked multi-head attention, multi-head attention over the encoder output, and feed-forward (Add & Norm ×3 per layer), followed by a Linear layer and Softmax producing output probabilities.

🔵 Encoder

Processes the full input sentence at once. Each of 6 layers applies self-attention (every word attends to every other word) then a feed-forward network. Residual connections and layer normalization keep gradients flowing.

🔴 Decoder

Generates output one token at a time. It has masked self-attention (can only look at earlier words), plus cross-attention to the encoder's output. Also 6 layers deep.
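The decoder's "can only look at earlier words" rule is just a mask applied to the attention scores before the softmax. A small NumPy sketch, using uniform stand-in scores (real scores would come from Q·Kᵀ):

```python
import numpy as np

n = 4                                             # a 4-token sequence
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future
scores = np.zeros((n, n))                         # stand-in attention scores
scores[mask] = -np.inf                            # block attention to future tokens
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
# Row 0 attends only to token 0; row 3 attends uniformly to tokens 0-3.
```

Setting blocked positions to −∞ means they get exactly zero weight after the softmax, so position i can never peek at positions after it.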

Teaching Position Without Recurrence

Since the Transformer processes all words simultaneously, it has no built-in sense of word order. The solution? Add a unique positional signal to each word's embedding using sine and cosine waves of varying frequencies.

Each row = a position, each column = a dimension. Colors show sine/cosine values. The pattern is unique for every position, giving the model order information.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   |   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
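The two formulas above translate directly into code. A minimal NumPy sketch (the function name is an assumption for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) even dimension indices
    angles = pos / (10000 ** (i / d_model))  # one wavelength per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)            # one unique row per position
```

Each row of `pe` is added to the corresponding word's embedding, so two copies of the same word at different positions enter the network with distinguishable vectors.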

Why Self-Attention Wins

The paper compares self-attention to recurrent and convolutional layers across three critical dimensions:

Maximum path length measures how many steps a signal must travel between any two words. Self-attention connects every word to every other word in one step (O(1)), while RNNs need O(n) steps, which makes it much harder for them to learn relationships between distant words.

Sequential operations measures the minimum steps that must happen one-after-another (can't be parallelized). Self-attention needs only O(1) sequential operations, whereas RNNs need O(n) — meaning Transformers can fully exploit parallel GPU hardware.

Computational complexity per layer: Self-attention is O(n²·d) while recurrent layers are O(n·d²). For typical sentence lengths where n < d (e.g., n≈50, d=512), self-attention is actually cheaper.
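The n < d claim is easy to check with back-of-envelope arithmetic, using the example figures above:

```python
# Operation counts for one layer at n = 50 tokens, d = 512 dimensions
n, d = 50, 512
self_attention_ops = n * n * d   # O(n²·d): d-dim dot product for every word pair
recurrent_ops = n * d * d        # O(n·d²): a d×d matrix multiply at every step
print(self_attention_ops, recurrent_ops)  # 1,280,000 vs 13,107,200
```

At these sizes self-attention does roughly 10× fewer operations per layer, and all of them can run in parallel.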

Record-Breaking Performance

The Transformer achieved state-of-the-art results on machine translation while training dramatically faster than competing models.

28.4 BLEU

English→German: Exceeded the best ensemble models by +2 BLEU. A huge jump.

41.0 BLEU

English→French: New single-model state-of-the-art, trained in just 3.5 days on 8 GPUs.

~10× cheaper to train

The base Transformer used 3.3×10¹⁸ FLOPs vs 1-2×10²⁰ for comparable models. Massively more efficient.

The Warmup Learning Rate

The paper introduced a distinctive learning rate schedule: linearly increase for the first 4,000 steps, then decay proportionally to the inverse square root of the step number. This "warmup" prevents early training instability.

lr = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5))
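The schedule is a one-liner in code (a sketch; the function name is a hypothetical helper, with the paper's base settings as defaults):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then inverse-square-root decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly to a peak at step 4000, then decays as 1/sqrt(step)
rates = [transformer_lr(s) for s in (1000, 4000, 40000)]
```

During warmup the `step · warmup_steps^(-1.5)` term is smaller, giving linear growth; after step 4000 the `step^(-0.5)` term takes over and the rate decays.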

What Matters Most?

The authors systematically varied components to understand what drives performance:

Number of Heads

8 heads is the sweet spot. Single-head loses 0.9 BLEU; 32 heads also degrades slightly.

Key Dimension (dk)

Reducing dk hurts quality, suggesting that computing query-key compatibility needs sufficient capacity.

Model Size

Bigger models = better results. Going from d_model = 256 to 1024 improved BLEU from 24.5 to 26.0.

Dropout

Critical for regularization. Without dropout, the model overfits and loses ~1.2 BLEU.

Positional Encoding

Learned vs. sinusoidal embeddings perform nearly identically, but sinusoidal may generalize to longer sequences.

Why This Paper Changed Everything

With over 100,000 citations, "Attention Is All You Need" is arguably the most influential AI paper of the decade. The Transformer architecture became the foundation for:

🤖

GPT Series

OpenAI's GPT-1 through GPT-4 are decoder-only Transformers. ChatGPT, DALL-E, and Codex all descend from this paper.

📚

BERT & Beyond

Google's BERT uses the encoder side to revolutionize search, question answering, and NLP tasks.

🎨

Vision Transformers

ViT, DINO, and Stable Diffusion apply Transformers to images, proving the architecture transcends text.

🧬

Scientific Discovery

AlphaFold 2 uses Transformers to predict protein structures. The architecture now powers breakthroughs across science.