📄 NIPS 2017 · The paper that changed AI

Attention Is All You Need

How a revolutionary architecture called the Transformer replaced sequential processing with pure attention, launching the modern AI era.

28.4 BLEU (EN→DE) · 41.8 BLEU (EN→FR) · 3.5 days to train · 8 authors

Why did AI need a new architecture?

Before the Transformer, language AI was stuck processing words one at a time, like reading a book through a keyhole, sliding it along word by word.

🔗

Sequential Bottleneck

Recurrent Neural Networks (RNNs) processed words one after another. To understand the 100th word, you had to wait for all 99 before it. This made training painfully slow.

🧠

Forgetting Problem

By the time an RNN reached the end of a long sentence, it had often "forgotten" important information from the beginning, like a game of telephone.

โฑ๏ธ

Slow Training

Because each step depended on the previous one, you couldn't use modern GPUs efficiently. The sequential nature was fundamentally at odds with parallel hardware.

What if we could look at everything at once?

The Transformer's key insight: instead of reading words one-by-one, let every word look at every other word simultaneously. This is the self-attention mechanism.

๐ŸŒ

RNN (Before)

Process words sequentially โ€” one after another

Max Path Length
O(n)
Sequential Ops
O(n)
๐Ÿƒ

CNN

Use sliding windows โ€” limited view of context

Max Path Length
O(log n)
Sequential Ops
O(1)
โšก

Self-Attention

Every word sees every other word โ€” instant connections

Max Path Length
O(1)
Sequential Ops
O(1)

How Self-Attention Works

Each word creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?).

Q (Query) × K (Key) → scores → softmax → weights × V (Value) → Output

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
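The formula can be sketched in a few lines of NumPy. This is a toy, single-head version in which random vectors stand in for the learned Q/K/V projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # how well each query matches each key
    w = np.exp(scores - scores.max(-1, keepdims=True))  # row-wise softmax...
    w /= w.sum(-1, keepdims=True)                       # ...each row of weights sums to 1
    return w @ V                                        # weighted average of the values

# Toy example: 3 words, d_k = 4 (random vectors stand in for learned projections)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per word
```

Each output row is a weighted mix of all value vectors, which is exactly how every word "sees" every other word in one step.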

Interactive Attention Demo

(Interactive widget: clicking a word shows what it "pays attention" to; brighter = more attention.)

Eight Heads Are Better Than One

Instead of computing attention once, the Transformer splits it into 8 parallel "heads", each learning to focus on different types of relationships.

🔍 Syntactic Heads

Some heads learn grammatical structure, connecting subjects to verbs or tracking sentence clauses.

🔗 Reference Heads

Other heads resolve references, figuring out that "its" in "its application" refers to "The Law".

📍 Position Heads

Some heads attend primarily to nearby words, learning local phrase structure and word order.
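The splitting itself is simple. Here is a minimal NumPy sketch using the paper's shapes (d_model = 512 divided into h = 8 heads of 64 dimensions each); the small random matrices stand in for the learned projection weights:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h            # 64 dimensions per head
seq_len = 10
rng = np.random.default_rng(2)
x = rng.standard_normal((seq_len, d_model))

# Random stand-ins for the learned projections W_Q, W_K, W_V (one per head) and W_O
Wq = rng.standard_normal((h, d_model, d_k)) * 0.02
Wk = rng.standard_normal((h, d_model, d_k)) * 0.02
Wv = rng.standard_normal((h, d_model, d_k)) * 0.02
Wo = rng.standard_normal((h * d_k, d_model)) * 0.02

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

heads = []
for i in range(h):            # each head attends in its own 64-dim subspace
    Q, K, V = x @ Wq[i], x @ Wk[i], x @ Wv[i]
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

out = np.concatenate(heads, axis=-1) @ Wo   # concat the 8 heads, project back to d_model
print(out.shape)              # (10, 512)
```

Because each head works in a 64-dimensional subspace, the total cost stays close to a single 512-dimensional attention, while the heads are free to specialize.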

Inside the Transformer

The full architecture has an Encoder (reads the input) and a Decoder (generates the output), each made of 6 identical layers stacked on top of each other.

📝 Input Embedding + Positional Encoding
Convert words to vectors, add position info
🔮 Self-Attention Layer
Every position attends to every other position
⚙️ Feed-Forward Network
Two linear layers with ReLU activation
➕ Add & Normalize
Residual connections + layer normalization
🔄 Encoder-Decoder Attention
Decoder attends to encoder output
📤 Output Layer
Linear projection + Softmax → next word

Each part of the Transformer plays a specific role.

How does it know word order?

Since attention looks at everything simultaneously, the model has no inherent sense of order. Sinusoidal positional encodings are added to give each position a unique fingerprint.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   |   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Positional Encoding Heatmap

Each row is a position (0-49), each column is a dimension. The wave-like patterns give each position a unique signature.
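The grid in the heatmap comes directly from the two formulas. A short NumPy sketch that builds the same 50-position grid, assuming the paper's d_model = 512:

```python
import numpy as np

def positional_encoding(max_pos, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same angle)
    pos = np.arange(max_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)   # one row per position, as in the heatmap
print(pe.shape)                     # (50, 512)
```

Low dimensions oscillate quickly and high dimensions slowly, so each position gets a distinct combination of wave values: its fingerprint.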

Crushing the Competition

The Transformer didn't just match previous models: it blew past them while training faster and cheaper.

BLEU Scores: English → German Translation

Higher is better. The Transformer beat even ensembles of older models.

10-40×

Less training compute than competing models

3.5

Days to train on 8 GPUs, vs. weeks for competitors

+2.0

BLEU points better than previous best (including ensembles)

What made it work?

Scaled Dot-Product Attention

Dividing by √d_k prevents dot products from growing too large, keeping gradients stable during training.

Multi-Head Attention (h=8)

Running 8 attention heads in parallel, each with 64 dimensions instead of one head with 512, captures diverse relationships at similar cost.

Positional Encoding

Sinusoidal functions encode position, allowing the model to generalize to sequences longer than those seen during training.

Residual Connections + Layer Norm

Skip connections around every sub-layer prevent the vanishing gradient problem in deep stacks of 6 layers.

Warmup Learning Rate Schedule

Linearly increasing the learning rate for 4,000 steps, then decaying it; a recipe that became standard in modern AI.
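The paper's schedule (Section 5.3) fits in one line: the rate is d_model^-0.5 · min(step^-0.5, step · warmup^-1.5), with 4,000 warmup steps.

```python
def lrate(step, d_model=512, warmup=4000):
    # Warmup phase: step * warmup^-1.5 grows linearly and is the smaller term.
    # After `warmup` steps: step^-0.5 takes over, giving inverse-sqrt decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rises linearly for 4,000 steps, peaks, then decays with 1/sqrt(step):
assert lrate(2000) < lrate(4000) > lrate(8000)
```

The two terms cross exactly at step = warmup, which is why the peak lands at step 4,000.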

What happens when you change things?

The authors carefully tested what each component contributes. Here's what they found:

Impact of Number of Attention Heads

BLEU score on EN→DE dev set. Sweet spot around 8 heads; too few or too many hurts.

Impact of Model Size

Number of layers (N) vs. BLEU score. Deeper models generally perform better.

The paper that launched a revolution

The Transformer architecture became the foundation for virtually all modern AI systems.

🤖 GPT Series

OpenAI's GPT-1 through GPT-4 are all based on the Transformer decoder. ChatGPT descends directly from this paper.

🧪 BERT

Google's BERT used the Transformer encoder for bidirectional language understanding, revolutionizing NLP.

🎨 Diffusion Models

Image generators like DALL-E and Stable Diffusion build on attention and Transformer components for visual creation.

🧬 AlphaFold

DeepMind's protein structure prediction uses attention mechanisms inspired by this work to predict how proteins fold.

🎵 Audio & Music

Whisper (speech recognition), MusicLM, and other audio models all use Transformer architectures.

🌍 100k+ Citations

One of the most cited papers in all of computer science, fundamentally reshaping the field of artificial intelligence.