📄 NIPS 2017 · The paper that changed AI

Attention Is All You Need

How a revolutionary architecture called the Transformer replaced sequential processing with pure attention, launching the modern AI era.

28.4 BLEU (EN→DE) · 41.8 BLEU (EN→FR) · 3.5 days to train · 8 authors

Why did AI need a new architecture?

Before the Transformer, language AI was stuck processing words one at a time, like reading a book through a keyhole, sliding it along word by word.

🔗

Sequential Bottleneck

Recurrent Neural Networks (RNNs) processed words one after another. To understand the 100th word, you had to wait for all 99 before it. This made training painfully slow.

🧠

Forgetting Problem

By the time an RNN reached the end of a long sentence, it had often "forgotten" important information from the beginning, like a game of telephone.

โฑ๏ธ

Slow Training

Because each step depended on the previous one, you couldn't use modern GPUs efficiently. The sequential nature was fundamentally at odds with parallel hardware.

What if we could look at everything at once?

The Transformer's key insight: instead of reading words one-by-one, let every word look at every other word simultaneously. This is the self-attention mechanism.

๐ŸŒ

RNN (Before)

Process words sequentially โ€” one after another

Max Path Length
O(n)
Sequential Ops
O(n)
๐Ÿƒ

CNN

Use sliding windows โ€” limited view of context

Max Path Length
O(log n)
Sequential Ops
O(1)
โšก

Self-Attention

Every word sees every other word โ€” instant connections

Max Path Length
O(1)
Sequential Ops
O(1)

How Self-Attention Works

Each word creates three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?).

Q (Query) × K (Key) → scores → softmax → weights × V (Value) → Output

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
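The formula can be sketched in a few lines of NumPy. This is a toy, single-head version in which random vectors stand in for the learned Q/K/V projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # how well each query matches each key
    w = np.exp(scores - scores.max(-1, keepdims=True))  # row-wise softmax...
    w /= w.sum(-1, keepdims=True)                       # ...each row of weights sums to 1
    return w @ V                                        # weighted average of the values

# Toy example: 3 words, d_k = 4 (random vectors stand in for learned projections)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per word
```

Each output row is a weighted mix of all value vectors, which is exactly how every word "sees" every other word in one step.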

Interactive Attention Demo

(Interactive widget: clicking a word shows what it "pays attention" to; brighter = more attention.)

Eight Heads Are Better Than One

Instead of computing attention once, the Transformer splits it into 8 parallel "heads", each learning to focus on different types of relationships.

🔍 Syntactic Heads

Some heads learn grammatical structure, connecting subjects to verbs or tracking sentence clauses.

🔗 Reference Heads

Other heads resolve references, figuring out that "its" in "its application" refers to "The Law".

📍 Position Heads

Some heads attend primarily to nearby words, learning local phrase structure and word order.
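The splitting itself is simple. Here is a minimal NumPy sketch using the paper's shapes (d_model = 512 divided into h = 8 heads of 64 dimensions each); the small random matrices stand in for the learned projection weights:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h            # 64 dimensions per head
seq_len = 10
rng = np.random.default_rng(2)
x = rng.standard_normal((seq_len, d_model))

# Random stand-ins for the learned projections W_Q, W_K, W_V (one per head) and W_O
Wq = rng.standard_normal((h, d_model, d_k)) * 0.02
Wk = rng.standard_normal((h, d_model, d_k)) * 0.02
Wv = rng.standard_normal((h, d_model, d_k)) * 0.02
Wo = rng.standard_normal((h * d_k, d_model)) * 0.02

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

heads = []
for i in range(h):            # each head attends in its own 64-dim subspace
    Q, K, V = x @ Wq[i], x @ Wk[i], x @ Wv[i]
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

out = np.concatenate(heads, axis=-1) @ Wo   # concat the 8 heads, project back to d_model
print(out.shape)              # (10, 512)
```

Because each head works in a 64-dimensional subspace, the total cost stays close to a single 512-dimensional attention, while the heads are free to specialize.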

Inside the Transformer

The full architecture has an Encoder (reads the input) and a Decoder (generates the output), each made of 6 identical layers stacked on top of each other.

📝 Input Embedding + Positional Encoding
Convert words to vectors, add position info
🔮 Self-Attention Layer
Every position attends to every other position
⚙️ Feed-Forward Network
Two linear layers with ReLU activation
➕ Add & Normalize
Residual connections + layer normalization
🔄 Encoder-Decoder Attention
Decoder attends to encoder output
📤 Output Layer
Linear projection + Softmax → next word

Each part of the Transformer plays a specific role.

How does it know word order?

Since attention looks at everything simultaneously, the model has no inherent sense of order. Sinusoidal positional encodings are added to give each position a unique fingerprint.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   |   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Positional Encoding Heatmap

Each row is a position (0-49), each column is a dimension. The wave-like patterns give each position a unique signature.
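The grid in the heatmap comes directly from the two formulas. A short NumPy sketch that builds the same 50-position grid, assuming the paper's d_model = 512:

```python
import numpy as np

def positional_encoding(max_pos, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same angle)
    pos = np.arange(max_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)   # one row per position, as in the heatmap
print(pe.shape)                     # (50, 512)
```

Low dimensions oscillate quickly and high dimensions slowly, so each position gets a distinct combination of wave values: its fingerprint.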

Crushing the Competition

The Transformer didn't just match previous models: it blew past them while training faster and cheaper.

BLEU Scores: English → German Translation

Higher is better. The Transformer beat even ensembles of older models.

10-40×

Less training compute than competing models

3.5

Days to train on 8 GPUs, vs. weeks for competitors

+2.0

BLEU points better than previous best (including ensembles)

What made it work?

Scaled Dot-Product Attention

Dividing by √d_k prevents dot products from growing too large, keeping gradients stable during training.

Multi-Head Attention (h=8)

Running 8 attention heads in parallel, each with 64 dimensions instead of one head with 512, captures diverse relationships at similar cost.

Positional Encoding

Sinusoidal functions encode position, allowing the model to generalize to sequences longer than those seen during training.

Residual Connections + Layer Norm

Skip connections around every sub-layer prevent the vanishing gradient problem in deep stacks of 6 layers.

Warmup Learning Rate Schedule

Linearly increasing the learning rate for 4,000 steps, then decaying it; a recipe that became standard in modern AI.
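The paper's schedule (Section 5.3) fits in one line: the rate is d_model^-0.5 · min(step^-0.5, step · warmup^-1.5), with 4,000 warmup steps.

```python
def lrate(step, d_model=512, warmup=4000):
    # Warmup phase: step * warmup^-1.5 grows linearly and is the smaller term.
    # After `warmup` steps: step^-0.5 takes over, giving inverse-sqrt decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rises linearly for 4,000 steps, peaks, then decays with 1/sqrt(step):
assert lrate(2000) < lrate(4000) > lrate(8000)
```

The two terms cross exactly at step = warmup, which is why the peak lands at step 4,000.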

What happens when you change things?

The authors carefully tested what each component contributes. Here's what they found:

Impact of Number of Attention Heads

BLEU score on EN→DE dev set. Sweet spot around 8 heads; too few or too many hurts.

Impact of Model Size

Number of layers (N) vs. BLEU score. Deeper models generally perform better.

The paper that launched a revolution

The Transformer architecture became the foundation for virtually all modern AI systems.

🤖 GPT Series

OpenAI's GPT-1 through GPT-4 are all based on the Transformer decoder. ChatGPT descends directly from this paper.

🧪 BERT

Google's BERT used the Transformer encoder for bidirectional language understanding, revolutionizing NLP.

🎨 Diffusion Models

Image generators like DALL-E and Stable Diffusion build on attention and Transformer components for visual creation.

🧬 AlphaFold

DeepMind's protein structure prediction uses attention mechanisms inspired by this work to predict how proteins fold.

🎵 Audio & Music

Whisper (speech recognition), MusicLM, and other audio models all use Transformer architectures.

🌍 100k+ Citations

One of the most cited papers in all of computer science, fundamentally reshaping the field of artificial intelligence.