Self-Improving AI

Darwin Gödel Machine

An AI system that rewrites its own code to become a better coder — inspired by Darwinian evolution and open-ended exploration.

20→50%
SWE-bench improvement
14→31%
Polyglot improvement
80
Evolution iterations
The Big Idea

What if AI could improve itself?

The Darwin Gödel Machine (DGM) is a coding AI that edits its own source code to get better at coding — which also makes it better at editing its own source code.

🧬

Self-Referential

The same system that solves coding tasks also modifies its own codebase. Better coding skills = better self-improvement skills.

🌳

Open-Ended Evolution

Like biological evolution, it maintains an archive of diverse agent variants — "stepping stones" that may enable future breakthroughs.

🧪

Empirically Validated

Instead of proving changes are beneficial (intractable in practice), it tests every modification on real coding benchmarks.

How It Works

The Evolution Loop

Each iteration: select a parent agent from the archive, have it modify its own code, evaluate the result, and add it to the growing family tree.

1
Select Parent
Pick an agent from the archive based on performance & novelty
2
Self-Modify
The agent analyzes its failures and edits its own source code
3
Evaluate
Test the new agent on coding benchmarks
4
Archive
Add viable agents to the growing evolutionary tree
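The four steps above can be sketched as a short Python loop. Everything here is an illustrative stand-in, not the paper's implementation: the toy `Agent` "edits" a single number instead of source code, `evaluate` stands in for SWE-bench, and `select_parent` mixes score with a simple novelty bonus.

```python
import random

class Agent:
    """Toy stand-in for a coding agent whose 'code' is a single number."""
    def __init__(self, genome, parent=None):
        self.genome = genome
        self.parent = parent
        self.children = 0
        self.score = 0.0

    def self_modify(self):
        # A real agent rewrites its own source; here we just perturb a value.
        self.children += 1
        return Agent(self.genome + random.gauss(0, 0.1), parent=self)

def evaluate(agent):
    # Stand-in benchmark: closer to 1.0 is better (the real DGM uses SWE-bench).
    return max(0.0, 1.0 - abs(1.0 - agent.genome))

def select_parent(archive):
    # Weight parents by score plus a novelty bonus for rarely-expanded nodes.
    weights = [a.score + 1.0 / (1 + a.children) for a in archive]
    return random.choices(archive, weights=weights)[0]

def dgm_loop(iterations=80):
    root = Agent(genome=0.2)
    root.score = evaluate(root)
    archive = [root]                    # grows into the evolutionary tree
    for _ in range(iterations):
        parent = select_parent(archive)  # 1. select parent
        child = parent.self_modify()     # 2. self-modify
        child.score = evaluate(child)    # 3. evaluate
        if child.score > 0:              # 4. archive viable agents
            archive.append(child)
    return archive

random.seed(0)
archive = dgm_loop()
best = max(archive, key=lambda a: a.score)
```

Note that low-scoring agents stay in the archive: that is the "stepping stone" idea, since a mediocre parent can still produce the breakthrough child.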
Inspiration

Where Darwin meets Gödel

The DGM blends two powerful ideas: the Gödel Machine's dream of provably self-improving AI, and Darwinian evolution's strategy of diversity-driven exploration.

🔢 Original Gödel Machine

Requires mathematical proof before each change
Impractical for real-world systems
Single trajectory — can get stuck
Theoretically elegant

🧬 Darwin Gödel Machine

Validates changes empirically via benchmarks
Practical and demonstrates real gains
Population-based — explores many paths at once
Open-ended — stepping stones enable breakthroughs
Results

Performance Over Time

The DGM automatically discovers better tools and workflows, steadily improving its coding ability across 80 iterations.

SWE-bench: Best Agent Score Over Iterations
Polyglot: Best Agent Score Over Iterations
DGM vs. Baselines & Handcrafted Solutions
What It Discovered

Innovations Along the Way

The DGM independently invented techniques that human engineers also use — discovering them through automated self-improvement.

Non-Empty Patch Validation & Retry
Checks if a generated code patch actually modifies source files. If not, retries with a fresh approach.
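A minimal sketch of that check, assuming patches arrive as unified-diff text and `generate_patch` is a hypothetical callable wrapping the model:

```python
def patch_touches_source(patch_text):
    """Return True if a unified diff actually modifies at least one file."""
    return any(line.startswith(("--- ", "+++ ")) for line in patch_text.splitlines())

def generate_with_retry(generate_patch, max_retries=3):
    """Retry until the model produces a non-empty patch that touches a file."""
    for _ in range(max_retries):
        patch = generate_patch()
        if patch.strip() and patch_touches_source(patch):
            return patch
    return None  # give up; the caller can fall back to a fresh approach
```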
Granular File Viewing (by line range)
Instead of loading entire files, view only specific line ranges — critical for large codebases that exceed context windows.
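A sketch of such a viewing tool, operating on file contents already read into a string (function name and numbering format are illustrative):

```python
def view_range(text, start, end):
    """Return only lines start..end (1-indexed, inclusive), each prefixed
    with its line number, instead of the whole file's contents."""
    lines = text.splitlines()
    window = lines[start - 1:end]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(window))
```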
String-Replace Editing
Replaced whole-file overwrites with precise find-and-replace, reducing unintended side effects.
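One common way to make such an edit safe, used here as an illustrative sketch: require the target string to match exactly once, so the model cannot silently edit the wrong occurrence.

```python
def str_replace_edit(text, old, new):
    """Apply a find-and-replace edit only if `old` appears exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new, 1)
```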
Auto-Summarize on Context Limit
When the conversation gets too long, automatically summarizes older messages to stay within the model's context window.
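A sketch of that compaction step, assuming a chat-message list and a hypothetical `summarize` callable that asks the LLM to condense old turns into one string:

```python
def compact_history(messages, summarize, max_messages=40, keep_recent=10):
    """Once the conversation exceeds max_messages, replace the older turns
    with a single summary message while keeping the most recent ones intact."""
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```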
Multiple Patch Generations & Ranking
Generates several candidate solutions and uses a peer-review mechanism to select the best one.
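The pattern is best-of-n sampling. In the DGM the ranking step is itself model-driven; in this sketch `score` is simply any callable returning a comparable number, and both arguments are hypothetical:

```python
def best_of_n(generate, score, n=3):
    """Generate n candidate solutions and keep the highest-scoring one."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)
```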
History-Aware Attempts
Later attempts are informed by previous solutions and their test results, enabling iterative refinement.
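A sketch of that refinement loop. `attempt(history)` and `run_tests(solution)` are hypothetical callables: the first wraps the model and sees all prior solutions with their test reports, the second returns a pass flag and a report.

```python
def solve_with_history(attempt, run_tests, max_attempts=3):
    """Each new attempt is conditioned on previous solutions and their
    test results, enabling iterative refinement rather than restarts."""
    history = []
    for _ in range(max_attempts):
        solution = attempt(history)
        passed, report = run_tests(solution)
        if passed:
            return solution
        history.append((solution, report))
    return None
```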
The Archive

An Evolutionary Tree of AI Agents

Each node is a coding agent. Colors show performance. The path to the best agent (★) passes through lower-performing "stepping stones," illustrating why open-ended search works.

Generalization

Improvements Transfer Everywhere

The DGM's discoveries aren't overfitted — they generalize across different AI models, benchmarks, and programming languages.

🔄 Model Transfer

Searched with Claude 3.5 Sonnet, tested on others

19% → 59%
Claude 3.7 Sonnet

📊 Benchmark Transfer

Trained on SWE-bench, tested on Polyglot

14.2% → 28.9%
Zero-shot transfer

🌐 Language Transfer

Trained on Python, tested on C++, Rust, etc.

15.2% → 30.8%
Non-Python tasks
Key Insight

Why Open-Ended Exploration Matters

Without maintaining a diverse archive of agent variants, the system gets stuck. The DGM's Darwinian approach dramatically outperforms hill-climbing.

📉

Without Self-Improvement

Using a fixed base agent to generate new variants: early gains taper off quickly. Max: ~39% on SWE-bench.

📈

Without Open-Ended Exploration

Self-improving but always from the latest agent (no archive): gets stuck at ~23% on SWE-bench after bad modifications.

🚀

Full DGM

Both self-improvement AND an evolutionary archive: reaches 50% on SWE-bench, matching human-designed state-of-the-art.

Responsible AI

Safety First

Self-improving AI raises important questions. The DGM incorporates multiple safeguards and the authors advocate for continued safety research.

🔒

Sandboxed Execution

All agents run in isolated environments with no access to the host system.

⏱️

Time Limits

Strict timeouts prevent resource exhaustion or unbounded behavior.
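A minimal sketch of a hard wall-clock limit using the standard library; real isolation (containers, no network access) is assumed to be handled elsewhere, and the function name is illustrative:

```python
import subprocess
import sys

def run_agent_sandboxed(cmd, timeout_s=300):
    """Run an agent command with a hard time limit; kill it on expiry."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        return None, "killed: time limit exceeded"
```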

🔍

Full Traceability

Every modification is tracked in a tree structure, enabling human review.

📋

Scoped Modifications

Changes limited to improving performance on well-defined coding benchmarks.