Self-Improving AI

Darwin Gödel Machine

An AI system that rewrites its own code to become a better coder — inspired by Darwinian evolution and open-ended exploration.

20→50%
SWE-bench improvement
14→31%
Polyglot improvement
80
Evolution iterations
The Big Idea

What if AI could improve itself?

The Darwin Gödel Machine (DGM) is a coding AI that edits its own source code to get better at coding — which also makes it better at editing its own source code.

🧬

Self-Referential

The same system that solves coding tasks also modifies its own codebase. Better coding skills = better self-improvement skills.

🌳

Open-Ended Evolution

Like biological evolution, it maintains an archive of diverse agent variants — "stepping stones" that may enable future breakthroughs.

🧪

Empirically Validated

Instead of proving changes are beneficial (intractable in practice), it tests every modification on real coding benchmarks.

How It Works

The Evolution Loop

Each iteration: select a parent agent from the archive, have it modify its own code, evaluate the result, and add it to the growing family tree.

1
Select Parent
Pick an agent from the archive based on performance & novelty
2
Self-Modify
The agent analyzes its failures and edits its own source code
3
Evaluate
Test the new agent on coding benchmarks
4
Archive
Add viable agents to the growing evolutionary tree
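The four steps above can be sketched as a short Python loop. Everything here is an illustrative stand-in, not the paper's implementation: the toy `Agent` "edits" a single number instead of source code, `evaluate` stands in for SWE-bench, and `select_parent` mixes score with a simple novelty bonus.

```python
import random

class Agent:
    """Toy stand-in for a coding agent whose 'code' is a single number."""
    def __init__(self, genome, parent=None):
        self.genome = genome
        self.parent = parent
        self.children = 0
        self.score = 0.0

    def self_modify(self):
        # A real agent rewrites its own source; here we just perturb a value.
        self.children += 1
        return Agent(self.genome + random.gauss(0, 0.1), parent=self)

def evaluate(agent):
    # Stand-in benchmark: closer to 1.0 is better (the real DGM uses SWE-bench).
    return max(0.0, 1.0 - abs(1.0 - agent.genome))

def select_parent(archive):
    # Weight parents by score plus a novelty bonus for rarely-expanded nodes.
    weights = [a.score + 1.0 / (1 + a.children) for a in archive]
    return random.choices(archive, weights=weights)[0]

def dgm_loop(iterations=80):
    root = Agent(genome=0.2)
    root.score = evaluate(root)
    archive = [root]                    # grows into the evolutionary tree
    for _ in range(iterations):
        parent = select_parent(archive)  # 1. select parent
        child = parent.self_modify()     # 2. self-modify
        child.score = evaluate(child)    # 3. evaluate
        if child.score > 0:              # 4. archive viable agents
            archive.append(child)
    return archive

random.seed(0)
archive = dgm_loop()
best = max(archive, key=lambda a: a.score)
```

Note that low-scoring agents stay in the archive: that is the "stepping stone" idea, since a mediocre parent can still produce the breakthrough child.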
Inspiration

Where Darwin meets Gödel

The DGM blends two powerful ideas: the Gödel Machine's dream of provably self-improving AI, and Darwinian evolution's strategy of diversity-driven exploration.

🔢 Original Gödel Machine

Requires mathematical proof before each change
Impractical for real-world systems
Single trajectory — can get stuck
Theoretically elegant

🧬 Darwin Gödel Machine

Validates changes empirically via benchmarks
Practical and demonstrates real gains
Population-based — explores many paths at once
Open-ended — stepping stones enable breakthroughs
Results

Performance Over Time

The DGM automatically discovers better tools and workflows, steadily improving its coding ability across 80 iterations.

SWE-bench: Best Agent Score Over Iterations
Polyglot: Best Agent Score Over Iterations
DGM vs. Baselines & Handcrafted Solutions
What It Discovered

Innovations Along the Way

The DGM independently invented techniques that human engineers also use — discovering them through automated self-improvement.

Non-Empty Patch Validation & Retry
Checks if a generated code patch actually modifies source files. If not, retries with a fresh approach.
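A minimal sketch of that check, assuming patches arrive as unified-diff text and `generate_patch` is a hypothetical callable wrapping the model:

```python
def patch_touches_source(patch_text):
    """Return True if a unified diff actually modifies at least one file."""
    return any(line.startswith(("--- ", "+++ ")) for line in patch_text.splitlines())

def generate_with_retry(generate_patch, max_retries=3):
    """Retry until the model produces a non-empty patch that touches a file."""
    for _ in range(max_retries):
        patch = generate_patch()
        if patch.strip() and patch_touches_source(patch):
            return patch
    return None  # give up; the caller can fall back to a fresh approach
```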
Granular File Viewing (by line range)
Instead of loading entire files, view only specific line ranges — critical for large codebases that exceed context windows.
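A sketch of such a viewing tool, operating on file contents already read into a string (function name and numbering format are illustrative):

```python
def view_range(text, start, end):
    """Return only lines start..end (1-indexed, inclusive), each prefixed
    with its line number, instead of the whole file's contents."""
    lines = text.splitlines()
    window = lines[start - 1:end]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(window))
```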
String-Replace Editing
Replaced whole-file overwrites with precise find-and-replace, reducing unintended side effects.
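One common way to make such an edit safe, used here as an illustrative sketch: require the target string to match exactly once, so the model cannot silently edit the wrong occurrence.

```python
def str_replace_edit(text, old, new):
    """Apply a find-and-replace edit only if `old` appears exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new, 1)
```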
Auto-Summarize on Context Limit
When the conversation gets too long, automatically summarizes older messages to stay within the model's context window.
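A sketch of that compaction step, assuming a chat-message list and a hypothetical `summarize` callable that asks the LLM to condense old turns into one string:

```python
def compact_history(messages, summarize, max_messages=40, keep_recent=10):
    """Once the conversation exceeds max_messages, replace the older turns
    with a single summary message while keeping the most recent ones intact."""
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```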
Multiple Patch Generations & Ranking
Generates several candidate solutions and uses a peer-review mechanism to select the best one.
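The pattern is best-of-n sampling. In the DGM the ranking step is itself model-driven; in this sketch `score` is simply any callable returning a comparable number, and both arguments are hypothetical:

```python
def best_of_n(generate, score, n=3):
    """Generate n candidate solutions and keep the highest-scoring one."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)
```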
History-Aware Attempts
Later attempts are informed by previous solutions and their test results, enabling iterative refinement.
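A sketch of that refinement loop. `attempt(history)` and `run_tests(solution)` are hypothetical callables: the first wraps the model and sees all prior solutions with their test reports, the second returns a pass flag and a report.

```python
def solve_with_history(attempt, run_tests, max_attempts=3):
    """Each new attempt is conditioned on previous solutions and their
    test results, enabling iterative refinement rather than restarts."""
    history = []
    for _ in range(max_attempts):
        solution = attempt(history)
        passed, report = run_tests(solution)
        if passed:
            return solution
        history.append((solution, report))
    return None
```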
The Archive

An Evolutionary Tree of AI Agents

Each node is a coding agent. Colors show performance. The path to the best agent (★) passes through lower-performing "stepping stones," illustrating why open-ended search works.

Generalization

Improvements Transfer Everywhere

The DGM's discoveries aren't overfitted — they generalize across different AI models, benchmarks, and programming languages.

🔄 Model Transfer

Searched with Claude 3.5 Sonnet, tested on others

19% → 59%
Claude 3.7 Sonnet

📊 Benchmark Transfer

Trained on SWE-bench, tested on Polyglot

14.2% → 28.9%
Zero-shot transfer

🌐 Language Transfer

Trained on Python, tested on C++, Rust, etc.

15.2% → 30.8%
Non-Python tasks
Key Insight

Why Open-Ended Exploration Matters

Without maintaining a diverse archive of agent variants, the system gets stuck. The DGM's Darwinian approach dramatically outperforms hill-climbing.

📉

Without Self-Improvement

Using a fixed base agent to generate new variants: early gains taper off quickly. Max: ~39% on SWE-bench.

📈

Without Open-Ended Exploration

Self-improving but always from the latest agent (no archive): gets stuck at ~23% on SWE-bench after bad modifications.

🚀

Full DGM

Both self-improvement AND an evolutionary archive: reaches 50% on SWE-bench, matching human-designed state-of-the-art.

Responsible AI

Safety First

Self-improving AI raises important questions. The DGM incorporates multiple safeguards and the authors advocate for continued safety research.

🔒

Sandboxed Execution

All agents run in isolated environments with no access to the host system.

⏱️

Time Limits

Strict timeouts prevent resource exhaustion or unbounded behavior.
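A minimal sketch of a hard wall-clock limit using the standard library; real isolation (containers, no network access) is assumed to be handled elsewhere, and the function name is illustrative:

```python
import subprocess
import sys

def run_agent_sandboxed(cmd, timeout_s=300):
    """Run an agent command with a hard time limit; kill it on expiry."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        return None, "killed: time limit exceeded"
```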

🔍

Full Traceability

Every modification is tracked in a tree structure, enabling human review.

📋

Scoped Modifications

Changes limited to improving performance on well-defined coding benchmarks.