An AI system that rewrites its own code to become a better coder — inspired by Darwinian evolution and open-ended exploration.
The Darwin Gödel Machine (DGM) is a coding AI that edits its own source code to get better at coding — which also makes it better at editing its own source code.
The same system that solves coding tasks also modifies its own codebase. Better coding skills = better self-improvement skills.
Like biological evolution, it maintains an archive of diverse agent variants — "stepping stones" that may enable future breakthroughs.
Instead of formally proving that changes are beneficial (impractical for all but trivial systems), it empirically tests every modification on real coding benchmarks.
Each iteration: select a parent agent from the archive, have it modify its own code, evaluate the result, and add it to the growing family tree.
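The iteration described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `self_modify` and `evaluate` are placeholder stand-ins for the real LLM-driven code editing and benchmark harness, and the selection weights are invented for the example.

```python
import random

def self_modify(code: str) -> str:
    # Placeholder: the real agent asks a foundation model to rewrite
    # its own source code. Here we just tag the code as changed.
    return code + " (modified)"

def evaluate(code: str) -> float:
    # Placeholder: the real system scores the agent on coding benchmarks.
    return random.random()

def dgm_step(archive: list) -> list:
    """One iteration: select a parent, self-modify, evaluate, archive."""
    # Selection is biased toward higher scores but keeps low scorers
    # reachable, so potential "stepping stones" are never discarded.
    weights = [0.05 + agent["score"] for agent in archive]
    parent = random.choices(archive, weights=weights, k=1)[0]

    child_code = self_modify(parent["code"])   # agent edits its own code
    score = evaluate(child_code)               # empirical test, not proof

    # Every valid child joins the archive, even if it scores worse
    # than its parent.
    archive.append({"code": child_code, "score": score})
    return archive

archive = [{"code": "initial agent", "score": 0.2}]
for _ in range(3):
    archive = dgm_step(archive)
print(len(archive))  # 4 agents after three iterations
```

Note that children are archived unconditionally; pruning losers would collapse the search back into hill-climbing.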
The DGM blends two powerful ideas: the Gödel Machine's dream of provably self-improving AI, and Darwinian evolution's strategy of diversity-driven exploration.
The DGM automatically discovers better tools and workflows, steadily improving its coding ability across 80 iterations.
The DGM independently invented techniques that human engineers also use — discovering them through automated self-improvement.
Each node is a coding agent. Colors show performance. The path to the best agent (★) traverses lower-performing "stepping stones", illustrating why open-ended search works.
The DGM's discoveries aren't overfitted — they generalize across different AI models, benchmarks, and programming languages.
Searched with Claude 3.5 Sonnet, tested on others
Optimized on SWE-bench, tested on Polyglot
Optimized for Python, tested on C++, Rust, and other languages
Without maintaining a diverse archive of agent variants, the system gets stuck. The DGM's Darwinian approach dramatically outperforms hill-climbing.
Using a fixed base agent to generate new variants: early gains taper off quickly. Max: ~39% on SWE-bench.
Self-improving but always from the latest agent (no archive): gets stuck at ~23% on SWE-bench after bad modifications.
Both self-improvement AND an evolutionary archive: reaches 50% on SWE-bench, matching human-designed state-of-the-art.
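The three settings above differ only in where the next parent comes from. A sketch of the contrast, using hypothetical archive entries with a single `"score"` field:

```python
import random

def select_fixed_base(archive):
    # Baseline: always mutate the original, unmodified agent,
    # so improvements never compound.
    return archive[0]

def select_latest(archive):
    # No archive: always build on the most recent agent, so a single
    # bad modification can trap the entire run.
    return archive[-1]

def select_from_archive(archive):
    # DGM: sample across the whole archive, weighted by score, so the
    # search can route around dead ends via lower-performing variants.
    weights = [0.05 + a["score"] for a in archive]
    return random.choices(archive, weights=weights, k=1)[0]

# Illustrative archive: the latest agent happens to be a regression.
archive = [{"score": 0.2}, {"score": 0.5}, {"score": 0.1}]
```

With `select_latest`, every future child descends from the 0.1-scoring regression; `select_from_archive` can still return to the 0.5 agent.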
Self-improving AI raises important questions. The DGM incorporates multiple safeguards and the authors advocate for continued safety research.
All agents run in isolated environments with no access to the host system.
Strict timeouts prevent resource exhaustion or unbounded behavior.
Every modification is tracked in a tree structure, enabling human review.
Changes limited to improving performance on well-defined coding benchmarks.
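The first two safeguards (isolation and strict timeouts) can be approximated with a child process and a hard time limit. This is a simplified stand-in, not the actual sandbox: real isolation requires containers or VMs, since a bare subprocess still shares the host filesystem.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: int = 5) -> str:
    """Execute untrusted agent code in a child process with a timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],  # separate interpreter process
            capture_output=True,
            text=True,
            timeout=timeout_s,  # strict timeout: kill runaway agents
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        # Unbounded behavior (e.g. an infinite loop) is cut off here.
        return "TIMEOUT"

print(run_sandboxed("print(2 + 2)"))  # → 4
```

An infinite loop such as `run_sandboxed("while True: pass", timeout_s=1)` returns `"TIMEOUT"` after one second instead of hanging the system.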