What if AI could learn to
get better at getting better?

HyperAgents are self-improving AI systems that don't just search for better solutions — they continually improve how they search.

Zhang, Zhao, Yang, Foerster, Clune, Jiang, Devlin & Shavrina  |  UBC, Vector Institute, Edinburgh, NYU, Meta

The Ceiling Problem in Self-Improving AI

Imagine you're learning to cook. You follow a recipe, taste the result, and adjust. Over time, you get better at that dish. But what if you could also upgrade the recipe book itself — refine your tasting technique, invent new cooking methods, even learn how to learn from failures faster? That's the difference between regular improvement and meta-improvement.

Most AI systems that "improve themselves" are stuck at the first level. They have a fixed improvement strategy — like always following the same recipe book. The strategy was designed by a human, and the AI can never change it. This creates a hard ceiling on progress.

Before: Fixed Improvement

A human designs the improvement recipe. The AI follows it exactly, forever. If the recipe has blind spots, the AI can never overcome them.

After: Self-Improving Improvement

The AI can rewrite its own improvement recipe. It notices what works, invents new strategies, and builds better tools for future upgrades.

This paper introduces hyperagents — AI programs that can modify every part of themselves, including the part responsible for deciding how to modify themselves. The result? AI that improves its ability to improve, across any domain — not just coding.

Three Ideas You Need First

Before we get to the breakthrough, let's build up three key concepts, one at a time.

1. Task Agent vs. Meta Agent

Think of a restaurant. The task agent is the chef — it cooks the food (solves the actual problem). The meta agent is the restaurant manager — it decides how to train the chef, what recipes to try, and how to restructure the kitchen. In AI, the task agent does the work; the meta agent modifies the system to do work better.

2. Open-Ended Exploration (The Archive)

Imagine evolution in nature. It doesn't just keep the single "best" creature — it maintains a diverse population where different traits can recombine in surprising ways. Similarly, the system here maintains an archive — a growing library of agent variants. New agents are created by selecting promising "parents" from the archive and modifying them. Some variants are stepping stones: they're not great on their own, but their descendants become excellent.
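The archive mechanic can be sketched in a few lines. Everything here (the field names, the sampling floor) is illustrative rather than from the paper — the point is that selection is weighted toward strong performers while weaker variants still occasionally get picked, which is what keeps stepping stones alive.

```python
import random

# Toy archive of agent variants: every variant is kept, not just the best one.
archive = [
    {"name": "gen0", "score": 0.08},
    {"name": "gen3", "score": 0.21},   # mediocre now, maybe a stepping stone
    {"name": "gen7", "score": 0.34},
]

def select_parent(archive, rng):
    # Performance-weighted sampling: better agents are *more likely* to be
    # chosen as parents, but a small floor keeps every variant in play.
    weights = [entry["score"] + 0.05 for entry in archive]
    return rng.choices(archive, weights=weights, k=1)[0]

rng = random.Random(0)
parent = select_parent(archive, rng)
```

The small additive floor is one simple way to preserve diversity; without it, a zero-scoring variant could never be selected again, and its potentially excellent descendants would be lost.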

3. The Darwin Gödel Machine (DGM)

The Darwin Gödel Machine (DGM) was a prior system that combined self-modification with open-ended exploration — but only for coding tasks. It worked because the thing the agent was evaluated on (writing code) was the same skill it needed to improve itself (also writing code). But what if you want an AI that reviews research papers? Getting better at reviewing papers doesn't automatically make the AI better at rewriting its own code. The skills don't align. That's the gap hyperagents fill.

The "Aha" — Merging Task & Meta Into One Editable Program

Here's the key insight: put the chef and the manager in the same editable document. When the AI modifies itself, it can change how it solves tasks and how it decides what changes to make. This is called metacognitive self-modification — improving the improvement process itself.
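A minimal sketch of that idea, with toy stand-ins for the two components (the class and its methods are hypothetical, not the paper's code): the task logic and the modification logic live in one editable structure, and the modifier receives the whole codebase, so a self-modification step can rewrite either part, including the modifier itself.

```python
class HyperAgent:
    """Illustrative sketch: chef and manager in one editable document."""

    def __init__(self, task_code, meta_code):
        # Both components live in a single modifiable "codebase".
        self.code = {"task_agent": task_code, "meta_agent": meta_code}

    def solve(self, task):
        return self.code["task_agent"](task)

    def self_modify(self):
        # The meta agent sees the *entire* codebase, so it can propose edits
        # to the task agent, to itself, or to both.
        new_code = self.code["meta_agent"](dict(self.code))
        return HyperAgent(new_code["task_agent"], new_code["meta_agent"])

# Toy components: a task agent that uppercases text, and a meta agent that
# (in this trivial example) leaves the codebase unchanged.
agent = HyperAgent(lambda t: t.upper(), lambda code: code)
child = agent.self_modify()
```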

"A hyperagent combines the task agent and the meta agent into a single self-referential, modifiable program, such that the mechanism responsible for generating improvements is itself subject to modification."
1. Start Simple

The system begins with a bare-bones hyperagent: a task agent that just calls an AI model once, and a meta agent that simply says "modify any part of the codebase." No clever tricks.

2. Self-Modify

A parent hyperagent is selected from the archive. Its meta agent reads the code, reviews past performance, and proposes modifications — to any part: the task agent, the meta agent, or both.

3. Evaluate

The new hyperagent is tested on real tasks (coding problems, paper reviews, robotics challenges, math grading). Its performance is recorded.

4. Add to Archive & Repeat

The new variant joins the archive. Better-performing agents are more likely to be selected as parents. Over time, the archive becomes a rich landscape of increasingly capable agents — and increasingly capable improvers.
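The four steps above can be sketched as a single loop. The modification and evaluation functions here are toy stand-ins (an "agent" is just a number), not the paper's implementation — the sketch only shows how selection, modification, evaluation, and archiving fit together.

```python
import random

def run_dgm_h(initial_agent, modify, evaluate, iterations, seed=0):
    rng = random.Random(seed)
    # 1. Start simple: the archive begins with one bare-bones agent.
    archive = [{"agent": initial_agent, "score": evaluate(initial_agent)}]
    for _ in range(iterations):
        # 2. Self-modify: select a parent, weighted toward better performers.
        weights = [e["score"] + 0.05 for e in archive]
        parent = rng.choices(archive, weights=weights, k=1)[0]
        child = modify(parent["agent"], rng)
        # 3. Evaluate, then 4. add to the archive and repeat.
        archive.append({"agent": child, "score": evaluate(child)})
    return max(archive, key=lambda e: e["score"])

# Toy domain: modification perturbs the number and evaluation rewards
# being close to 1.0.
best = run_dgm_h(
    initial_agent=0.0,
    modify=lambda a, rng: a + rng.uniform(-0.1, 0.2),
    evaluate=lambda a: max(0.0, 1.0 - abs(1.0 - a)),
    iterations=50,
)
```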

Does It Actually Work? Across Four Very Different Domains

The researchers tested DGM-Hyperagents (DGM-H) on four wildly different tasks, demonstrating that the approach isn't limited to one domain. Each experiment ran 5 times with medians and 95% confidence intervals reported.

Polyglot Coding Benchmark

Task: Given a code repository and a natural-language instruction, modify the code correctly. Tested with pass@1 (no peeking at test results).

Why it matters: This is the DGM's home turf. Can DGM-H match a system custom-built for coding?

Result: DGM-H improved from 0.084 → 0.267 on the full benchmark (165 unseen tasks). The original DGM achieved 0.142 → 0.307. DGM-H reaches comparable performance despite not being designed for coding.

Training subset: 60 tasks. Full test: 165 tasks. FM: Claude 3.5 Sonnet (modification) + o3-mini (evaluation). 80 iterations.

AI Paper Review

Task: Read a full AI research paper and predict accept/reject, matching real conference decisions from ICLR 2024/2025 and NeurIPS 2023/2024.

Why it matters: This is a subjective task with no objective ground truth — a hard test for self-improvement.

Result: DGM-H improved from 0.0 → 0.710 on test tasks. The initial agent scored 0.0 (couldn't even format outputs correctly). The static baseline (AI-Scientist-v2) scored 0.630. DGM-H surpassed it.

100 train / 100 validation / 100 test tasks. FM: Claude 4.5 Sonnet (modification) + GPT-4o (evaluation). 100 iterations.

Robotics Reward Design

Task: Design a reward function so a simulated quadruped robot learns a specific behavior (walking forward during training, maximizing torso height at test).

Why it matters: This goes beyond language — agents must interact with physics simulators and reinforcement learning.

Result: DGM-H improved from 0.060 → 0.372 on test tasks, surpassing the default reward function (0.348). Crucially, DGM-H agents learned to make the robot jump rather than just stand tall — escaping a local optimum that trapped other methods.

6 train / 6 test evaluations. FM: Claude 4.5 Sonnet for both. 100 iterations.
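To see why jumping beats standing, consider a hypothetical height-based reward (an illustration, not the paper's evolved reward function): a standing policy earns the standing height every step, but a jumping policy can exceed it at the apex, which is how the jumping strategy escapes the "just stand tall" local optimum.

```python
def height_reward(torso_height, standing_height=0.35):
    """Hypothetical reward: torso height, plus a bonus for clearing the
    robot's normal standing height. Parameter values are made up."""
    bonus = max(0.0, torso_height - standing_height)
    return torso_height + 2.0 * bonus

standing = height_reward(0.35)    # steady standing posture
jump_apex = height_reward(0.60)   # mid-jump apex clears standing height
```

Under a reward shaped like this, a policy that occasionally reaches a high apex can out-earn one that holds the maximum static pose, so "jump" becomes discoverable.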

Olympiad-Level Math Grading

Task: Grade solutions to International Mathematical Olympiad problems (scores: 0, 1, 6, or 7 out of 7), matching expert human grades.

Why it matters: This domain was held out as a meta-evaluation — testing whether improvements learned elsewhere transfer here.

Result: With transfer from other domains + continued self-improvement, DGM-H improved ProofAutoGrader's accuracy from 0.561 → 0.601 on the full IMO-GradingBench, lowering mean absolute error from 0.178 to 0.175.

100 train / 100 validation / 100 test tasks. FM: Claude 4.5 Sonnet (modification) + o4-mini (evaluation). 200 iterations.

Test Performance: DGM-H vs. Baselines (Paper Review & Robotics)

Ablation: Every Component Matters

The researchers systematically removed components to prove each one is necessary:

| Method | Paper Review (Test) | Robotics (Test) |
| --- | --- | --- |
| Initial Agent | 0.0 | 0.060 |
| DGM (original, coding-specific) | 0.0 | 0.0 |
| DGM-custom (manually adapted) | 0.590 | 0.348 |
| DGM-H w/o self-improvement | 0.0 | 0.213 |
| DGM-H w/o open-ended exploration | 0.0 | 0.116 |
| DGM-H (full system) | 0.710 | 0.372 |

DGM-H significantly outperforms all ablations (p < 0.05, Wilcoxon signed-rank test). Medians over 5 runs with 95% bootstrap CIs.

It Doesn't Just Get Better — It Gets Better at Getting Better

This is the most remarkable finding. The researchers didn't just show that DGM-H improves task performance — they showed it improves its ability to generate improved agents. And these meta-improvements transfer across domains.

The Transfer Experiment

Think of it like this: you learn how to study effectively while preparing for history exams. Then you apply those study skills to chemistry — and they still work, even though the subjects are totally different. That's what happens here.

improvement@50: Can transferred agents generate improvements in a new domain?

What the numbers mean

imp@50 measures how much a meta agent can improve a task agent within 50 iterations. The initial meta agent achieves imp@50 ≈ 0 on Olympiad math grading — it can't generate better agents. Transfer agents from the original DGM also achieve ≈ 0. But transfer hyperagents from DGM-H achieve imp@50 = 0.630 — a massive leap, in a domain they were never trained on.
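One plausible reading of the metric — the exact definition here is an assumption — is the gain in the best task score a meta agent can reach within 50 self-improvement iterations:

```python
def improvement_at_k(initial_agent, meta_step, evaluate, k=50):
    """Hypothetical sketch of imp@k: run k meta-agent iterations and report
    the best score reached, relative to the starting agent."""
    agent = initial_agent
    base = evaluate(initial_agent)
    best = base
    for _ in range(k):
        agent = meta_step(agent)           # meta agent proposes a new task agent
        best = max(best, evaluate(agent))  # keep the best score seen so far
    return best - base

# A no-op meta agent never improves anything, mirroring the initial meta
# agent's imp@50 of ~0 described above. (Toy agents are just numbers.)
gain = improvement_at_k(
    initial_agent=0.0,
    meta_step=lambda a: a,   # never changes the task agent
    evaluate=lambda a: a,
    k=50,
)
```

A capable meta agent, by contrast, would push `best` well above the starting score within the same 50-iteration budget.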

What Did It Actually Learn?

The researchers found that DGM-H autonomously developed several general-purpose meta-capabilities — none of which were explicitly requested:

The hyperagent invented its own memory system — storing not just performance numbers, but synthesized insights: "Gen55 has best accuracy but is too harsh. Gen64 improved balance but lower accuracy. Need to combine gen55's critical reasoning with gen64's balance." Later generations consult this memory to avoid repeating past mistakes.

The system created a PerformanceTracker class that logs scores across generations, computes moving averages, and identifies improvement trends — enabling data-driven decisions about what to change next.
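A minimal reconstruction of what such a tracker might look like — the class name comes from the paper's description, but the internals (window size, trend heuristic) are assumptions:

```python
class PerformanceTracker:
    """Sketch of a generation-over-generation score tracker."""

    def __init__(self, window=3):
        self.window = window
        self.scores = []  # one score per generation, in order

    def log(self, score):
        self.scores.append(score)

    def moving_average(self):
        recent = self.scores[-self.window:]
        return sum(recent) / len(recent)

    def trend(self):
        # Crude trend signal: compare the recent window's average with the
        # window before it.
        if len(self.scores) < 2 * self.window:
            return "insufficient data"
        earlier = self.scores[-2 * self.window:-self.window]
        earlier_avg = sum(earlier) / self.window
        return "improving" if self.moving_average() > earlier_avg else "stalling"

tracker = PerformanceTracker(window=3)
for s in [0.10, 0.12, 0.15, 0.22, 0.25, 0.31]:
    tracker.log(s)
```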

In paper review, the system detected when its predictions collapsed (e.g., accepting 99% of papers). It automatically tracked label distributions, flagged bias when one class exceeded 80%, and surfaced corrective suggestions to the meta agent. After detection and correction, accuracy jumped from 49% to 63%.
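The detection logic described above might look something like this sketch — the function name and the 80% threshold match the behavior described, but the implementation is an assumption:

```python
from collections import Counter

def detect_label_collapse(predictions, threshold=0.80):
    """Flag the run when one predicted label dominates (e.g. accepting
    nearly every paper). Returns a warning string, or None if balanced."""
    counts = Counter(predictions)
    label, n = counts.most_common(1)[0]
    share = n / len(predictions)
    if share > threshold:
        return f"bias: predicting '{label}' {share:.0%} of the time"
    return None

# A collapsed run trips the detector; a balanced run does not.
warning = detect_label_collapse(["accept"] * 99 + ["reject"])
```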

The hyperagent learned to allocate its compute budget strategically: making bold architectural changes early (when there are many iterations left) and conservative fine-tuning late. This mirrors human intuition about exploration vs. exploitation, but was discovered autonomously.
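That budget-aware behavior can be sketched as a simple schedule. The thresholds and strategy names here are illustrative, not from the paper; the point is that the choice depends on the fraction of the iteration budget remaining.

```python
def choose_strategy(iteration, total_iterations):
    """Pick a modification strategy based on remaining compute budget."""
    remaining = (total_iterations - iteration) / total_iterations
    if remaining > 0.5:
        return "bold"          # plenty of budget left: try architectural changes
    elif remaining > 0.2:
        return "moderate"      # mid-run: targeted refinements
    return "conservative"      # near the end: fine-tune, don't break things

schedule = [choose_strategy(i, 100) for i in (10, 60, 95)]
```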

Compounding Across Runs

Improvements don't just transfer — they compound. Starting DGM-H from a transferred hyperagent on Olympiad math grading yields higher final performance (0.640) than starting from scratch (0.610). Adding ProofAutoGrader as the initial task agent pushes it to 0.700, surpassing ProofAutoGrader's own score of 0.670.

Compounding: IMO Math Grading Test Performance

Explore: How Self-Improvement Scales

In the interactive version, a slider shows how DGM-H performance changes as iterations increase. The system keeps finding improvements even at later stages: evidence that it's not hitting a ceiling.

[Interactive chart: Paper Review and Robotics training scores across 0–100 self-improvement iterations]

What About the Risks?

A system that can rewrite its own improvement process raises real concerns. The researchers were proactive about this.

Precautions Taken

All agent-generated code ran in sandboxed environments with strict resource limits (timeouts, no internet access). Evaluation used predefined tasks and metrics. Human oversight was maintained throughout all experiments.

Open Challenges

Speed vs. oversight: Self-improving systems may eventually evolve faster than humans can audit. The paper acknowledges this tension and calls for ongoing societal discussion about trust, transparency, and the roles these systems should play.

Bias amplification: If benchmarks encode biases, DGM-H will faithfully optimize for them. Careful benchmark design is essential.

Evaluation gaming: Like Goodhart's law ("when a measure becomes a target, it ceases to be a good measure"), agents might find ways to score well without truly solving the task. Diverse, refreshed evaluation protocols are needed.

What This Changes

Beyond Domain-Specific Engineering

Previous self-improving systems needed humans to manually design the improvement recipe for each new domain. DGM-H eliminates this requirement. The same system, starting from the same simple initial agent, self-improves across coding, paper review, robotics, and math grading.

Transferable Self-Improvement Skills

The meta-level skills DGM-H learns (memory systems, performance tracking, strategic planning) are general-purpose. They transfer across domains and continue to compound. This is the first demonstration that how to improve can be learned in a domain-general way.

Limitations

The system currently works with a fixed task distribution — it can't yet create its own training tasks. Some outer-loop components (parent selection, evaluation protocols) remain fixed in main experiments, though preliminary results show these can also be modified. The difference between DGM-H + transfer and DGM-H from scratch was not statistically significant in the compounding experiment, though the trend was consistently positive.

What You Can Tell Someone Over Coffee

🔑 Key Takeaway #1

A hyperagent merges "the thing that solves the problem" and "the thing that decides how to improve" into one editable program. This lets the AI improve not just its performance, but its improvement strategy — a loop that can potentially accelerate without limit.

🔑 Key Takeaway #2

DGM-Hyperagents work across four very different domains (coding, paper review, robotics, math grading) — all starting from the same minimal initial agent. No domain-specific engineering required.

🔑 Key Takeaway #3

The system autonomously invents its own tools for improvement: memory systems, performance trackers, bias detectors, strategic planners. These skills transfer across domains and compound across runs.

🔑 Key Takeaway #4

Both self-improvement AND open-ended exploration are necessary. Remove either one and progress stalls. The archive of diverse agent variants is crucial — some bad agents have great grandchildren.

🔑 Key Takeaway #5

This is early evidence, not a finished product. Safety considerations are real and acknowledged. But this work suggests a path toward AI systems that don't just search for better solutions — they continually improve their search for how to improve.