HyperAgents are self-improving AI systems that don't just search for better solutions — they continually improve how they search.
Zhang, Zhao, Yang, Foerster, Clune, Jiang, Devlin & Shavrina | UBC, Vector Institute, Edinburgh, NYU, Meta
Imagine you're learning to cook. You follow a recipe, taste the result, and adjust. Over time, you get better at that dish. But what if you could also upgrade the recipe book itself — refine your tasting technique, invent new cooking methods, even learn how to learn from failures faster? That's the difference between regular improvement and meta-improvement.
Most AI systems that "improve themselves" are stuck at the first level. They have a fixed improvement strategy — like always following the same recipe book. The strategy was designed by a human, and the AI can never change it. This creates a hard ceiling on progress.
A human designs the improvement recipe. The AI follows it exactly, forever. If the recipe has blind spots, the AI can never overcome them.
The AI can rewrite its own improvement recipe. It notices what works, invents new strategies, and builds better tools for future upgrades.
This paper introduces hyperagents — AI programs that can modify every part of themselves, including the part responsible for deciding how to modify themselves. The result? AI that improves its ability to improve, across any domain — not just coding.
Before we get to the breakthrough, let's build up three key concepts, one at a time.
Think of a restaurant. The task agent is the chef — it cooks the food (solves the actual problem). The meta agent is the restaurant manager — it decides how to train the chef, what recipes to try, and how to restructure the kitchen. In AI, the task agent does the work; the meta agent modifies the system to do work better.
Imagine evolution in nature. It doesn't just keep the single "best" creature — it maintains a diverse population where different traits can recombine in surprising ways. Similarly, the system here maintains an archive — a growing library of agent variants. New agents are created by selecting promising "parents" from the archive and modifying them. Some variants are stepping stones: they're not great on their own, but their descendants become excellent.
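The selection pressure described above can be sketched in a few lines. Everything here is illustrative (the paper does not publish this code, and names like `sample_parent` are hypothetical); the key property is that strong agents are favored but weak ones stay selectable as potential stepping stones.

```python
import math
import random

def sample_parent(archive, temperature=1.0):
    """Pick a parent from the archive, biased toward higher scores but
    never excluding weak agents: they may be stepping stones."""
    # Softmax over scores keeps every variant selectable with some probability.
    weights = [math.exp(a["score"] / temperature) for a in archive]
    return random.choices(archive, weights=weights, k=1)[0]

# Hypothetical archive entries (scores illustrative, not from the paper).
archive = [
    {"name": "gen0", "score": 0.08},
    {"name": "gen3", "score": 0.21},
    {"name": "gen7", "score": 0.15},  # weak now, but its descendants may shine
]
parent = sample_parent(archive)
```

Lowering `temperature` sharpens the bias toward top scorers; raising it flattens selection toward uniform, trading exploitation for exploration.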
The Darwin Gödel Machine (DGM) was a prior system that combined self-modification with open-ended exploration — but only for coding tasks. It worked because the thing the agent was evaluated on (writing code) was the same skill it needed to improve itself (also writing code). But what if you want an AI that reviews research papers? Getting better at reviewing papers doesn't automatically make the AI better at rewriting its own code. The skills don't align. That's the gap hyperagents fill.
Here's the key insight: put the chef and the manager in the same editable document. When the AI modifies itself, it can change how it solves tasks and how it decides what changes to make. This is called metacognitive self-modification — improving the improvement process itself.
The system begins with a bare-bones hyperagent: a task agent that just calls an AI model once, and a meta agent that simply says "modify any part of the codebase." No clever tricks.
A parent hyperagent is selected from the archive. Its meta agent reads the code, reviews past performance, and proposes modifications — to any part: the task agent, the meta agent, or both.
The new hyperagent is tested on real tasks (coding problems, paper reviews, robotics challenges, math grading). Its performance is recorded.
The new variant joins the archive. Better-performing agents are more likely to be selected as parents. Over time, the archive becomes a rich landscape of increasingly capable agents — and increasingly capable improvers.
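Taken together, the four steps above form a simple outer loop. The sketch below is a schematic, not the paper's API; each function stands in for the corresponding component described in the text.

```python
def open_ended_loop(initial_agent, tasks, iterations,
                    select_parent, modify, evaluate):
    """Schematic outer loop: seed the archive, then repeatedly select a
    parent, let its meta agent propose a variant, evaluate, and archive."""
    archive = [{"agent": initial_agent,
                "score": evaluate(initial_agent, tasks)}]
    for _ in range(iterations):
        parent = select_parent(archive)            # biased toward high scores
        child = modify(parent["agent"])            # meta agent edits any part,
                                                   # including itself
        score = evaluate(child, tasks)             # real benchmark tasks
        archive.append({"agent": child, "score": score})  # nothing is discarded
    return archive
```

Note that the archive only grows: even low-scoring variants remain available as parents, which is what lets stepping stones pay off later.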
You might think: "Why not just add a meta-meta agent to improve the meta agent?" The problem is that creates an infinite regress — you always need one more level above. Hyperagents solve this by making the system self-referential: the meta agent is part of the same program it modifies. It's like a computer program that can edit its own source code, including the part that decides which edits to make. Because a hyperagent is ordinary Python code, and Python is Turing-complete, it can in principle implement any computable improvement strategy.
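As a toy illustration of that self-reference (not the paper's implementation), here the task logic and the meta logic live in one source string, and the meta step rewrites that string — including, incidentally, its own definition:

```python
# A "hyperagent" reduced to a single editable source string.
SOURCE = '''
def solve(x):           # task agent: solves the actual problem
    return x + 1

def propose_edit(src):  # meta agent: may edit solve() OR propose_edit()
    return src.replace("x + 1", "x + 2")
'''

namespace = {}
exec(SOURCE, namespace)                          # load the current hyperagent
new_source = namespace["propose_edit"](SOURCE)   # meta agent rewrites the source
namespace2 = {}
exec(new_source, namespace2)                     # load the modified hyperagent
```

After the edit, `solve` behaves differently — and because the replacement also touched the text of `propose_edit` itself, the improver changed along with the solver, which is exactly the property the regress argument is about.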
The researchers tested DGM-Hyperagents (DGM-H) on four wildly different tasks, demonstrating that the approach isn't limited to one domain. Each experiment ran 5 times with medians and 95% confidence intervals reported.
Task: Given a code repository and a natural-language instruction, modify the code correctly. Tested with pass@1 (no peeking at test results).
Why it matters: This is the DGM's home turf. Can DGM-H match a system custom-built for coding?
Result: DGM-H improved from 0.084 → 0.267 on the full benchmark (165 unseen tasks). The original DGM achieved 0.142 → 0.307. DGM-H reaches comparable performance despite not being designed for coding.
Training subset: 60 tasks. Full test: 165 tasks. Foundation models (FM): Claude 3.5 Sonnet (modification) + o3-mini (evaluation). 80 iterations.
Task: Read a full AI research paper and predict accept/reject, matching real conference decisions from ICLR 2024/2025 and NeurIPS 2023/2024.
Why it matters: This is a subjective task with no objective ground truth — a hard test for self-improvement.
Result: DGM-H improved from 0.0 → 0.710 on test tasks. The initial agent scored 0.0 (couldn't even format outputs correctly). The static baseline (AI-Scientist-v2) scored 0.630. DGM-H surpassed it.
100 train / 100 validation / 100 test tasks. FM: Claude 4.5 Sonnet (modification) + GPT-4o (evaluation). 100 iterations.
Task: Design a reward function so a simulated quadruped robot learns a specific behavior (walking forward during training, maximizing torso height at test).
Why it matters: This goes beyond language — agents must interact with physics simulators and reinforcement learning.
Result: DGM-H improved from 0.060 → 0.372 on test tasks, surpassing the default reward function (0.348). Crucially, DGM-H agents learned to make the robot jump rather than just stand tall — escaping a local optimum that trapped other methods.
6 train / 6 test evaluations. FM: Claude 4.5 Sonnet for both. 100 iterations.
Task: Grade solutions to International Mathematical Olympiad problems (scores: 0, 1, 6, or 7 out of 7), matching expert human grades.
Why it matters: This domain was held out as a meta-evaluation — testing whether improvements learned elsewhere transfer here.
Result: With transfer from other domains + continued self-improvement, DGM-H improved ProofAutoGrader's accuracy from 0.561 → 0.601 on the full IMO-GradingBench, lowering mean absolute error from 0.178 to 0.175.
100 train / 100 validation / 100 test tasks. FM: Claude 4.5 Sonnet (modification) + o4-mini (evaluation). 200 iterations.
The researchers systematically removed components to prove each one is necessary:
| Method | Paper Review (Test) | Robotics (Test) |
|---|---|---|
| Initial Agent | 0.0 | 0.060 |
| DGM (original, coding-specific) | 0.0 | 0.0 |
| DGM-custom (manually adapted) | 0.590 | 0.348 |
| DGM-H w/o self-improve | 0.0 | 0.213 |
| DGM-H w/o open-ended exploration | 0.0 | 0.116 |
| DGM-H (full system) | 0.710 | 0.372 |
DGM-H significantly outperforms all ablations (p < 0.05, Wilcoxon signed-rank test). Medians over 5 runs with 95% bootstrap CIs.
This is the most remarkable finding. The researchers didn't just show that DGM-H improves task performance — they showed it improves its ability to generate improved agents. And these meta-improvements transfer across domains.
Think of it like this: you learn how to study effectively while preparing for history exams. Then you apply those study skills to chemistry — and they still work, even though the subjects are totally different. That's what happens here.
imp@50 measures how much a meta agent can improve a task agent within 50 iterations. The initial meta agent achieves imp@50 ≈ 0 on Olympiad math grading — it can't generate better agents. Transfer agents from the original DGM also achieve ≈ 0. But transfer hyperagents from DGM-H achieve imp@50 = 0.630 — a massive leap, in a domain they were never trained on.
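One plausible way to formalize imp@k, assuming it reports the best task score a meta agent can reach within k modification steps (the paper's exact definition may differ; names here are hypothetical):

```python
def imp_at_k(meta_agent, task_agent, evaluate, k=50):
    """Best task-agent score reachable within k modification iterations,
    under a given meta agent. A hedged sketch of the imp@k metric."""
    best = evaluate(task_agent)
    for _ in range(k):
        task_agent = meta_agent(task_agent)   # meta agent proposes a variant
        best = max(best, evaluate(task_agent))
    return best
```

Under this reading, a meta agent that never produces a better variant scores no higher than the initial agent — matching the imp@50 ≈ 0 reported for the initial and DGM-transfer meta agents.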
The researchers found that DGM-H autonomously developed several general-purpose meta-capabilities — none of which were explicitly requested:
The hyperagent invented its own memory system — storing not just performance numbers, but synthesized insights: "Gen55 has best accuracy but is too harsh. Gen64 improved balance but lower accuracy. Need to combine gen55's critical reasoning with gen64's balance." Later generations consult this memory to avoid repeating past mistakes.
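A minimal sketch of such an insight memory (the class and method names are hypothetical; the paper only describes the behavior, not the generated code):

```python
class InsightMemory:
    """Stores free-text insights keyed by generation, so later meta-agent
    prompts can consult lessons from earlier attempts."""
    def __init__(self):
        self.entries = []

    def record(self, generation, insight):
        self.entries.append({"gen": generation, "insight": insight})

    def recent(self, n=5):
        # Surface the last n insights as context for the next modification.
        return "\n".join(f"Gen{e['gen']}: {e['insight']}"
                         for e in self.entries[-n:])
```

The point is not the data structure, which is trivial, but that the system invented the idea of keeping synthesized lessons at all, rather than raw scores alone.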
The system created a PerformanceTracker class that logs scores across generations, computes moving averages, and identifies improvement trends — enabling data-driven decisions about what to change next.
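A sketch of what such a tracker might look like; the actual generated class is not published, so the details below are guesses at the described behavior (score logging, moving averages, trend detection):

```python
class PerformanceTracker:
    """Logs per-generation scores, computes a moving average over a
    window, and reports whether recent generations are improving."""
    def __init__(self, window=5):
        self.window = window
        self.scores = []

    def log(self, score):
        self.scores.append(score)

    def moving_average(self):
        recent = self.scores[-self.window:]
        return sum(recent) / len(recent) if recent else 0.0

    def trend(self):
        # Compare the latest window's average against the previous window's.
        if len(self.scores) < 2 * self.window:
            return "insufficient data"
        prev = sum(self.scores[-2 * self.window:-self.window]) / self.window
        return "improving" if self.moving_average() > prev else "stalling"
```

A "stalling" verdict is the kind of signal a meta agent can act on — for example, by switching from fine-tuning back to bolder structural changes.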
In paper review, the system detected when its predictions collapsed (e.g., accepting 99% of papers). It automatically tracked label distributions, flagged bias when one class exceeded 80%, and surfaced corrective suggestions to the meta agent. After detection and correction, accuracy jumped from 49% to 63%.
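The detection logic described can be sketched as follows; the 80% threshold follows the behavior reported in the text, while the function name and return format are hypothetical:

```python
from collections import Counter

def check_label_bias(predictions, threshold=0.8):
    """Flag prediction collapse: return a warning string when one
    predicted class dominates beyond the threshold, else None."""
    counts = Counter(predictions)
    total = len(predictions)
    for label, n in counts.items():
        if n / total > threshold:
            return f"WARNING: {n / total:.0%} of predictions are '{label}'"
    return None
```

In the reported episode, a warning like this was surfaced to the meta agent, which then corrected the collapsed reviewer.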
The hyperagent learned to allocate its compute budget strategically: making bold architectural changes early (when there are many iterations left) and conservative fine-tuning late. This mirrors human intuition about exploration vs. exploitation, but was discovered autonomously.
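A minimal sketch of such a schedule, assuming a simple linear decay of "boldness" over the run; the exact form of the discovered strategy is not given in the text, so this is only one way it could look:

```python
def choose_change_type(iteration, total_iterations, floor=0.1):
    """Budget-aware exploration: prefer bold architectural changes while
    many iterations remain, conservative fine-tuning near the end."""
    boldness = max(floor, 1.0 - iteration / total_iterations)
    return "architectural rewrite" if boldness > 0.5 else "conservative fine-tune"
```

Early in a 100-iteration run this returns "architectural rewrite"; late in the run it switches to "conservative fine-tune" — the exploration-to-exploitation shift described above.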
Improvements don't just transfer — they compound. Starting DGM-H from a transferred hyperagent on Olympiad math grading yields higher final performance (0.640) than starting from scratch (0.610). Adding ProofAutoGrader as the initial task agent pushes it to 0.700, surpassing ProofAutoGrader's own score of 0.670.
Performance keeps climbing as iterations increase: the system keeps finding improvements even at later stages, evidence that it is not hitting a ceiling.
A system that can rewrite its own improvement process raises real concerns. The researchers were proactive about this.
All agent-generated code ran in sandboxed environments with strict resource limits (timeouts, no internet access). Evaluation used predefined tasks and metrics. Human oversight was maintained throughout all experiments.
Speed vs. oversight: Self-improving systems may eventually evolve faster than humans can audit. The paper acknowledges this tension and calls for ongoing societal discussion about trust, transparency, and the roles these systems should play.
Bias amplification: If benchmarks encode biases, DGM-H will faithfully optimize for them. Careful benchmark design is essential.
Evaluation gaming: Like Goodhart's law ("when a measure becomes a target, it ceases to be a good measure"), agents might find ways to score well without truly solving the task. Diverse, refreshed evaluation protocols are needed.
Previous self-improving systems needed humans to manually design the improvement recipe for each new domain. DGM-H eliminates this requirement. The same system, starting from the same simple initial agent, self-improves across coding, paper review, robotics, and math grading.
The meta-level skills DGM-H learns (memory systems, performance tracking, strategic planning) are general-purpose. They transfer across domains and continue to compound. This is the first demonstration that how to improve can be learned in a domain-general way.
The system currently works with a fixed task distribution — it can't yet create its own training tasks. Some outer-loop components (parent selection, evaluation protocols) remain fixed in main experiments, though preliminary results show these can also be modified. The difference between DGM-H + transfer and DGM-H from scratch was not statistically significant in the compounding experiment, though the trend was consistently positive.
A hyperagent merges "the thing that solves the problem" and "the thing that decides how to improve" into one editable program. This lets the AI improve not just its performance, but its improvement strategy — a loop that can potentially accelerate without limit.
DGM-Hyperagents work across four very different domains (coding, paper review, robotics, math grading) — all starting from the same minimal initial agent. No domain-specific engineering required.
The system autonomously invents its own tools for improvement: memory systems, performance trackers, bias detectors, strategic planners. These skills transfer across domains and compound across runs.
Both self-improvement AND open-ended exploration are necessary. Remove either one and progress stalls. The archive of diverse agent variants is crucial: some mediocre agents have excellent descendants.
This is early evidence, not a finished product. Safety considerations are real and acknowledged. But this work suggests a path toward AI systems that don't just search for better solutions — they continually improve how they search.