A benchmark for autonomous agent development
We test whether current code agents can autonomously develop other agents. Across 5 domains and 39 configurations, only 5 match the human baseline — and 4 of those 5 are proprietary frontier models.
1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences · 2 University of Chinese Academy of Sciences · 3 Ant Group
Abstract
Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit in which to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement.
The shift
Existing agent benchmarks ask whether a model can execute a fixed task within a human-designed workflow. As models saturate these benchmarks, that question is increasingly satisfied — and increasingly uninteresting. MAC changes the level of abstraction: the meta-agent does not solve the task. It writes the agent that solves the task, and is graded only on the result.
| | Object-level benchmarks | MAC (meta-level) |
|---|---|---|
| What is measured | Solving instances of a task | Building an agent that solves the task |
| Saturates as models improve? | Yes — accuracy ceilings are nearly hit | No — design space is open-ended |
| Reuses existing datasets? | Each new question demands a new dataset | Repurposes AIME, GPQA, LCB, SWE-Bench, TB |
| Probes recursive self-improvement? | No | Yes — directly |
The challenge
Each domain pairs a public benchmark with a fresh dev / test split. The meta-agent only ever sees the dev side — the test set is encrypted on disk in a separate container and the verifier is mounted only after the dev budget expires.
Meta-AIME: Competition mathematics. Dev split: AIME 2022–2023. Test split: AIME 2024–2025. Integer answers in [0, 999].
Meta-GPQA: Graduate-level science. Dev split drawn from HLE multiple-choice; test split from GPQA Diamond. Includes a 2,500-call search quota.
Meta-LiveCodeBench: Competitive programming from LiveCodeBench. Submissions are executed against hidden test cases; pass@1 is reported.
Meta-SWE-Bench: Repository-level patches from SWE-Bench Verified. Dev and test splits have minimal repository overlap; execution and grading run via Harbor.
Meta-Terminal-Bench: Long-horizon shell tasks. Dev split: Terminal-Bench Pro; test split: Terminal-Bench 2.0. Binary per-task grading.
Reasoning artifacts call Qwen3-8B on a dedicated A100 vLLM backend; agentic artifacts call Claude Haiku 4.5. Both are accessed exclusively through a quota-enforced proxy.
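To make the idea of a quota-enforced proxy concrete, here is a minimal sketch. It is not MAC's actual proxy: the class `QuotaProxy`, the exception `QuotaExceeded`, the `max_calls` parameter, and the stand-in `fake_backend` are all hypothetical names chosen for illustration, and the real system presumably enforces quotas over the network rather than in-process.

```python
class QuotaExceeded(RuntimeError):
    """Raised when a caller has used up its call budget."""


class QuotaProxy:
    """Hypothetical wrapper that counts and logs calls before forwarding them."""

    def __init__(self, backend, max_calls):
        self._backend = backend      # any callable: prompt -> completion
        self._max_calls = max_calls  # total calls allowed (assumed unit: calls)
        self.calls_used = 0
        self.log = []                # (call index, prompt length) records

    def complete(self, prompt):
        # Refuse the call before it reaches the backend once the quota is spent.
        if self.calls_used >= self._max_calls:
            raise QuotaExceeded(f"quota of {self._max_calls} calls exhausted")
        self.calls_used += 1
        self.log.append((self.calls_used, len(prompt)))
        return self._backend(prompt)


def fake_backend(prompt):
    # Stand-in for a real model endpoint, which is not reachable in this sketch.
    return prompt.upper()


proxy = QuotaProxy(fake_backend, max_calls=2)
print(proxy.complete("first"))
print(proxy.complete("second"))
try:
    proxy.complete("third")
except QuotaExceeded as err:
    print("blocked:", err)
```

Routing every model call through one object like this is what lets usage be both capped and logged in a single place.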
Evaluation protocol
A trial is a two-phase loop inside a dual-container architecture. The meta-agent never reaches the test split — not even by mistake.
Agent container (/workspace, agent.py): The meta-agent reads the task, edits agent.py, calls the dev-set evaluator, and iterates within T_dev and R_api,dev.
Evaluation container: When the budget expires, the static analyzer scans the workspace, the artifact runs on the held-out test set, predictions are graded, and the final reward is recorded.
Because D_test is hidden during development, the meta-agent cannot optimize against the test set directly. It must rely on empirical feedback from the dev set to iteratively propose, evaluate, and refine, mirroring the trial-and-error cycle of a human developer.
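The propose, evaluate, refine cycle can be sketched as a budgeted loop. This is an illustrative toy, not the benchmark's harness: `develop`, `propose`, `evaluate_dev`, `time_budget_s`, and `call_budget` are hypothetical names, and the two budgets are only loose stand-ins for the time and API limits described above.

```python
import time


def develop(propose, evaluate_dev, time_budget_s, call_budget):
    """Propose, evaluate, and refine until a budget runs out; keep the best artifact."""
    deadline = time.monotonic() + time_budget_s
    best_artifact, best_score = None, float("-inf")
    history = []  # (artifact, dev score) pairs seen so far
    calls = 0
    while calls < call_budget and time.monotonic() < deadline:
        artifact = propose(history)     # design step, informed by past feedback
        score = evaluate_dev(artifact)  # dev-set feedback only; test set is never seen
        calls += 1
        history.append((artifact, score))
        if score > best_score:
            best_artifact, best_score = artifact, score
    return best_artifact, best_score, calls


# Toy usage: the "artifact" is a single integer and the dev score peaks at 3.
best, score, calls = develop(
    propose=lambda history: len(history),
    evaluate_dev=lambda x: -(x - 3) ** 2,
    time_budget_s=5.0,
    call_budget=6,
)
print(best, score, calls)
```

Only the artifact that scored best on the dev set is kept, which is the one that would then be run on the hidden test set.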
Integrity
Strong optimization pressure surfaces emergent adversarial behavior. Inside the dual-container architecture, MAC stacks five independent layers addressing two threat classes — unauthorized resource access and test-set leakage.
| # | Layer | Mechanism & threat addressed |
|---|---|---|
| 1 | API monitoring | Static code analysis of agent.py scans for unauthorized imports and network endpoints. Threat: unauthorized resource access. |
| 2 | API proxy | All model calls are routed through a quota-enforcing proxy in the eval container; usage is logged. Threat: unauthorized resource access. |
| 3 | Container filesystem separation | Ground truth for both D_eval and D_test resides only in the eval container's private filesystem. Threat: test-set leakage. |
| 4 | Split-level access control | Test-split scoring requires a cryptographic X-Verifier-Secret header; the verifier is injected only after the dev phase ends. Threat: test-set leakage. |
| 5 | Post-hoc auditor agent | A Claude Opus 4.6 auditor reads the full trial directory and emits CLEAN / SUSPICIOUS / CHEATING with evidence. Threat: both. |
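As an illustration of the first layer, a static scan for unauthorized imports can be written with Python's standard `ast` module. This is a sketch under assumptions, not MAC's analyzer: the allowlist contents and the function name `unauthorized_imports` are hypothetical, and a scan of this kind does not catch dynamic imports (for example via `importlib` or `__import__`) or hardcoded network endpoints.

```python
import ast

ALLOWED_IMPORTS = {"json", "math", "re"}  # hypothetical allowlist for illustration


def unauthorized_imports(source, allowed=ALLOWED_IMPORTS):
    """Return the sorted top-level module names imported outside the allowlist."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            names = [node.module or "<relative>"]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in allowed:
                found.add(root)
    return sorted(found)


sample = "import json\nimport socket\nfrom urllib.request import urlopen\n"
print(unauthorized_imports(sample))  # ['socket', 'urllib']
```

Because the source is parsed rather than executed, the scan can run on an untrusted artifact without giving it a chance to act.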
Results
Each cell reports the mean and standard deviation over three independent runs. Across all 39 configurations, only five — Claude Opus 4.6 / 4.7 and Sonnet 4.6 on a small subset of domains, plus DeepSeek-v4-Pro on one — clear the corresponding human baseline.
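For readers who want to reproduce the cell format, the following sketch turns three per-run scores into a "mean ± std" string. The run scores are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the source does not say whether the sample or population estimator was used.

```python
from statistics import mean, stdev


def summarize(run_scores):
    """Format per-run scores as 'mean ± std' with three decimals."""
    # stdev is the sample (n - 1) estimator; swap in pstdev for the population form.
    return f"{mean(run_scores):.3f} ± {stdev(run_scores):.3f}"


hypothetical_runs = [0.70, 0.75, 0.80]  # made-up scores, not taken from the tables
print(summarize(hypothetical_runs))  # 0.750 ± 0.050
```

With only three runs per cell, the two estimators differ noticeably, so the choice matters when comparing against the reported spreads.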
Artifact API model: Qwen3-8B (A100 vLLM backend).
| Model | Scaffold | Meta-AIME | Meta-GPQA | Meta-LiveCodeBench |
|---|---|---|---|---|
| Human Baseline | — | 0.733 ± 0.029 | 0.597 ± 0.020 | 0.555 ± 0.011 |
| Claude-Opus-4.6 | Claude Code | 0.744 ± 0.054‡ | 0.572 ± 0.049‡ | 0.557 ± 0.043‡ |
| Claude-Sonnet-4.6 | Claude Code | 0.783 ± 0.017‡ | 0.383 ± 0.332 | 0.446 ± 0.133‡ |
| Gemini-3.1-Pro | Gemini-cli | 0.617 ± 0.174‡ | 0.541 ± 0.036‡ | 0.300 ± 0.204‡ |
| GLM-5 | Claude Code | 0.355 ± 0.094 | 0.542 ± 0.026‡ | 0.231 ± 0.078‡ |
| Kimi-K2.5 | Claude Code | 0.350 ± 0.335 | 0.257 ± 0.070 | 0.027 ± 0.021 |
| MiniMax-M2.5 | Claude Code | 0.306 ± 0.084 | 0.363 ± 0.147 | 0.260 ± 0.079‡ |
| GPT-5.3-Codex | Codex | 0.217 ± 0.185† | 0.296 ± 0.070 | 0.266 ± 0.056 |
Artifact API model: Claude Haiku 4.5.
| Model | Scaffold | Meta-SWE-Bench | Meta-Terminal-Bench |
|---|---|---|---|
| Human Baseline | Terminus-2 | 0.637 ± 0.030 | 0.326 ± 0.019 |
| Human Baseline | OpenHands | 0.544 ± 0.008 | 0.285 ± 0.053 |
| Claude-Opus-4.7 | Claude Code | 0.609 ± 0.064‡ | 0.393 ± 0.034 |
| Claude-Opus-4.6 | Claude Code | 0.443 ± 0.201‡ | 0.262 ± 0.036‡ |
| Claude-Sonnet-4.6 | Claude Code | 0.373 ± 0.136‡ | 0.296 ± 0.051‡ |
| GLM-5.1 | Claude Code | 0.476 ± 0.045‡ | 0.255 ± 0.017 |
| DeepSeek-v4-Pro | Claude Code | 0.323 ± 0.173 | 0.345 ± 0.028 |
| Gemini-3.1-Pro | Gemini-cli | 0.393 ± 0.126†‡ | 0.232 ± 0.073‡ |
| GPT-5.4 | Codex | 0.245 ± 0.226 | 0.183 ± 0.034 |
| GPT-5.3-Codex | Codex | 0.293 ± 0.202† | 0.180 ± 0.039† |
| MiniMax-M2.7 | Claude Code | 0.004 ± 0.004 | 0.045 ± 0.051 |
What we learned
To understand what separates strong meta-agent runs from weak ones, we instrument every trial with six development-time features parsed from the evaluation log: total runtime, time-to-first eval call, number of eval calls, eval-call success rate, temporal centroid of eval calls, and mean inter-call interval. We then regress the final reward on each feature after subtracting the per-domain mean.
Two signals dominate. Mean inter-call interval is the single strongest predictor of performance, and total runtime is the next-strongest. The features a naive view of "iterative optimization" would prioritize — number of eval calls, success rate, time-to-first-eval, temporal centroid — carry little signal. Successful meta-agents do not treat the evaluation endpoint as a high-frequency feedback signal: they think longer between calls, invest more total compute in artifact design, and probe the scorer sparingly.
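The regression described above can be sketched in a few lines. Everything here is illustrative: the trial records are made up, the field names `domain`, `interval_s`, and `reward` are hypothetical, and the sketch assumes that only the reward is centered per domain before a one-feature least-squares fit (the source does not specify whether the features are centered as well).

```python
from collections import defaultdict
from statistics import mean


def demean_by_domain(rows, key):
    """Subtract each domain's mean of `key` from every row in that domain."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["domain"]].append(row[key])
    centers = {domain: mean(values) for domain, values in groups.items()}
    return [row[key] - centers[row["domain"]] for row in rows]


def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical trials: mean inter-call interval (seconds) and final reward.
TRIALS = [
    {"domain": "A", "interval_s": 10, "reward": 0.2},
    {"domain": "A", "interval_s": 30, "reward": 0.6},
    {"domain": "B", "interval_s": 20, "reward": 0.5},
    {"domain": "B", "interval_s": 40, "reward": 0.9},
]

centered_reward = demean_by_domain(TRIALS, "reward")
slope = ols_slope([t["interval_s"] for t in TRIALS], centered_reward)
print(f"reward change per extra second between calls: {slope:.3f}")
```

Centering per domain removes the fact that some domains are simply harder than others, so the slope reflects differences between runs within a domain.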
Beyond these macro patterns, three findings hold consistently across every domain.
Only 5 of 39 configurations exceed the human baseline average; 4 of those 5 are Claude. No meta-agent fully surpasses the baseline on Meta-GPQA or Meta-SWE-Bench, and only one open-weight model — DeepSeek-v4-Pro on Terminal-Bench — clears any human bar at all.
33% of configurations have a standard deviation greater than 0.1, compared to a maximum of 0.053 among human baselines. Current models can occasionally synthesize a strong agent — they cannot do it reliably. The variance is not noise; it is the signature of unstable decision-making in an open-ended design space.
The post-hoc auditor flagged four trials covering distinct exploit classes, including hardcoded answer dictionaries, brute-force enumeration of the eval API, and proxy-bypass attempts. The dual-container isolation and proxy enforcement neutralized every one of them, and no flagged run inflated its test score.
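The headline counts in the first two findings can be rechecked from the results tables. The sketch below copies the reported means and standard deviations and recomputes them. One assumption is labeled in the code: a configuration "clears" the baseline when its mean exceeds the strongest human baseline in that domain, which is the reading that reproduces the count of five.

```python
# (mean, std) per cell, copied from the two results tables.
TABLE_1 = {  # columns: Meta-AIME, Meta-GPQA, Meta-LiveCodeBench
    "Claude-Opus-4.6": [(0.744, 0.054), (0.572, 0.049), (0.557, 0.043)],
    "Claude-Sonnet-4.6": [(0.783, 0.017), (0.383, 0.332), (0.446, 0.133)],
    "Gemini-3.1-Pro": [(0.617, 0.174), (0.541, 0.036), (0.300, 0.204)],
    "GLM-5": [(0.355, 0.094), (0.542, 0.026), (0.231, 0.078)],
    "Kimi-K2.5": [(0.350, 0.335), (0.257, 0.070), (0.027, 0.021)],
    "MiniMax-M2.5": [(0.306, 0.084), (0.363, 0.147), (0.260, 0.079)],
    "GPT-5.3-Codex": [(0.217, 0.185), (0.296, 0.070), (0.266, 0.056)],
}
TABLE_2 = {  # columns: Meta-SWE-Bench, Meta-Terminal-Bench
    "Claude-Opus-4.7": [(0.609, 0.064), (0.393, 0.034)],
    "Claude-Opus-4.6": [(0.443, 0.201), (0.262, 0.036)],
    "Claude-Sonnet-4.6": [(0.373, 0.136), (0.296, 0.051)],
    "GLM-5.1": [(0.476, 0.045), (0.255, 0.017)],
    "DeepSeek-v4-Pro": [(0.323, 0.173), (0.345, 0.028)],
    "Gemini-3.1-Pro": [(0.393, 0.126), (0.232, 0.073)],
    "GPT-5.4": [(0.245, 0.226), (0.183, 0.034)],
    "GPT-5.3-Codex": [(0.293, 0.202), (0.180, 0.039)],
    "MiniMax-M2.7": [(0.004, 0.004), (0.045, 0.051)],
}
# Assumption: the bar to clear is the strongest human baseline mean per column.
BARS_1 = [0.733, 0.597, 0.555]
BARS_2 = [0.637, 0.326]

configs = []
for table, bars in ((TABLE_1, BARS_1), (TABLE_2, BARS_2)):
    for model, cells in table.items():
        for (cell_mean, cell_std), bar in zip(cells, bars):
            configs.append((model, cell_mean, cell_std, bar))

total = len(configs)
winners = [model for model, cell_mean, _, bar in configs if cell_mean > bar]
high_variance = sum(1 for _, _, cell_std, _ in configs if cell_std > 0.1)

print(total, len(winners), high_variance, round(100 * high_variance / total))
# 39 5 13 33
```

Under this reading, four of the five winning configurations are Claude models and thirteen of the 39 configurations have a standard deviation above 0.1, matching the figures quoted in the findings.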
Cite this work
If you use MAC in your research, please cite the paper.
@misc{lu2026mac,
  title = {The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?},
  author = {Xinyu Lu and Tianshu Wang and Pengbo Wang and Zujie Wen and Zhiqiang Zhang and Jun Zhou and Boxi Cao and Yaojie Lu and Hongyu Lin and Xianpei Han and Le Sun},
  year = {2026},
  eprint = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}