A benchmark for autonomous agent development

The Meta-Agent Challenge.

We test whether current code agents can autonomously develop other agents. Across 5 domains and 39 configurations, only 5 match the human baseline — and 4 of those 5 are proprietary frontier models.

Xinyu Lu (1,2)  ·  Tianshu Wang (3)  ·  Pengbo Wang (1,2)  ·  Zujie Wen (3)  ·  Zhiqiang Zhang (3)  ·  Jun Zhou (3)
Boxi Cao (1)  ·  Yaojie Lu (1)  ·  Hongyu Lin (1)  ·  Xianpei Han (1)  ·  Le Sun (1)

(1) Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences  ·  (2) University of Chinese Academy of Sciences  ·  (3) Ant Group

Abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit, and must iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement.

The shift

From solving tasks to building agents.

Existing agent benchmarks ask whether a model can execute a fixed task within a human-designed workflow. As models saturate these benchmarks, that question is increasingly satisfied — and increasingly uninteresting. MAC changes the level of abstraction: the meta-agent does not solve the task. It writes the agent that solves the task, and is graded only on the result.

| | Object-level benchmarks | MAC (meta-level) |
|---|---|---|
| What is measured | Solving instances of a task | Building an agent that solves the task |
| Saturates as models improve? | Yes: accuracy ceilings are nearly hit | No: design space is open-ended |
| Reuses existing datasets? | Each new question demands a new dataset | Repurposes AIME, GPQA, LCB, SWE-Bench, TB |
| Probes recursive self-improvement? | No | Yes, directly |

The challenge

Five domains, one challenge.

Each domain pairs a public benchmark with a fresh dev / test split. The meta-agent only ever sees the dev side — the test set is encrypted on disk in a separate container and the verifier is mounted only after the dev budget expires.

Meta-AIME aime-meta-agent/

Competition mathematics. Dev split: AIME 2022–2023. Test split: AIME 2024–2025. Integer answers in [0, 999].

Dev / Test: 60 / 60  ·  Dev budget: 12h

Meta-GPQA science-meta-agent/

Graduate-level science. Dev split drawn from HLE multiple-choice; test split from GPQA Diamond. Includes a 2,500-call search quota.

Dev / Test: 591 / 198  ·  Dev budget: 12h

Meta-LiveCodeBench lcb-meta-agent/

Competitive programming from LiveCodeBench. Submissions are executed against hidden test cases; pass@1 is reported.

Dev / Test: 732 / 323  ·  Dev budget: 12h

Meta-SWE-Bench swe-meta-agent/

Repository-level patches from SWE-Bench Verified. Dev and test splits have minimal repository overlap; execution and grading run via Harbor.

Dev / Test: 250 / 250  ·  Dev budget: 24h

Meta-Terminal-Bench tb-meta-agent/

Long-horizon shell tasks. Dev split: Terminal-Bench Pro; test split: Terminal-Bench 2.0. Binary per-task grading.

Dev / Test: 200 / 89  ·  Dev budget: 24h

Reasoning artifacts call Qwen3-8B on a dedicated A100 vLLM backend; agentic artifacts call Claude Haiku 4.5. Both are accessed exclusively through a quota-enforced proxy.
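To make the idea of a quota-enforced proxy concrete, here is a minimal sketch of a call counter that refuses requests once a budget is spent. It is an illustration only, not MAC's proxy: the names `QuotaProxy`, `QuotaExceeded`, and `echo_backend` are hypothetical, and the real proxy sits in front of a network endpoint rather than a local function.

```python
import threading


class QuotaExceeded(RuntimeError):
    """Raised when a caller has used up its call budget."""


class QuotaProxy:
    """Wraps a model-calling function and enforces a hard cap on calls.

    `backend` is any callable taking a prompt string and returning text.
    It stands in for the real model endpoint, which is not shown here.
    """

    def __init__(self, backend, max_calls):
        self._backend = backend
        self._max_calls = max_calls
        self._used = 0
        self._lock = threading.Lock()
        self.log = []  # one (call_index, prompt_length) entry per accepted call

    def call(self, prompt):
        # Reserve a slot atomically so concurrent callers cannot overshoot.
        with self._lock:
            if self._used >= self._max_calls:
                raise QuotaExceeded(f"quota of {self._max_calls} calls exhausted")
            self._used += 1
            self.log.append((self._used, len(prompt)))
        return self._backend(prompt)

    @property
    def remaining(self):
        with self._lock:
            return self._max_calls - self._used


def echo_backend(prompt):
    # Placeholder backend for demonstration only.
    return f"echo: {prompt}"


proxy = QuotaProxy(echo_backend, max_calls=2)
print(proxy.call("first"), "| remaining:", proxy.remaining)
```

Counting the call before forwarding it means a failed or slow backend request still consumes quota, which is the conservative choice when the budget is meant to bound cost.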

Evaluation protocol

Two phases, two containers, one trial.

A trial is a two-phase loop inside a dual-container architecture. The meta-agent never reaches the test split — not even by mistake.

Agent container

  • Dev set, base class, /workspace
  • Reads & writes agent.py
  • Quota'd LLM proxy
  • No route to test split

Evaluation container

  • Test set (AES-encrypted)
  • Evaluation oracle
  • Verifier (injected at scoring time)
  • Holds the only decryption key

Development phase

The meta-agent reads the task, edits agent.py, calls the dev-set evaluator, and iterates within the development time limit T_dev and the development API budget R_api,dev.

Verification phase

When the budget expires, the static analyzer scans the workspace, the artifact runs on the held-out test set, predictions are graded, and the final reward is recorded.
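The static scan in the verification phase can be pictured with Python's standard `ast` module. The sketch below is an assumption about how such a check might look, not the benchmark's analyzer: the deny-list `FORBIDDEN_MODULES` and the function `find_forbidden_imports` are hypothetical.

```python
import ast

# Hypothetical deny-list; the benchmark's actual rules are not specified here.
FORBIDDEN_MODULES = {"socket", "requests", "urllib", "http", "subprocess"}


def find_forbidden_imports(source, forbidden=FORBIDDEN_MODULES):
    """Return sorted (line, module) pairs for imports of forbidden modules."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports ("from . import x") have module=None.
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in forbidden:
                hits.append((node.lineno, name))
    return sorted(hits)


example = '''
import json
import urllib.request
from socket import create_connection

def solve(task):
    return json.dumps(task)
'''

print(find_forbidden_imports(example))
# → [(3, 'urllib.request'), (4, 'socket')]
```

A scan of this kind only sees literal import statements. Dynamic imports through `__import__` or `importlib` would slip past it, which is one reason a static check is usually paired with runtime isolation.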

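\[
\begin{aligned}
A^{*} &= \arg\max_{A \in \mathcal{A}} \ \mathrm{Score}(A, \mathcal{D}_{\mathrm{test}}) \\
\text{s.t.}\quad & \mathrm{Time}_{\mathrm{dev}}(\boldsymbol{M}) \le T_{\mathrm{dev}}, \quad \mathrm{Cost}(\boldsymbol{M}) \le R_{\mathrm{api,dev}}, \\
& \mathrm{Time}_{\mathrm{test}}(A) \le T_{\mathrm{test}}, \quad \mathrm{Cost}(A) \le R_{\mathrm{api,test}}
\end{aligned}
\]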

Because D_test is hidden during development, the meta-agent cannot solve this directly. It must rely on empirical feedback from the dev set to iteratively propose, evaluate, and refine, mirroring the trial-and-error cycle of a human developer.
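The propose, evaluate, refine cycle under a time and call budget can be sketched as a simple loop. This is a toy illustration under stated assumptions, not how any meta-agent in MAC works: `develop`, `toy_propose`, and `toy_dev_score` are hypothetical names, and the "artifact" here is just an integer.

```python
import time


def develop(propose, evaluate_dev, time_budget_s, eval_budget):
    """Propose-evaluate-refine under a wall-clock and an evaluation-call budget.

    propose(history) -> candidate       (hypothetical: designs the next artifact)
    evaluate_dev(candidate) -> float    (hypothetical: dev-set score in [0, 1])
    """
    deadline = time.monotonic() + time_budget_s
    history = []  # (candidate, dev_score) pairs seen so far
    best, best_score = None, float("-inf")
    while len(history) < eval_budget and time.monotonic() < deadline:
        candidate = propose(history)
        score = evaluate_dev(candidate)
        history.append((candidate, score))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, history


def toy_propose(history):
    # Start at 0, then step one past the best candidate seen so far.
    if not history:
        return 0
    best_candidate, _ = max(history, key=lambda pair: pair[1])
    return best_candidate + 1


def toy_dev_score(candidate):
    # Synthetic objective that peaks at candidate == 7.
    return 1.0 - abs(candidate - 7) / 10.0


best, score, history = develop(toy_propose, toy_dev_score,
                               time_budget_s=5.0, eval_budget=10)
print(best, score, len(history))
# → 7 1.0 10
```

The loop selects by dev score because that is the only signal available during development. The graded quantity is the test score, so a candidate that overfits the dev set can still lose at verification time.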

Integrity

Five layers of defense.

Strong optimization pressure surfaces emergent adversarial behavior. Inside the dual-container architecture, MAC stacks five independent layers addressing two threat classes — unauthorized resource access and test-set leakage.

| # | Layer | Mechanism & threat addressed |
|---|---|---|
| 1 | API monitoring | Static code analysis of agent.py scans for unauthorized imports and network endpoints. Threat: unauthorized resource access. |
| 2 | API proxy | All model calls are routed through a quota-enforcing proxy in the eval container; usage is logged. Threat: unauthorized resource access. |
| 3 | Container filesystem separation | Ground truth for both D_eval and D_test resides only in the eval container's private filesystem. Threat: test-set leakage. |
| 4 | Split-level access control | Test-split scoring requires a cryptographic X-Verifier-Secret header; the verifier is injected only after the dev phase ends. Threat: test-set leakage. |
| 5 | Post-hoc auditor agent | A Claude Opus 4.6 auditor reads the full trial directory and emits CLEAN / SUSPICIOUS / CHEATING with evidence. Threat: both. |

The auditor flagged 4 trials across the 117 runs for distinct exploit classes: hardcoded answers, brute-force enumeration, proxy-bypass attempts. None inflated a score. Dual-container isolation, split-level authorization, and proxy enforcement neutralized every exploit before scoring.
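Split-level access control (layer 4) can be illustrated with a small authorization check. The sketch assumes the header carries a plain shared secret that the server compares against its own copy. The source does not say how the header is validated, so this is one plausible reading, and `is_authorized` is a hypothetical name.

```python
import hmac


def is_authorized(split, headers, verifier_secret):
    """Decide whether a scoring request may touch the given split.

    Dev scoring is always allowed; test scoring requires the caller to
    present the secret in the X-Verifier-Secret header.
    """
    if split == "dev":
        return True
    if split != "test":
        return False
    presented = headers.get("X-Verifier-Secret", "")
    # compare_digest takes time independent of where the values differ,
    # which avoids leaking the secret through response timing.
    return hmac.compare_digest(presented.encode(), verifier_secret.encode())


secret = "example-secret"  # placeholder value for the demonstration
print(is_authorized("dev", {}, secret))
print(is_authorized("test", {}, secret))
print(is_authorized("test", {"X-Verifier-Secret": secret}, secret))
```

Because the secret only exists inside the evaluation container and the verifier is injected after the dev phase, a meta-agent that guesses the endpoint still has nothing valid to put in the header.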

Results

How well do agents build agents?

Each cell reports the mean and standard deviation over three independent runs. Across all 39 configurations, only five — Claude Opus 4.6 / 4.7 and Sonnet 4.6 on a small subset of domains, plus DeepSeek-v4-Pro on one — clear the corresponding human baseline.

Reasoning domains

Artifact API model: Qwen3-8B (A100 vLLM backend).

| Model | Scaffold | Meta-AIME | Meta-GPQA | Meta-LiveCodeBench |
|---|---|---|---|---|
| Human Baseline | - | 0.733 ± 0.029 | 0.597 ± 0.020 | 0.555 ± 0.011 |
| Claude-Opus-4.6 | Claude Code | 0.744 ± 0.054 | 0.572 ± 0.049 | 0.557 ± 0.043 |
| Claude-Sonnet-4.6 | Claude Code | 0.783 ± 0.017 | 0.383 ± 0.332 | 0.446 ± 0.133 |
| Gemini-3.1-Pro | Gemini-cli | 0.617 ± 0.174 | 0.541 ± 0.036 | 0.300 ± 0.204 |
| GLM-5 | Claude Code | 0.355 ± 0.094 | 0.542 ± 0.026 | 0.231 ± 0.078 |
| Kimi-K2.5 | Claude Code | 0.350 ± 0.335 | 0.257 ± 0.070 | 0.027 ± 0.021 |
| MiniMax-M2.5 | Claude Code | 0.306 ± 0.084 | 0.363 ± 0.147 | 0.260 ± 0.079 |
| GPT-5.3-Codex | Codex | 0.217 ± 0.185 | 0.296 ± 0.070 | 0.266 ± 0.056 |

Agentic domains

Artifact API model: Claude Haiku 4.5.

| Model | Scaffold | Meta-SWE-Bench | Meta-Terminal-Bench |
|---|---|---|---|
| Human Baseline | Terminus-2 | 0.637 ± 0.030 | 0.326 ± 0.019 |
| Human Baseline | OpenHands | 0.544 ± 0.008 | 0.285 ± 0.053 |
| Claude-Opus-4.7 | Claude Code | 0.609 ± 0.064 | 0.393 ± 0.034 |
| Claude-Opus-4.6 | Claude Code | 0.443 ± 0.201 | 0.262 ± 0.036 |
| Claude-Sonnet-4.6 | Claude Code | 0.373 ± 0.136 | 0.296 ± 0.051 |
| GLM-5.1 | Claude Code | 0.476 ± 0.045 | 0.255 ± 0.017 |
| DeepSeek-v4-Pro | Claude Code | 0.323 ± 0.173 | 0.345 ± 0.028 |
| Gemini-3.1-Pro | Gemini-cli | 0.393 ± 0.126 | 0.232 ± 0.073 |
| GPT-5.4 | Codex | 0.245 ± 0.226 | 0.183 ± 0.034 |
| GPT-5.3-Codex | Codex | 0.293 ± 0.202 | 0.180 ± 0.039 |
| MiniMax-M2.7 | Claude Code | 0.004 ± 0.004 | 0.045 ± 0.051 |

Reading the table. Each cell shows mean ± standard deviation across three runs. The best non-human score per column is highlighted. Two cell markers are used: one for cells where ≥1 run was flagged for cheating intent (every flagged run was neutralized by the defense layers, so the score reflects an honest run), and one for cells where ≥1 run exhausted the wall-clock budget.

What we learned

Three findings.

To understand what separates strong meta-agent runs from weak ones, we instrument every trial with six development-time features parsed from the evaluation log: total runtime, time-to-first eval call, number of eval calls, eval-call success rate, temporal centroid of eval calls, and mean inter-call interval. We then regress the final reward on each feature after subtracting the per-domain mean.
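The six features can be derived from a list of timestamped evaluation calls. The sketch below assumes a log reduced to `(seconds_since_start, succeeded)` pairs and defines the temporal centroid as the mean call time divided by total runtime. Both the log shape and that definition are assumptions, and `process_features` is a hypothetical name.

```python
from statistics import mean


def process_features(eval_calls, total_runtime_s):
    """eval_calls: list of (seconds_since_start, succeeded), sorted by time."""
    times = [t for t, _ in eval_calls]
    outcomes = [1.0 if ok else 0.0 for _, ok in eval_calls]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "total_runtime_s": total_runtime_s,
        "time_to_first_eval_s": times[0] if times else None,
        "n_eval_calls": len(times),
        "eval_success_rate": mean(outcomes) if outcomes else None,
        # Assumed definition: 0 means all calls at the start, 1 at the end.
        "temporal_centroid": mean(times) / total_runtime_s if times else None,
        "mean_inter_call_interval_s": mean(gaps) if gaps else None,
    }


# Synthetic two-hour trial with four evaluation calls.
calls = [(600, True), (1800, False), (3600, True), (6000, True)]
for name, value in process_features(calls, total_runtime_s=7200).items():
    print(f"{name}: {value}")
```

A trial with fewer than two calls has no inter-call interval, so the sketch returns `None` there rather than inventing a value.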

[Figure: six panels, one per development-time feature, each plotted against final reward.]
Meta-agent development-process features vs. final reward. Each panel shows one development-time feature against the final reward, with both axes centered by domain mean to control for cross-task difficulty differences. Pearson (r) and Spearman (ρ) correlations are reported per panel.
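Centering by domain mean matters because domains differ in difficulty. The following sketch uses synthetic numbers, not MAC data, to show a feature whose raw correlation with reward is negative across two pooled domains but positive once each domain's mean is subtracted. All function names are hypothetical, and the correlations are implemented by hand to keep the example free of dependencies.

```python
from collections import defaultdict
from math import sqrt


def center_by_domain(rows, key):
    """Subtract each domain's mean of `key` from every row in that domain."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["domain"]].append(row[key])
    means = {d: sum(v) / len(v) for d, v in groups.items()}
    return [row[key] - means[row["domain"]] for row in rows]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Average ranks (1-based), so ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


# Synthetic trials: domain B is harder and also has longer intervals.
rows = [
    {"domain": "A", "interval": 100, "reward": 0.625},
    {"domain": "A", "interval": 200, "reward": 0.750},
    {"domain": "A", "interval": 300, "reward": 0.875},
    {"domain": "B", "interval": 400, "reward": 0.125},
    {"domain": "B", "interval": 500, "reward": 0.250},
    {"domain": "B", "interval": 600, "reward": 0.375},
]

raw_x = [r["interval"] for r in rows]
raw_y = [r["reward"] for r in rows]
cx = center_by_domain(rows, "interval")
cy = center_by_domain(rows, "reward")

print("raw Pearson:", round(pearson(raw_x, raw_y), 3))
print("centered Pearson:", round(pearson(cx, cy), 3))
print("centered Spearman:", round(spearman(cx, cy), 3))
```

In this toy data the pooled correlation is driven entirely by the difficulty gap between the two domains, which is the confound that per-domain centering removes.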

Two signals dominate. Mean inter-call interval is the single strongest predictor of performance, and total runtime is the next-strongest. The features a naive view of "iterative optimization" would prioritize — number of eval calls, success rate, time-to-first-eval, temporal centroid — carry little signal. Successful meta-agents do not treat the evaluation endpoint as a high-frequency feedback signal: they think longer between calls, invest more total compute in artifact design, and probe the scorer sparingly.

Beyond these macro patterns, three findings hold consistently across every domain.

Meta-agents rarely match human scaffolds, and the few that do are dominated by proprietary frontier models.

Only 5 of 39 configurations exceed the mean score of every human baseline in their domain; 4 of those 5 are Claude models. No meta-agent surpasses the human baseline on Meta-GPQA, none beats both human scaffolds on Meta-SWE-Bench, and only one open-weight model, DeepSeek-v4-Pro on Meta-Terminal-Bench, clears any human bar at all.
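The count of five can be reproduced from the mean scores in the two results tables. The sketch assumes that "exceeding the human baseline" means beating the mean of every human scaffold reported for that domain; that reading is an inference from the tables, and the variable names are illustrative.

```python
# Mean scores copied from the results tables; baselines per domain.
HUMAN = {
    "Meta-AIME": [0.733],
    "Meta-GPQA": [0.597],
    "Meta-LiveCodeBench": [0.555],
    "Meta-SWE-Bench": [0.637, 0.544],       # Terminus-2, OpenHands
    "Meta-Terminal-Bench": [0.326, 0.285],  # Terminus-2, OpenHands
}

REASONING_DOMAINS = ["Meta-AIME", "Meta-GPQA", "Meta-LiveCodeBench"]
AGENTIC_DOMAINS = ["Meta-SWE-Bench", "Meta-Terminal-Bench"]

REASONING_MEANS = {
    "Claude-Opus-4.6": [0.744, 0.572, 0.557],
    "Claude-Sonnet-4.6": [0.783, 0.383, 0.446],
    "Gemini-3.1-Pro": [0.617, 0.541, 0.300],
    "GLM-5": [0.355, 0.542, 0.231],
    "Kimi-K2.5": [0.350, 0.257, 0.027],
    "MiniMax-M2.5": [0.306, 0.363, 0.260],
    "GPT-5.3-Codex": [0.217, 0.296, 0.266],
}

AGENTIC_MEANS = {
    "Claude-Opus-4.7": [0.609, 0.393],
    "Claude-Opus-4.6": [0.443, 0.262],
    "Claude-Sonnet-4.6": [0.373, 0.296],
    "GLM-5.1": [0.476, 0.255],
    "DeepSeek-v4-Pro": [0.323, 0.345],
    "Gemini-3.1-Pro": [0.393, 0.232],
    "GPT-5.4": [0.245, 0.183],
    "GPT-5.3-Codex": [0.293, 0.180],
    "MiniMax-M2.7": [0.004, 0.045],
}


def configurations():
    """Yield one (model, domain, mean_score) triple per table cell."""
    tables = ((REASONING_DOMAINS, REASONING_MEANS), (AGENTIC_DOMAINS, AGENTIC_MEANS))
    for domains, table in tables:
        for model, scores in table.items():
            for domain, score in zip(domains, scores):
                yield model, domain, score


all_configs = list(configurations())
# Assumed rule: a configuration must beat the strongest human baseline.
winners = [(m, d) for m, d, s in all_configs if s > max(HUMAN[d])]

print(f"{len(winners)} of {len(all_configs)}")
# → 5 of 39
for model, domain in winners:
    print(model, "on", domain)
```

Under this rule Claude-Opus-4.7 on Meta-SWE-Bench does not count, since 0.609 beats the OpenHands baseline (0.544) but not Terminus-2 (0.637).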

High inter-run variance exposes the brittleness of autonomous design.

33% of configurations (13 of 39) have a standard deviation greater than 0.1, compared to a maximum of 0.053 among human baselines. Current models can occasionally synthesize a strong agent; they cannot do it reliably. The variance is not noise; it is the signature of unstable decision-making in an open-ended design space.
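The variance figures follow directly from the standard deviations in the results tables. A short check, with the values copied from those tables and illustrative variable names:

```python
# Standard deviations per model, in table column order.
REASONING_STDS = [
    ("Claude-Opus-4.6", [0.054, 0.049, 0.043]),
    ("Claude-Sonnet-4.6", [0.017, 0.332, 0.133]),
    ("Gemini-3.1-Pro", [0.174, 0.036, 0.204]),
    ("GLM-5", [0.094, 0.026, 0.078]),
    ("Kimi-K2.5", [0.335, 0.070, 0.021]),
    ("MiniMax-M2.5", [0.084, 0.147, 0.079]),
    ("GPT-5.3-Codex", [0.185, 0.070, 0.056]),
]

AGENTIC_STDS = [
    ("Claude-Opus-4.7", [0.064, 0.034]),
    ("Claude-Opus-4.6", [0.201, 0.036]),
    ("Claude-Sonnet-4.6", [0.136, 0.051]),
    ("GLM-5.1", [0.045, 0.017]),
    ("DeepSeek-v4-Pro", [0.173, 0.028]),
    ("Gemini-3.1-Pro", [0.126, 0.073]),
    ("GPT-5.4", [0.226, 0.034]),
    ("GPT-5.3-Codex", [0.202, 0.039]),
    ("MiniMax-M2.7", [0.004, 0.051]),
]

# Human baseline rows: one reasoning row, two agentic rows.
HUMAN_STDS = [0.029, 0.020, 0.011, 0.030, 0.019, 0.008, 0.053]

all_stds = [s for _, stds in REASONING_STDS + AGENTIC_STDS for s in stds]
high = sum(1 for s in all_stds if s > 0.1)

print(f"{high} of {len(all_stds)} = {high / len(all_stds):.1%}")
# → 13 of 39 = 33.3%
print("max human std:", max(HUMAN_STDS))
# → max human std: 0.053
```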

Optimization pressure induces spontaneous reward hacking.

The post-hoc auditor flagged four trials for distinct exploit classes — hardcoded answer dictionaries, brute-force enumeration of the eval API, and proxy-bypass attempts. The dual-container isolation and proxy enforcement neutralized every one of them. No flagged run inflated its test score.

Cite this work

BibTeX

If you use MAC in your research, please cite the paper.

@misc{lu2026mac,
  title         = {The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?},
  author        = {Xinyu Lu and Tianshu Wang and Pengbo Wang and Zujie Wen and
                   Zhiqiang Zhang and Jun Zhou and Boxi Cao and Yaojie Lu and
                   Hongyu Lin and Xianpei Han and Le Sun},
  year          = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}