Scale Labs
Agents · 02.25.2026

VeRO: An Evaluation Harness for Agents to Optimize Agents

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan (Emily) Xue, Sam Denton

View paper

VeRO benchmarks whether coding agents can improve other AI agents by editing prompts, tools, and workflows under controlled evaluation.

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit–execute–evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VeRO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing coding agent optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VeRO to support research on agent optimization as a core capability for coding agents.
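The edit–execute–evaluate cycle the abstract describes, together with versioned snapshots and budget-controlled evaluation, can be sketched as a toy optimization loop. Everything below is an illustrative assumption, not VeRO's actual API: the agent representation, the keyword-based scorer, and the function names are all invented for this sketch.

```python
import copy

def evaluate(agent, tasks):
    """Toy stand-in for a reference evaluation procedure: the score is the
    fraction of task keywords covered by the agent's prompt. A real harness
    would run the target agent and collect structured execution traces."""
    prompt = agent["prompt"].lower()
    return sum(keyword in prompt for keyword in tasks) / len(tasks)

def optimize(agent, candidate_edits, tasks, budget):
    """Edit–execute–evaluate loop with versioned snapshots and a hard
    budget on the number of evaluations (hypothetical interface)."""
    snapshots = []  # versioned history: (version, agent snapshot, score)
    best = copy.deepcopy(agent)
    best_score = evaluate(best, tasks)  # baseline evaluation costs budget too
    budget -= 1
    snapshots.append((0, copy.deepcopy(best), best_score))
    for version, edit in enumerate(candidate_edits, start=1):
        if budget <= 0:  # budget-controlled evaluation: stop when exhausted
            break
        candidate = edit(copy.deepcopy(best))  # edit a snapshot, not the original
        score = evaluate(candidate, tasks)     # execute + evaluate
        budget -= 1
        snapshots.append((version, copy.deepcopy(candidate), score))
        if score > best_score:  # keep a modification only if it helps
            best, best_score = candidate, score
    return best, best_score, snapshots

# Example run with two candidate prompt edits (illustrative only).
tasks = ["plan", "verify", "cite"]
agent = {"prompt": "You are a helpful assistant. Plan before answering."}
edits = [
    lambda a: {**a, "prompt": a["prompt"] + " Verify each step."},
    lambda a: {**a, "prompt": a["prompt"] + " Cite your sources."},
]
best, score, history = optimize(agent, edits, tasks, budget=5)
```

The loop accepts an edit only when it improves the evaluated score, so in this run both prompt edits are kept and the final agent covers all three task keywords. The versioned snapshot list is what makes each intermediate agent reproducible and comparable after the fact.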



Copyright 2026 Scale Inc. All rights reserved.
