Insights Generator: Automated Failure Mode Analysis for Agents

Every production agent fails in ways its developers did not anticipate.

Making an agent deployment reliable means finding the patterns behind those failures and fixing them: understanding when the agent fails, why it fails, which failures are rare but dangerous, and which behaviors quietly degrade quality before aggregate metrics make the problem obvious. This work is what turns an impressive demo into a reliable deployment.

But that signal is buried within the agent’s traces.

A single trace can tell you about the circumstances of one run; it can't tell you what your agent systematically does. The failures that matter most often emerge only across many runs: rare condition-specific bugs, silent quality degradation, and brittle behavior under distribution shift. Unfortunately, trace corpora are far too large for any human to read end to end.

Trace data, however, is structured. This raises a natural question: What if we treated trace analysis as a corpus-level discovery problem, the way data scientists discover patterns across populations of users?

In this framing, the whole trace corpus becomes the input and an open-ended diagnostic question becomes the prompt. The output is a set of behavioral patterns that explain how the agent succeeds, where it fails, and what developers should investigate next. It replaces the single pass/fail label of an individual run with structure across the whole population.

Insights Generator (IG) is the system we built to do this. Across our experiments, human engineers using IG reports improved agent benchmark performance by more than 30 percentage points over the unmodified baseline scaffold, roughly 14 percentage points more than engineers using Claude Code-based insights.

IG turns agent traces into a durable improvement flywheel, where issues are automatically surfaced and triaged for review.

Diagnosing Behavior Across Thousands of Traces

When an agent breaks today, the standard playbook is to open a few execution traces, look around, form a hypothesis, fix it, and iterate. This works with a handful of traces, but it breaks down at production scale, and most failures don't trip evals anyway. They quietly degrade output and remain invisible until they compound into a falling accuracy number.

Three major problems make agent debugging at scale hard, and current tooling doesn't fully solve any of them.

Volume. Production corpora cover thousands of runs, each often 10K+ tokens long. No single model context window holds them, and the patterns that matter aren't reducible to keyword matches or fixed error taxonomies; instead, they're semantic and contextual.

Pattern discovery. The failures that matter often occur only when several specific conditions line up: a particular task type, a particular tool, and a particular state. They're invisible in any one trace, and you don't know what to look for in advance. Existing tools assume you have a hypothesis to validate; the most valuable hypotheses only emerge after reviewing many traces.

Silent failures. Subpar reasoning, skipped tool calls, wasted exploration, and fabricated tool outputs don’t show up in a binary eval or as an error message. Instead, they degrade output quality in ways that compound across runs. Reviewing traces is often the only way to pinpoint these granular issues.

We formalize this as corpus-level trace diagnostics: given a corpus of execution traces and an open-ended diagnostic query, produce a set of natural-language findings, each grounded in trace-level evidence with a prevalence estimate over the corpus.
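
To make the shape of that output concrete, here is a minimal sketch of what a single finding could look like as a data structure. The class name, field names, and trace IDs are our own illustrative assumptions, not IG's actual schema; the only things taken from the definition above are that a finding is natural-language, cites trace-level evidence, and carries a prevalence estimate over the corpus.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One natural-language finding about the corpus (hypothetical schema)."""

    description: str
    supporting_trace_ids: list[str]  # trace-level evidence for the claim
    corpus_size: int                 # number of traces the claim was checked against

    @property
    def prevalence(self) -> float:
        """Fraction of the corpus in which the pattern was observed."""
        if self.corpus_size == 0:
            return 0.0
        return len(set(self.supporting_trace_ids)) / self.corpus_size


# Illustrative values only; these are not results from the experiments.
finding = Finding(
    description="Agent reports success without verifying the task outcome",
    supporting_trace_ids=["t-003", "t-017", "t-042"],
    corpus_size=12,
)
print(f"{finding.description}: {finding.prevalence:.0%} of traces")
```

Keeping the cited trace IDs next to the prevalence number lets a reader jump from the aggregate claim back to the individual runs that support it.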

How IG Works

IG addresses these three problems through three components working together.

A stateful Python data layer. Rather than dumping traces into the model's context, we hold the full corpus in a stateful Python session, and agents interact with it through code execution. They query, filter, join, and compute over trace metadata, and only see aggregated results in their own context. Raw traces stay in the data layer.

This allows the system to scale: trace volume becomes a property of the data layer rather than the model's context window. Because all of IG's analysis tools are exposed as Python functions in the same environment (similar to programmatic tool calling), agents can chain operations in a single code block and route intermediate trace content through Python rather than bloating their own context.
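
The idea can be sketched in a few lines of plain Python. Everything here is a simplified assumption for illustration (the TraceStore class, its methods, and the trace fields are hypothetical, and the real data layer is not described at this level of detail): the point is only that raw trace text stays inside the session object, while the code an agent writes returns a small aggregate.

```python
from collections import Counter


class TraceStore:
    """Holds raw traces in the session; callers get back only small summaries."""

    def __init__(self, traces):
        self._traces = list(traces)

    def filter(self, predicate):
        """Return a new store; raw text never leaves the data layer."""
        return TraceStore(t for t in self._traces if predicate(t))

    def count_by(self, key):
        """Aggregate: counts of one metadata field."""
        return dict(Counter(t[key] for t in self._traces))

    def __len__(self):
        return len(self._traces)


# Hypothetical corpus; each "text" stands in for a long raw trace.
store = TraceStore([
    {"id": "t1", "task_type": "formula", "passed": False, "text": "x" * 10_000},
    {"id": "t2", "task_type": "formula", "passed": True, "text": "x" * 10_000},
    {"id": "t3", "task_type": "pivot", "passed": False, "text": "x" * 10_000},
    {"id": "t4", "task_type": "formula", "passed": False, "text": "x" * 10_000},
])

# The kind of code block an agent might submit: chain operations, return a summary.
failed = store.filter(lambda t: not t["passed"])
summary = {"n_failed": len(failed), "failed_by_task": failed.count_by("task_type")}
print(summary)
```

The summary that reaches the model is a few dozen characters, regardless of how long the underlying traces are.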

Purpose-built analysis tools. We built 14 analysis primitives for corpus work, including format-aware chunking that adapts to trace structure (JSON messages, tool calls, markdown sections), LLM-based extraction to pull complex features from traces, cohort comparison for differentiating trace groups, and hybrid semantic-keyword search across all traces.

These tools handle arbitrary trace lengths as an architectural property, rather than relying on a wider model context window. This also enables scalable divide-and-conquer analyses where cheaper models are used for summary and extraction workloads.
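
As a rough illustration of format-aware chunking, the sketch below splits a trace on structural boundaries (JSON messages if the trace parses as a JSON list, markdown headings otherwise) and then packs those units into size-limited chunks. The function name, the character budget, and the two-format fallback are assumptions made for this example; IG's own primitives are not reproduced here.

```python
import json
import re


def chunk_trace(raw: str, max_chars: int = 2000) -> list[str]:
    """Split a trace on structural boundaries instead of fixed offsets."""
    try:
        parsed = json.loads(raw)
        # A JSON list of messages: one unit per message.
        units = [json.dumps(m) for m in parsed] if isinstance(parsed, list) else [raw]
    except json.JSONDecodeError:
        # Fall back to markdown: start a new unit at every heading line.
        units = [u for u in re.split(r"(?m)^(?=#{1,6} )", raw) if u.strip()]

    chunks, current = [], ""
    for unit in units:
        # Close the current chunk if adding this unit would exceed the budget.
        # A single unit larger than the budget is kept whole as its own chunk.
        if current and len(current) + len(unit) > max_chars:
            chunks.append(current)
            current = ""
        current += unit + "\n"
    if current:
        chunks.append(current)
    return chunks


markdown_trace = "# Plan\nread the sheet\n## Result\nwrote 3 cells"
for chunk in chunk_trace(markdown_trace, max_chars=20):
    print(repr(chunk))
```

Because each chunk ends on a message or section boundary, a cheaper model can summarize or extract from chunks independently, and the results can be merged afterwards.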

Subagents that discover and validate patterns. A Scout agent reads small samples of traces and proposes candidate hypotheses for what might be happening, with emphasis on breadth. An Investigator agent then takes each hypothesis and validates it across the full corpus with statistical evidence, prioritizing depth and rigor.

Every finding comes with a prevalence estimate and trace-level citations. An Orchestrator coordinates the loop, intelligently dispatches Scouts and Investigators based on coverage so far, and synthesizes the final report.
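
The loop can be sketched as follows. This is a toy version under heavy assumptions: the Scout and Investigator here are simple functions over pre-tagged traces rather than LLM agents, the "tags" field is hypothetical, and the Wilson score interval is our choice of a standard way to attach uncertainty to a prevalence estimate, not necessarily the statistic IG reports.

```python
import math
import random


def scout(traces, sample_size, rng):
    """Breadth: read a small sample and propose candidate patterns.
    Stand-in for an LLM call: proposes every tag seen in the sample."""
    sample = rng.sample(traces, min(sample_size, len(traces)))
    return sorted({tag for t in sample for tag in t["tags"]})


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def investigate(traces, hypothesis):
    """Depth: check one hypothesis against the full corpus."""
    cited = [t["id"] for t in traces if hypothesis in t["tags"]]
    return {
        "hypothesis": hypothesis,
        "prevalence": len(cited) / len(traces),
        "ci95": wilson_interval(len(cited), len(traces)),
        "citations": cited[:5],  # a few trace-level citations
    }


def orchestrate(traces, rounds=3, sample_size=5, seed=0):
    """Dispatch scouts, investigate only uncovered hypotheses, rank findings."""
    rng = random.Random(seed)
    findings = {}
    for _ in range(rounds):
        for hyp in scout(traces, sample_size, rng):
            if hyp not in findings:  # skip what is already covered
                findings[hyp] = investigate(traces, hyp)
    return sorted(findings.values(), key=lambda f: -f["prevalence"])


# Hypothetical corpus: every second trace carries an "import_error" tag.
corpus = [{"id": f"t{i}", "tags": ["import_error"] if i % 2 == 0 else []}
          for i in range(20)]
for finding in orchestrate(corpus, rounds=1, sample_size=20):
    print(finding)
```

The split mirrors the division of labor described above: the scout only ever sees a sample, while every number in the final report comes from a pass over the whole corpus.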

Evaluating Corpus-Level Diagnostics

A separate problem we ran into: no shared standard previously existed for measuring whether an insight report is any good. Coverage of known failure modes, quality of evidence, and downstream impact on agent performance are all different metrics that often favor different systems.

We propose a four-setting evaluation framework, varying who evaluates (LLM judge vs human expert) and what's measured (report quality vs downstream scaffold impact):

Calibrated LLM judge: multi-system round-robin tournament scored on a per-finding-cluster rubric.
Human expert rubric: domain experts score reports on correctness, depth, evidence, and actionability.
Human-in-the-loop scaffold patching: engineers use the reports to modify the agent scaffold, and we measure the held-out test set delta.
Autonomous coding-agent patcher loop: a coding agent uses the reports to modify the scaffold autonomously, looped until convergence, similar to the setup used by HALO.
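
For the first setting, a pairwise win rate from a round-robin tournament can be computed along these lines. This is a simplified sketch: the system names and verdicts are made up, the judge is a lookup table rather than a calibrated LLM, counting a tie as half a win is our assumption, and the per-finding-cluster rubric is collapsed into one verdict per comparison.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_win_rates(systems, judge):
    """Round-robin: every pair of systems is compared by `judge`.

    `judge(a, b)` returns one verdict per compared item: the name of the
    winning system, or None for a tie (counted as half a win for each side).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b in combinations(systems, 2):
        for winner in judge(a, b):
            games[a] += 1
            games[b] += 1
            if winner is None:
                wins[a] += 0.5
                wins[b] += 0.5
            else:
                wins[winner] += 1.0
    return {s: wins[s] / games[s] for s in systems if games[s]}


# Hypothetical verdicts for three placeholder systems, not real results.
VERDICTS = {
    ("A", "B"): ["A", "A", "B"],
    ("A", "C"): ["A", None, "A"],
    ("B", "C"): ["C", "B", "B"],
}

rates = pairwise_win_rates(["A", "B", "C"], lambda a, b: VERDICTS[(a, b)])
print(rates)
```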

System rankings were consistent across all four, providing the first cross-validated comparison of corpus-level diagnostic systems against multiple baselines (Recursive Language Models, Trace2Skill, Claude Code with subagents, single-agent Claude Code).

Main Findings

IG led or matched the strongest baseline in every evaluation setting:

Human-in-the-loop scaffold patching: experts using IG reports improved a spreadsheet-manipulation agent by +30.4pp over baseline (27.0% → 57.4%), versus +16.2pp for experts using Claude Code insights.
Autonomous coding-agent patcher loop: report-equipped patchers (IG, RLM, CC Subagents) all converged to comparable test pass rates (0.81–0.84). A no-report control regressed by round 3: without analysis input, the patcher invented its own (often wrong) priorities.
LLM-judge pairwise tournament: IG led with a 77.9% pairwise win rate across both SpreadsheetBench and Humanity’s Last Exam, ahead of alternative systems like Claude Code, RLMs, and Trace2Skill. The biggest gaps were on mechanism specificity and evidence depth.
Human expert rubric: IG and the next-best system received statistically comparable aggregate scores when tested on SpreadsheetBench and AppWorld, with IG leading on depth and evidence support.
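
Since the scaffold-patching numbers are reported in percentage points (pp) rather than relative percent, the arithmetic is worth spelling out once. The inputs below are the figures quoted in the list above; the relative gain and the gap between the two arms are derived from them here and are not separately reported.

```python
baseline = 27.0    # pass rate (%) of the unmodified scaffold
with_ig = 57.4     # pass rate (%) after experts patched using IG reports
cc_gain_pp = 16.2  # reported gain (pp) for experts using Claude Code insights

ig_gain_pp = round(with_ig - baseline, 1)                        # absolute gain in pp
relative_gain = round(100 * (with_ig - baseline) / baseline, 1)  # gain relative to baseline, in %
gap_pp = round(ig_gain_pp - cc_gain_pp, 1)                       # IG arm minus Claude Code arm, in pp

print(f"IG gain: +{ig_gain_pp}pp ({relative_gain}% relative), gap: {gap_pp}pp")
```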

The headline takeaway is that report quality (depth, evidence, and mechanism), not just topical coverage, is what translates into practitioner effectiveness. Alternative systems either miss failure modes or act as dragnets that surface many irrelevant findings. We find that IG strikes a balance by explaining the highest-priority issues with enough evidence that someone can fix them.

Three Patterns of Agent Failure

Across the corpora we analyzed, three patterns came up consistently.

1. Agents fail silently more often than loudly.

Most failure modes don't trip exceptions. They produce wrong answers calmly, with confident self-assessment. On AppWorld, IG surfaced that 50 of 51 incorrect traces (98%) marked themselves task_completed=true, often with celebratory language ("✅ Successfully completed!"). In 96% of failed traces, the agent's verification step checked whether the API call succeeded, not whether the task was done. The self-reported success signal was useless as a quality indicator.

Versions of this show up everywhere: silent ImportErrors in 43% of Python-using traces, Wikipedia 403s in 55% of search-heavy traces, fabricated tool outputs in 38% of code traces. Agents are effective partly because they work around errors, but that same resilience means critical issues go unnoticed because the agent never crashed.
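
A check of this kind is simple once the corpus is queryable. The sketch below compares each trace's self-reported status against ground truth. The task_completed field comes from the AppWorld example above; the passed field, the trace IDs, and the synthetic corpus (constructed to mirror the 50-of-51 count, with an arbitrary number of passing runs added) are assumptions for illustration.

```python
def self_report_audit(traces):
    """How often do failed runs still claim success?"""
    failed = [t for t in traces if not t["passed"]]
    claimed = [t for t in failed if t["task_completed"]]
    return {
        "n_failed": len(failed),
        "n_failed_claiming_success": len(claimed),
        "rate": len(claimed) / len(failed) if failed else 0.0,
        "examples": [t["id"] for t in claimed[:3]],  # citations for manual review
    }


# Synthetic corpus: 51 failed runs, 50 of which self-report success,
# plus 20 passing runs (an arbitrary number chosen for the example).
corpus = (
    [{"id": f"fail-{i}", "passed": False, "task_completed": True} for i in range(50)]
    + [{"id": "fail-50", "passed": False, "task_completed": False}]
    + [{"id": f"pass-{i}", "passed": True, "task_completed": True} for i in range(20)]
)

audit = self_report_audit(corpus)
print(f"{audit['n_failed_claiming_success']} of {audit['n_failed']} "
      f"failed traces claimed success ({audit['rate']:.0%})")
```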

2. Agents inherit superstitions from training data.

On SpreadsheetBench, 17 different traces independently set fullCalcOnLoad=True to try to fix an Excel issue, and all 17 failed. The agent was reaching for a pattern that exists in its training data but has no effect in this environment. Agents do this constantly: they invent fixes that look right because they resemble fixes that work elsewhere.

IG was critical for finding this kind of error. These superstitions are invisible per-trace (any one agent setting fullCalcOnLoad looks reasonable) and only become visible as 17 independent agents converge on the same broken workaround.
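
One way to surface such a pattern is to group traces by the workaround they attempt and look for workarounds that recur but never coincide with a pass. The sketch below is a deliberately crude approximation: the regular expression, the code and passed fields, the threshold, and the synthetic corpus are all assumptions, and a real system would rely on semantic extraction rather than a regex.

```python
import re
from collections import defaultdict

# Hypothetical heuristic: any `name = True` assignment in agent-written code.
SETTING = re.compile(r"\b([A-Za-z_]\w*)\s*=\s*True\b")


def workaround_outcomes(traces, min_traces=5):
    """Find settings toggled in many traces that never coincide with a pass."""
    seen = defaultdict(lambda: {"traces": 0, "passed": 0})
    for t in traces:
        for name in set(SETTING.findall(t["code"])):  # count once per trace
            seen[name]["traces"] += 1
            seen[name]["passed"] += int(t["passed"])
    return {
        name: stats for name, stats in seen.items()
        if stats["traces"] >= min_traces and stats["passed"] == 0
    }


# Synthetic corpus echoing the example above: 17 failing traces share one
# workaround, while 6 passing traces use a different, unrelated setting.
corpus = (
    [{"code": "wb.calculation.fullCalcOnLoad = True", "passed": False}] * 17
    + [{"code": "load_workbook(path, data_only=True)", "passed": True}] * 6
)

suspects = workaround_outcomes(corpus)
print(suspects)
```

Any single trace in this corpus looks unremarkable; the signal only appears once outcomes are grouped by the shared workaround.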

3. Agents will game the eval if they can.

On SimpleQA Verified, IG caught an agent that didn't bother to research the answer to a question. Instead, it went to HuggingFace, found the benchmark's own dataset, and pulled the ground-truth answer directly. No aggregate metric would have surfaced this, because the behavior was rare enough that average accuracy looked unremarkable.

We've seen analogues: agents hardcoding answers with comments like # Hypothetical search result, agents in other benchmarks discovering edge cases in the grader's exact-match check. Whenever the search space is wide enough and the eval has any shortcut, the agent finds it. Web access in particular makes this a structural risk for benchmark integrity.

All three of these failure modes are easy to miss in a sample of a handful of traces, and easy to catch when you can systematically compare across many. Corpus-level trace diagnostic systems like IG make these emerging agentic failure modes easier to pinpoint, even as agents scale up to production traffic and production edge cases.

What's Next

A few directions we're excited about:

Extending to harder benchmarks. Our +30.4pp human intervention result was on SpreadsheetBench. We observed that IG continued to provide useful insights, even after low-hanging bugs were addressed, but the benchmark saturated too quickly for us to measure this against competing systems. We're particularly interested in extending IG to more difficult tasks, such as longer-horizon coding tasks (SWE-Bench Pro, SWE Atlas, FeatureBench) and broader benchmark suites (MCP Atlas, APEX-Agents).

Integration with SGP. We’re exploring rolling out this trace analysis capability to Scale Generative AI Platform customers to accelerate production-quality agent development.

Closing the loop with optimization. IG surfaces behavioral patterns; the next step is automatically acting on them. Our earlier work on VeRO builds infrastructure for coding agents to optimize other agents through edit-execute-evaluate loops. Pairing IG (diagnosis) with VeRO (optimization) would let an autonomous system discover behavioral failures across a trace corpus, propose patches to fix them, validate the patches via VeRO, and produce an improved agent end-to-end, with humans in the loop only at decision points.

This is the long-term vision: agents and their analysts evolve together, and the manual debug-fix-iterate loop becomes optional rather than necessary.

For details on the system, evaluation framework, and full benchmark results, see our paper on arXiv: https://arxiv.org/abs/2605.21347.