
Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

George Pu, Mike Lee, Sam Denton
8 min read

TL;DR:

  • We release LHAW, a dataset-agnostic pipeline for generating underspecified variants of long-horizon tasks, alongside 285 tasks drawn from MCP-Atlas, TAC, and SWE-Bench Pro.
  • Frontier models vary widely in how effectively they recover from underspecification when given access to a simulated user, reflecting differences in strategic clarification.
  • Among clarification failure modes, question quality, over-clarification, and response misinterpretation vary the most across the models we tested.

Below, we describe the problem setting, pipeline, and results, and explain how LHAW evaluates ambiguity in realistic long-horizon workflows.

Read the full paper here and explore the dataset on Hugging Face.

A Systematic Evaluation of Ambiguity in Long-Horizon Tasks

Long-horizon workflow agents operate under an interaction model that differs fundamentally from conversational assistants. In realistic deployments, asking a human is not free: each clarification can interrupt execution, introduce latency, and impose cognitive switching costs on supervisors. This shifts the evaluation target. The central question is no longer only whether an agent can execute a well-specified task, but whether it can recognize when information is missing and decide when clarification is justified.

Scenarios that warrant clarification can be broadly divided into two categories.

  • Semantic ambiguity arises when a request admits multiple valid interpretations, each of which could plausibly satisfy the user.
  • Underspecification, in contrast, occurs when a request is missing information that is required for successful execution. Without acquiring that information, the task is unsolvable or highly likely to fail.

Long-horizon agent benchmarks such as TheAgentCompany (TAC), SWE-Bench Pro, and MCP-Atlas primarily evaluate execution under sufficient specification. Clarification benchmarks, on the other hand, rarely model the costs, timing, and downstream effects of interaction over extended workflows. As a result, we lack a standardized way to measure whether an agent can reliably detect outcome-critical missing information, seek clarification when it meaningfully improves outcomes, and avoid unnecessary interruptions when ambiguity is benign.

LHAW (Long-Horizon Augmented Workflows) is designed to address this gap. LHAW is a dataset-agnostic pipeline that transforms well-specified tasks into controllably underspecified variants within the original task environment, and then empirically validates the resulting ambiguity through agent execution. Rather than relying on prompt-level ambiguity heuristics, LHAW defines ambiguity through observable outcomes. Variants are classified as outcome-critical, divergent, or benign based on how agents behave when executing them.

This framing enables systematic evaluation of clarification behavior in long-horizon settings where interaction is costly, decisions are sequential, and failures are often irreversible.

The LHAW Pipeline

LHAW is a three-phase synthetic pipeline for generating and validating underspecified long-horizon tasks.

Three-phase LHAW pipeline: (1) extract and rank critical prompt segments, (2) generate underspecified variants by selectively removing information, and (3) run empirical agent trials to classify tasks as outcome-critical, divergent, or benign for benchmark release.

First, segment extraction identifies atomic pieces of information in a task prompt that an agent depends on during execution. These segments are categorized by dimension (goal, constraint, input, or context) and scored for criticality and recoverability. This step establishes what information can be removed and how risky its removal is.
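As a minimal sketch, the output of this step can be thought of as a list of scored segments ranked by removal risk. The data structure and ranking rule below are our own illustration, not the paper's implementation:

```python
from dataclasses import dataclass
from enum import Enum

class Dimension(Enum):
    GOAL = "goal"
    CONSTRAINT = "constraint"
    INPUT = "input"
    CONTEXT = "context"

@dataclass
class Segment:
    text: str              # atomic span extracted from the task prompt
    dimension: Dimension   # which ambiguity dimension it falls under
    criticality: float     # 0-1: how likely execution fails if this is removed
    recoverability: float  # 0-1: how easily an agent could re-derive it from the environment

def rank_segments(segments: list[Segment]) -> list[Segment]:
    # Riskiest removals first: high criticality, low recoverability.
    return sorted(segments, key=lambda s: (-s.criticality, s.recoverability))
```

Ranking by criticality first and breaking ties on recoverability lets later stages target the segments whose removal is most likely to produce outcome-critical ambiguity.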

Table defining four ambiguity dimensions—Goal, Constraint, Input, and Context—with descriptions, indicators of underspecification, and example clarification questions for each.

Second, candidate generation produces underspecified versions of the original task by selectively removing or obscuring high-impact segments. Different removal strategies allow LHAW to control the severity and type of ambiguity while preserving the structure and intent of the original task.

Table showing an original formatting task prompt alongside three underspecification strategies—Delete, Vaguify, and Genericize—illustrating how specific instructions are progressively removed or made ambiguous.
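The three strategies in the table above can be sketched at the string level. This is an illustrative simplification: the actual pipeline presumably applies these edits with an LLM rewriter rather than literal substring replacement:

```python
def delete(prompt: str, segment: str) -> str:
    """Delete: remove the segment entirely."""
    return " ".join(prompt.replace(segment, "").split())

def vaguify(prompt: str, segment: str, placeholder: str = "appropriate") -> str:
    """Vaguify: replace the specific instruction with a vague placeholder."""
    return prompt.replace(segment, placeholder)

def genericize(prompt: str, segment: str, generic: str) -> str:
    """Genericize: swap the specific requirement for a broader category term."""
    return prompt.replace(segment, generic)

# Hypothetical formatting task, loosely modeled on the table above.
task = "Format the report as a two-column PDF in 11pt Helvetica."
seg = "a two-column PDF in 11pt Helvetica"
```

Each strategy preserves the overall task structure while removing a different amount of information, which is how LHAW controls the severity of the induced ambiguity.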

Finally, empirical validation runs agents on each underspecified variant. Based on observed execution outcomes, variants are classified as:

  • Outcome-critical, where agents consistently fail without clarification;
  • Divergent, where outcomes vary across trials;
  • Benign, where agents reliably succeed despite the missing information;
  • New task, where removing information changes the task itself rather than introducing ambiguity.

We filter out variants classified as New task (e.g., deleting constraints yields a valid but different task). By grounding ambiguity in execution rather than linguistic intuition, LHAW produces benchmark-ready samples with explicit guarantees about how missing information affects agent behavior.
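The classification rule reduces to a small decision over repeated trial outcomes. The function below is our sketch, assuming pass/fail flags come from the benchmark's own harness and that "new task" is detected by a separate check:

```python
def classify_variant(successes: list[bool], changed_task: bool = False) -> str:
    """Classify an underspecified variant from repeated no-clarification trials.

    successes: per-trial pass/fail flags from the benchmark's evaluation harness.
    changed_task: set by a separate check when the edit altered the task
    itself rather than making it ambiguous (these variants are filtered out).
    """
    if changed_task:
        return "new_task"
    if all(successes):
        return "benign"            # agents reliably succeed anyway
    if not any(successes):
        return "outcome_critical"  # agents consistently fail
    return "divergent"             # mixed outcomes across trials
```

Because the labels are derived from execution rather than from how ambiguous the prompt "looks", each released variant carries an empirical guarantee about its effect on agent behavior.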

Crucially, LHAW is dataset-agnostic. The pipeline can be applied to any agent benchmark or internal workflow where tasks can be executed and scored using an evaluation harness. This enables researchers and practitioners to introduce and measure underspecification in their own settings without requiring benchmark-specific ambiguity annotations.

Benchmark Construction

We apply LHAW to three long-horizon agent benchmarks:

  • MCP-Atlas, which focuses on tool use across diverse MCP servers
  • TheAgentCompany (TAC), which measures agents acting as digital workers within a simulated software company
  • SWE-Bench Pro, which evaluates code repair on real-world software issues

Across these benchmarks, we generate 285 empirically validated underspecified task variants, each labeled by ambiguity class and additional execution metadata. For every variant, we use the same agent runner and evaluation harness as the original benchmark, ensuring comparability with prior results.

Results: Agent Behavior Under Ambiguity

Using the LHAW benchmark, we study how frontier models detect, reason about, and resolve underspecification. Our analysis focuses on four complementary questions: the value of information, the cost of information, failure modes of clarification, and the impact of agentic prompting strategies.

Value of Information

We first ask how much performance agents can recover when clarification is available. To measure this, we compare task success (Pass@3) and partial progress (Ckpt%) on well-specified tasks versus underspecified variants, both with and without access to a simulated user who can clarify task ambiguity based on the information that was edited. In the Score column, values in parentheses show the absolute gain from user interaction on underspecified tasks. This isolates how much of the performance lost under ambiguity can be recovered through interaction.

Table comparing model performance across MCP-Atlas, TAC, and SWE-Bench Pro, showing baseline metrics, clarification behavior (Ask%, Avg/Traj), and gains in Pass@3 and Ckpt% under underspecification.

Across all datasets, clarification recovers a significant fraction of lost Pass@3 and Ckpt%, but it does not fully restore baseline performance. This gap highlights that ambiguity in long-horizon workflows is not just a missing-input problem: once execution begins, early errors can cascade into downstream failures.

Performance patterns vary by benchmark:

  • Claude Opus-4.5 is strongest on MCP-Atlas both before and after underspecification.
  • On TheAgentCompany, Gemini-3-Flash and Opus-4.5 lead on well-specified tasks, while GPT-5.2 performs best once tasks become underspecified and clarification is available.
  • On SWE-Bench Pro, Claude Sonnet-4.5 performs best on well-specified tasks, while GPT-5.2 benefits most from clarification, fully recovering its baseline performance under ambiguity.

While clarification meaningfully improves outcomes, models differ sharply in how they achieve these gains. Some recover performance with minimal user interaction, while others rely on frequent clarification to make progress. This motivates an efficiency-focused view of clarification behavior.

To capture this trade-off, we measure the value of information by normalizing performance gains by the number of questions asked. We report Ask%, Avg/Traj, and Gain/Q—the improvement in Pass@3 or Ckpt% per question.
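The three efficiency metrics can be computed directly from per-trajectory question counts and aggregate scores. A minimal sketch, with function names of our own choosing:

```python
def ask_pct(question_counts: list[int]) -> float:
    """Ask%: fraction of trajectories in which the agent asked at least one question."""
    return 100.0 * sum(1 for q in question_counts if q > 0) / len(question_counts)

def avg_per_traj(question_counts: list[int]) -> float:
    """Avg/Traj: mean number of questions per trajectory."""
    return sum(question_counts) / len(question_counts)

def gain_per_question(score_with_user: float, score_without: float,
                      total_questions: int) -> float:
    """Gain/Q: improvement in Pass@3 or Ckpt% per question asked."""
    if total_questions == 0:
        return 0.0  # no interaction, so no gain to attribute
    return (score_with_user - score_without) / total_questions
```

Normalizing by question count is what separates models that ask selectively and gain a lot per interaction from models that recover performance through sheer interruption volume.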

These metrics reveal distinct clarification strategies. Claude Opus-4.5 and Gemini-3-Pro consistently achieve higher Gain/Q while engaging clarification more selectively, as reflected in lower Ask% across benchmarks. Gemini-3-Pro is especially efficient, invoking clarification in only a small fraction of trials while still recovering strong performance.

In contrast, GPT-5.2 asks far more frequently, achieving large absolute gains but substantially lower value per question. Overall, while all models benefit from clarification, the strongest agents are those that extract the most value from each interaction rather than relying on frequent interruption.

Cost of Information

Clarification is not free, and agents must adapt their behavior to different interaction-cost regimes. To study this, we vary the perceived cost of asking the user by introducing different user personas on MCP-Atlas, ranging from highly available supervisors to busy executives.

Table comparing clarification cost personas (Original, Supervisor, Standard, Executive), showing identical Ckpt% scores with varying question counts and gain per question.

We observe that increasing clarification cost leads to divergent, but intuitive, strategies. When interaction is cheap, agents ask more questions and achieve higher overall progress, but with lower efficiency per question. When interaction is expensive, agents ask fewer questions and extract more value per question, but accept a higher risk of failure. These results highlight the importance of evaluating clarification under realistic cost assumptions rather than assuming unlimited interaction.
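Varying the perceived cost of asking can be implemented by conditioning the simulated user on a persona. The persona wordings below are hypothetical illustrations; the paper's exact instructions may differ:

```python
# Hypothetical persona instructions appended to the simulated user's
# system prompt; the wording used in the actual experiments may differ.
PERSONAS = {
    "supervisor": "You are readily available and glad to answer any question.",
    "standard": "You will answer questions, but prefer not to be interrupted often.",
    "executive": "You are a busy executive: answer only when the task truly cannot proceed.",
}

def user_system_prompt(base: str, persona: str) -> str:
    """Compose the simulated user's system prompt for a given cost regime."""
    return f"{base}\n\n{PERSONAS[persona]}"
```

Because only the persona text changes, any difference in agent behavior across regimes can be attributed to the agent's sensitivity to perceived interaction cost.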

Failure Modes of Clarifying Questions

When clarification fails, it does so in systematic ways. Using an LLM-based judge, we analyze the semantic failure modes of agent questions on MCP-Atlas. Common issues include poorly targeted or compound questions, failure to identify outcome-critical missing information, unnecessary clarification of benign details, and misinterpretation or underutilization of user responses.

Table analyzing clarification failure modes by model, reporting counts and rates for question quality, targeting, information integration, over- and under-clarification, timing strategy, and response misinterpretation.

These failures are not evenly distributed across models: question quality, over-clarification, and response misinterpretation vary the most. We also observe a symmetry in clarification tendencies. Some models over-clarify, frequently asking unnecessary questions, while models without that tendency are much more likely to under-clarify, proceeding on incorrect assumptions. Notably, these patterns are amplified on failed trajectories, suggesting that question quality is a key bottleneck in recovering performance under underspecification.

Agentic Prompting Strategies

Finally, we examine how different agentic prompting strategies affect clarification behavior and agents' ability to reason through ambiguity on TAC. Comparing approaches such as CodeAct, ReAct, Reflexion, and Plan-and-Execute, we find that no single strategy dominates across all settings.

Table comparing prompting strategies (CodeAct, ReAct, Reflexion, Plan & Execute) across overall and ambiguity classes, reporting Pass@3, Ckpt%, ask rate, and average trajectory length.

Simpler prompting strategies tend to perform best on aggregate success metrics, while more structured strategies perform better on harder, outcome-critical tasks, especially when clarification is available. However, more complex strategies also tend to ask more questions, sometimes interfering with exploration or introducing unnecessary interaction. This suggests a trade-off between structured reasoning and interaction efficiency that future agent designs must navigate.

What’s Next

As agents take on longer-horizon responsibilities, failures increasingly arise from acting under uncertainty rather than lacking core capabilities. LHAW provides a practical framework for:

  • evaluating strategic clarification under realistic ambiguity,
  • augmenting existing benchmarks with validated underspecification,
  • calibrating agents to different interaction-cost regimes,
  • and enabling post-training focused on reliability rather than raw performance.

Because LHAW is dataset-agnostic, its applicability extends beyond the benchmarks studied here. Researchers and practitioners can apply the pipeline to their own tasks and environments, tune ambiguity levels to deployment risk, and build agents that behave more reliably under real-world underspecification.