HiL-Dynamics: Understanding Agents That Don’t Know What They Don’t Know

Earlier this year we released HiL-Bench (Human-in-Loop Benchmark) to measure how well a coding agent can ask for help when faced with underspecified problems. The gap between fully specified and underspecified problems was large: agents with full information reach 80-90% pass@3, while agents with only partial information and an ask_human() tool top out at around 30%. In the accompanying paper we named the missing capability selective escalation: the ability to identify when necessary information cannot be ascertained from the current context and to ask a human for help before continuing implementation.

We presented results using the SWE-Agent harness. Since then, many harnesses and frameworks have been developed, including complete agentic ones such as Claude Code and Codex. As a result, selective escalation is no longer a model-only capability but a property of the entire agentic system. And there will always be context that lives only in someone’s head: product intent, business norms, or the one thing the PM never wrote down.

Current agent research rewards autonomous problem solving. In deployment, where requirements are often vague, that autonomy has to be balanced with trustworthiness.

HiL-Dynamics

To unpack how different agent configurations ask, explore, and fail with underspecified tasks, we built a diagnostic tool: HiL-Dynamics. We ran four harnesses on HiL-Bench tasks and used HiL-Dynamics to study their trajectories: Claude-Code SDK with Claude Opus 4.7, Codex SDK with GPT-5.5, Antigravity/ADK with Gemini 3.1 Pro Preview, and OpenCode with GLM 5.1. We evaluated each harness both straight out of the box and with targeted customizations that a user might make to encourage maximal selective escalation.

HiL-Dynamics reveals three findings:

The judgment gap survives modern scaffolding. Stronger harnesses haven't taught agents when to ask.
Skill engineering is a real handle, but a harness-specific one. The skill tuning that lifts Codex from 7% to 53% pass@3 only takes Claude-Code from 3% to 12%.
Every {harness, model} has its own failure shape. Their optimal scaffolds diverge. There is no universal recipe.

View the HiL-Dynamics tool on GitHub.

Finding 1: The Performance Gap Still Exists

We first gather agent performance with the harnesses’ default system prompts and, if available, native question-asking tools. Claude-Code, Codex, and Antigravity all have these tools (AskUserQuestion, requestUserInput, and ask_question, respectively), while OpenCode does not. We provided OpenCode with our own custom MCP tool that mirrors HiL-Bench's original ask_human() tool.
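To make the shape of such an asking tool concrete, here is a minimal Python sketch of a stand-in for ask_human(). It is hypothetical: the factory name `make_ask_human`, the keyword-matching rule, and the fallback reply are all assumptions made for illustration, and neither the HiL-Bench tool nor our MCP wrapper is claimed to work this way.

```python
from typing import Callable, Dict, List


def make_ask_human(blocker_answers: Dict[str, str]) -> Callable[[str], str]:
    """Build a hypothetical ask_human() backed by a fixed set of hidden answers.

    Keys are keywords that identify a blocker; values are what the simulated
    human would reply. Keyword matching is an assumption for illustration only.
    """
    log: List[str] = []

    def ask_human(question: str) -> str:
        log.append(question)  # keep every question so asking behavior can be scored later
        lowered = question.lower()
        for keyword, answer in blocker_answers.items():
            if keyword.lower() in lowered:
                return answer
        # Questions that hit no blocker get an uninformative reply.
        return "I don't have any information about that."

    ask_human.log = log  # expose the question log on the function object
    return ask_human


ask = make_ask_human({"timeout": "Use a 30 second timeout."})
print(ask("What timeout should the retry logic use?"))
print(ask("Which color should the button be?"))
print(len(ask.log))
```

The point of the sketch is only that the tool is cheap to call and that every call is logged, so the question of whether an agent asks is separate from whether it can ask.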

Selective escalation still fails under strong harnesses

We found that selective escalation remains difficult, even when the harness provides the means to receive help. OpenCode, with the MCP tool, asked very sparingly. This matches what we saw in the HiL-Bench paper: selective escalation is not yet a fully trained model capability. The three agent SDKs that do have native asking tools didn’t use them much either, sometimes because the scaffold severely discourages or prevents escalation during implementation (as opposed to planning). We therefore believe that teaching agents to interweave exploration, planning, asking, and implementing in a single run is an important area of future work.

Finding 2: Skill Changes Can Lead to Stronger Strategies

We then used HiL-Dynamics to measure agent performance with additional enhancements intended to increase how often agents ask and, in turn, how often they pass. These included:

Thorough selective escalation `skill.md` for all agents
Reinforced guidance in the system prompt for all agents
For Claude-Code, Codex, and Antigravity, the custom ask_human() MCP tool in addition to their native question-asking tool, on the hypothesis that a custom tool would be free from the restrictions their native training or prompting places on the native one

The results improve markedly, with all SDKs showing significant gains. All agents ask much more frequently, and the three native-ask-tool agents use the custom tool far more effectively than their native one. This shows that agent SDKs can be highly sensitive to customizations, especially for a skill like selective escalation that isn't ingrained during training. It also suggests that agents are more eager to use custom user tools when provided, even when those tools overlap with a tool or capability they were trained with.

Beyond better performance, we uncovered interesting responses to different skill techniques. Guidance about when to ask for help improved performance, as did guidance about how to formulate a relevant question. Repeated reminders, in strong language, that the agents would fail to satisfy user desires if they didn’t ask also yielded some performance boost.

However, one important note is that no one skill yielded the maximal improvement from the default harness baseline for all agents. A skill that excels on one harness can even degrade performance on another. For example, Claude-Code and Codex have almost opposite asking priors, with the latter being more open to asking questions.

One skill variant closed the codebase escape hatch in the asking gate. The baseline instruction let agents skip asking whenever they believed the answer could be found in the codebase ("cannot resolve it from the codebase"). We replaced that clause with a strict no-inference rule ("even implicit answers from the codebase are not good enough"), so agents could no longer cite codebase patterns as justification for staying silent. This gave Claude-Code a +22% improvement in pass@3, as it stopped suppressing questions about blockers it would otherwise have tried to resolve on its own. Codex was already asking near its maximum whether or not the self-resolve permission was present, so tightening the condition had nothing left to unlock.

Another variant combined the same stricter gate with a stronger asking mandate: instruction language that forces asking through explicit "MUST ask" phrasing and failure framing, such as "you will fail this task if you do not ask". This raised Codex's pass@3 by +630% (0.073→0.533), with a near-10x increase in average questions per pass (0.5→4.7). Claude-Code reacted far more modestly: average questions per pass rose by only 0.9 (vs. Codex's +4.2) and pass@3 reached only 0.120, which suggests that Claude weighs even strong mandate text against its inference priors.
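The Codex figures above are relative changes, which are easy to misread as percentage points. A short Python check of the arithmetic, using only the numbers quoted in the text (the helper name `relative_change` is ours, introduced for illustration):

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before


# Codex figures quoted in the text for the stronger asking mandate.
codex_pass_before, codex_pass_after = 0.073, 0.533
codex_q_before, codex_q_after = 0.5, 4.7

pass_gain = relative_change(codex_pass_before, codex_pass_after)
question_ratio = codex_q_after / codex_q_before
question_delta = codex_q_after - codex_q_before

print(f"pass@3 relative gain: {pass_gain:+.0%}")    # +630%
print(f"questions ratio:      {question_ratio:.1f}x")  # 9.4x
print(f"questions delta:      {question_delta:+.1f}")  # +4.2
```

In absolute terms the same pass@3 change is 46 percentage points, from 7.3% to 53.3%.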

Codex appears to be much more responsive to mandate language, while Claude-Code's behavior depends far more on whether its inference escape hatch is open. Our findings suggest that skill engineering should be calibrated per harness rather than applied uniformly. Nevertheless, even with the best performance we could achieve through careful enhancements, all agents still show a substantial performance gap compared to when they are given all information upfront.

Finding 3: Agents Show Different Problem-Solving Patterns

Now that we know the judgment gap extends beyond SWE-Agent into today's state-of-the-art harnesses, the next question is how and why. This is where HiL-Dynamics shines. Below we highlight some important findings from this tool.

Trustworthiness vs. Autonomy

To assess more directly how well these agentic systems selectively escalate, we use the Ask-F1 metric from our original paper. We break the metric down into Blocker Recall (what fraction of the blockers the agent resolved) and Ask Precision (what fraction of the questions it asked were relevant). Recall gives a sense of how trustworthy an agentic system is (if there's a blocker, can I trust it to clarify?), and precision of how autonomous it is (can it finish its work without bothering the user indiscriminately?). While harness variations can improve recall or precision substantially, all agent systems still struggle with blocker recall: they currently cannot be trusted to surface blockers. Asking well is not the problem. Knowing when to ask is.
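As a rough illustration of how the two components combine, here is a minimal Python sketch. It assumes Ask-F1 is the usual harmonic mean of Ask Precision and Blocker Recall, as the F1 name suggests; the original paper is the authority on the exact definition. The `AskStats` container, its field names, and the counts in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AskStats:
    """Hypothetical per-task counts; field names are illustrative."""
    blockers_total: int      # blockers present in the task
    blockers_resolved: int   # blockers the agent resolved by asking
    questions_asked: int     # all questions the agent asked
    questions_relevant: int  # questions that addressed a real blocker


def blocker_recall(s: AskStats) -> float:
    return s.blockers_resolved / s.blockers_total if s.blockers_total else 0.0


def ask_precision(s: AskStats) -> float:
    return s.questions_relevant / s.questions_asked if s.questions_asked else 0.0


def ask_f1(s: AskStats) -> float:
    """Harmonic mean of precision and recall (assumed definition)."""
    p, r = ask_precision(s), blocker_recall(s)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up example: 3 blockers, 1 resolved; 2 questions asked, 1 relevant.
stats = AskStats(blockers_total=3, blockers_resolved=1,
                 questions_asked=2, questions_relevant=1)
print(f"recall={blocker_recall(stats):.2f} "
      f"precision={ask_precision(stats):.2f} f1={ask_f1(stats):.2f}")
```

Under this definition an agent that never asks scores zero on all three, while an agent that asks about everything can reach high recall at the cost of precision.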

These results reiterate our original findings from HiL-Bench: agents are still incapable of determining when they need to ask for help. Even when using what are generally accepted in the research and coding communities as strong harnesses, the shortcoming remains.

HiL-Dynamics also lets us look past aggregate pass rates and assess how an agent moved through an underspecified task. Two agents may fail for different reasons: one may write code before resolving blockers, while another may fail to validate its patches.

One Harness, Different Model Strategies

On the original SWE-Agent runs, related model families show similar explore-ask-write shapes under a fixed harness. For example, GPT models ask early while Claude models do more early exploration before asking.

These strategic phenotypes, however, can change considerably when we change the harness. While SWE-agent consistently pushes GPT models to ask early, we find that Native Codex shifts GPT-5.5 towards more early exploration. With a tuned skill, GPT-5.5 still explores early, but asks more overall.
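One simple way to quantify an explore-ask-write shape is the average position at which each action type occurs in a trajectory. The Python sketch below assumes a trajectory has already been reduced to a list of step labels; the labels, the function name `phase_positions`, and the two toy trajectories are hypothetical, and HiL-Dynamics is not claimed to compute its shapes this way.

```python
from collections import defaultdict
from typing import Dict, List


def phase_positions(steps: List[str]) -> Dict[str, float]:
    """Mean normalized position (0 = first step, 1 = last) of each action type."""
    if not steps:
        return {}
    span = max(len(steps) - 1, 1)  # avoid dividing by zero for one-step runs
    buckets: Dict[str, List[float]] = defaultdict(list)
    for index, label in enumerate(steps):
        buckets[label].append(index / span)
    return {label: sum(pos) / len(pos) for label, pos in buckets.items()}


# Toy trajectories, made up to contrast the two shapes described in the text.
ask_early = ["ask", "explore", "ask", "explore", "write", "write"]
explore_first = ["explore", "explore", "explore", "ask", "write", "write"]

print(phase_positions(ask_early)["ask"])      # asks cluster near the start
print(phase_positions(explore_first)["ask"])  # asks come after exploration
```

A lower mean position for "ask" corresponds to an ask-early shape, and a higher one to exploring before asking.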

Terminal States Show Different Failure Anatomy

Failed AskHuman trajectories end in different deterministic terminal states. This is useful for diagnosing how systems fail after, before, or around the help-seeking step. We find that failures vary not only by the model, but also by the harness itself.

For GPT-5.5, the failure modes we see with SWE-agent differ greatly from those of Codex. SWE-agent runs do far less validation, while Codex often submits despite visible errors.
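Comparing failure anatomy across setups comes down to tallying terminal states per {harness, model} and comparing the shares. A minimal Python sketch follows; the state labels and the placeholder data are hypothetical, chosen to echo the failure modes mentioned in the text, and are not results from our runs.

```python
from collections import Counter
from typing import Dict, Iterable


def terminal_state_shares(states: Iterable[str]) -> Dict[str, float]:
    """Fraction of failed trajectories that ended in each terminal state."""
    counts = Counter(states)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()} if total else {}


# Placeholder data for two unnamed setups; labels are illustrative only.
failed_runs = {
    "setup_a": ["no_validation", "no_validation",
                "wrote_before_asking", "submitted_with_errors"],
    "setup_b": ["submitted_with_errors", "submitted_with_errors",
                "submitted_with_errors", "no_validation"],
}

for setup, states in failed_runs.items():
    shares = terminal_state_shares(states)
    summary = ", ".join(f"{s}={v:.0%}" for s, v in sorted(shares.items()))
    print(f"{setup}: {summary}")
```

Shares rather than raw counts keep setups comparable when they fail on different numbers of tasks.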

Takeaways

In deployment settings, human collaboration often means acquiring information from people and seeking clarification. Across our HiL-Dynamics experiments, agents consistently struggled with selective escalation regardless of harness or model.

However, each setup balances exploration with escalation differently and fails in different ways. These differences point to targeted areas of improvement for future generations of models and harnesses, and they let users decide for themselves which setup best suits their needs.

Almost all problems encountered in real-world engineering situations will be underspecified; users frequently write vague problems and hold hidden assumptions or tribal knowledge. Agents need to do more than solve solo. They need to know when to ask for context hidden in people's heads. We hope HiL-Dynamics helps the community evaluate their models, harnesses, and customizations on underspecified problems.