Research Mar 11, 2026

Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do

Agentic Rubrics is a method for verifying AI-generated code fixes. An agent explores the repo, writes a checklist for what a correct patch should do, and uses that rubric to score candidate fixes.

Research Mar 5, 2026

VeRO: Can AI Agents Build Better AI Agents?

VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.

Research Mar 4, 2026

When AI Safety Becomes a Denial‑of‑Service for Defenders

Most AI safety benchmarks measure whether models help when they shouldn’t. But what happens when they refuse when they shouldn’t? An analysis of real-world defender interactions reveals how alignment systems can block legitimate cybersecurity work—exposing a blind spot in how AI safety is currently evaluated.

Research Feb 17, 2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

LHAW is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.

