Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.

Research 11. 03 2026

Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do

Agentic Rubrics is a method for verifying AI-generated code fixes. An agent explores the repo, writes a checklist for what a correct patch should do, and uses that rubric to score candidate fixes.

Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

Research 05. 03 2026

VeRO: Can AI Agents Build Better AI Agents?

VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton

Research 04. 03 2026

When AI Safety Becomes a Denial‑of‑Service for Defenders

Most AI safety benchmarks measure whether models help when they shouldn’t. But what happens when they refuse when they shouldn’t? An analysis of real-world defender interactions reveals how alignment systems can block legitimate cybersecurity work—exposing a blind spot in how AI safety is currently evaluated.

David Campbell

Research 17. 02 2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

LHAW (Long Horizon Augmented Workflows) is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.

George Pu, Mike Lee, Sam Denton

Showdown 28. 01 2026

How Profession Shapes LLM Usage: Insights from SEAL Showdown

We analyze 580k+ production prompts and 100k+ preference battles from SEAL Showdown to study how profession shapes LLM usage. We find that professional background—independent of topic—predicts prompt difficulty, task type, and model preference, with domain experts asking harder in-domain questions and ranking models differently. These results motivate profession-aware evaluation of LLMs in expert workflows.

Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang

Safety 23. 01 2026

MoReBench: Evaluating the Process of AI Moral Reasoning

MoReBench is a benchmark designed to evaluate the procedural moral reasoning of large language models. Using expert-authored rubrics across diverse ethical scenarios, it scores models on the structure and coherence of their reasoning rather than task outcomes. Our findings show that moral reasoning remains weakly correlated with established benchmarks and warrants targeted evaluation and training.

Brandon Handoko, Matthew Siegel, Mike Lee

Research 12. 01 2026

Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections

In our recent work, Imitation Learning for Multi-Turn LM Agents via On-Policy Expert Corrections, we expose the problem of covariate shift in SWE LM agents and propose a simple, practical fix that significantly improves training efficiency and agent robustness.

Niklas Lauffer

Research 17. 11 2025

Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops

We demonstrate that reinforcement learning can be used to fine-tune agents within realistic enterprise environments, leveraging task-specific feedback and structured rewards to substantially improve performance metrics compared to baseline models.

Jerry Chan, Vijay Kalmath, George Pu, Sam Denton