Blog

Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.

2026

16 posts

6/16/2026
Insights Generator: Automated Failure Mode Analysis for Agents
Agents, Evaluation and Alignment
Akshay Manglik, Veronica Chatrath, Yuan (Emily) Xue

6/4/2026
Can Coding Agents Tackle Early-Stage Drug Discovery?
Agents, Evaluation and Alignment, Reasoning, Enterprise
Afra Feyza Akyürek, Xinming Tu, Sofia Monasdotter, Yuanhao Qu, Sergey Chekhov, Sami Hassaan

5/27/2026
HiL-Dynamics: Understanding Agents That Don’t Know What They Don’t Know
Agents, Evaluation and Alignment
Tu Trinh, Kelvin Luu, Weijun Luo, Matthew Siegel, Mohamed Elfeki

5/19/2026
The Path to Large Scale Dense Video Captioning
Multimodal, Physical AI, Science of Data
Jade Choghari, Agustin Sansone, Nicolas Pasqualis, Conrado Mader, Aleks Tiupikov, Mouli Sivapurapu

5/11/2026
57 Healthcare Professionals Told Us What They Need from AI
Evaluation and Alignment, Enterprise
Sami Hassaan, Oscar Kavanagh, Matthew Siegel

5/6/2026
Coverage Not Averages: Rethinking Retrieval Evaluation
Evaluation and Alignment, Enterprise
Andrew Klearman, Radu Revutchi, Rohin Garg

4/6/2026
Improving Multi-Turn Tool Use with GRPO: Results and Insights
Post-Training, Agents
Razvan Dumitru, Chetan Rane, Sami Hassaan, Divyansh Agarwal

3/23/2026
MultiChallenge Update: A More Reliable Multi-Turn Benchmark
Evaluation and Alignment
Vipul Gupta, Matthew Siegel, Marcos Ayestaran

3/20/2026
Voice Showdown: An In-the-Wild Preference Arena for Voice AI
Evaluation and Alignment, Multimodal
Advait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki

3/11/2026
Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do
Agents, Evaluation and Alignment
Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

3/5/2026
VeRO: Can AI Agents Build Better AI Agents?
Agents, Evaluation and Alignment
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton

3/4/2026
When AI Safety Becomes a Denial‑of‑Service for Defenders
Safety & Oversight, Enterprise
David Campbell

2/17/2026
Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks
Agents, Evaluation and Alignment, Science of Data
George Pu, Mike Lee, Sam Denton

1/28/2026
How Profession Shapes LLM Usage: Insights from Scale Showdown
Evaluation and Alignment, Enterprise
Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang

1/23/2026
MoReBench: Evaluating the Process of AI Moral Reasoning
Evaluation and Alignment, Reasoning, Safety & Oversight
Brandon Handoko, Matthew Siegel, Mike Lee

1/12/2026
Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections
Post-Training, Agents
Niklas Lauffer

2025

1 post

11/17/2025
Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops
Post-Training, Agents, Enterprise
Jerry Chan, Vijay Kalmath, George Pu, Sam Denton

17 posts found