Scale Labs

Copyright 2026 Scale Inc. All rights reserved.


[Blog]

Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.

Date | Title | Topics | Authors
2026
16 posts
6/16/2026 | Insights Generator: Automated Failure Mode Analysis for Agents | Agents, Evaluation and Alignment | Akshay Manglik, Veronica Chatrath, Yuan (Emily) Xue
6/4/2026 | Can Coding Agents Tackle Early-Stage Drug Discovery? | Agents, Evaluation and Alignment, Reasoning, Enterprise | Afra Feyza Akyürek, Xinming Tu, Sofia Monasdotter, Yuanhao Qu, Sergey Chekhov, Sami Hassaan
5/27/2026 | HiL-Dynamics: Understanding Agents That Don’t Know What They Don’t Know | Agents, Evaluation and Alignment | Tu Trinh, Kelvin Luu, Weijun Luo, Matthew Siegel, Mohamed Elfeki
5/19/2026 | The Path to Large Scale Dense Video Captioning | Multimodal, Physical AI, Science of Data | Jade Choghari, Agustin Sansone, Nicolas Pasqualis, Conrado Mader, Aleks Tiupikov, Mouli Sivapurapu
5/11/2026 | 57 Healthcare Professionals Told Us What They Need from AI | Evaluation and Alignment, Enterprise | Sami Hassaan, Oscar Kavanagh, Matthew Siegel
5/6/2026 | Coverage Not Averages: Rethinking Retrieval Evaluation | Evaluation and Alignment, Enterprise | Andrew Klearman, Radu Revutchi, Rohin Garg
4/6/2026 | Improving Multi-Turn Tool Use with GRPO: Results and Insights | Post-Training, Agents | Razvan Dumitru, Chetan Rane, Sami Hassaan, Divyansh Agarwal
3/23/2026 | MultiChallenge Update: A More Reliable Multi-Turn Benchmark | Evaluation and Alignment | Vipul Gupta, Matthew Siegel, Marcos Ayestaran
3/20/2026 | Voice Showdown: An In-the-Wild Preference Arena for Voice AI | Evaluation and Alignment, Multimodal | Advait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki
3/11/2026 | Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do | Agents, Evaluation and Alignment | Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
3/5/2026 | VeRO: Can AI Agents Build Better AI Agents? | Agents, Evaluation and Alignment | Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton
3/4/2026 | When AI Safety Becomes a Denial‑of‑Service for Defenders | Safety & Oversight, Enterprise | David Campbell
2/17/2026 | Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks | Agents, Evaluation and Alignment, Science of Data | George Pu, Mike Lee, Sam Denton
1/28/2026 | How Profession Shapes LLM Usage: Insights from Scale Showdown | Evaluation and Alignment, Enterprise | Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang
1/23/2026 | MoReBench: Evaluating the Process of AI Moral Reasoning | Evaluation and Alignment, Reasoning, Safety & Oversight | Brandon Handoko, Matthew Siegel, Mike Lee
1/12/2026 | Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections | Post-Training, Agents | Niklas Lauffer
2025
1 post
11/17/2025 | Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops | Post-Training, Agents, Enterprise | Jerry Chan, Vijay Kalmath, George Pu, Sam Denton

17 posts found