Scale Labs

Papers

Research papers and publications from Scale Labs covering AI evaluation, safety, benchmarking, and frontier model analysis.

4/13/2026 | HiL-BENCH (Human-in-Loop Benchmark)
Authors: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Hernández, Nandan Marwaha, Yannis Yiming He, Charles Wang, Fernando Carabedo, Alessa Castillo, Bing Liu
Tags: Research

3/12/2026 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
Authors: David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q. Knight
Tags: Safety

2/26/2026 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
Tags: Safety

2/25/2026 | VeRO: An Evaluation Harness for Agents to Optimize Agents
Authors: Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Sam Denton
Tags: Agents, Post-Training, Evaluation and Alignment

2/12/2026 | LHAW: Controllable Underspecification for Long-Horizon Tasks
Authors: George Pu, Michael S. Lee, Udari Madhushani Sehwag, David Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Sam Denton
Tags: Agents, Safety, Evaluation and Alignment

1/15/2026 | SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences?
Authors: Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernández Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernández Montoya, Bing Liu
Tags: Safety, Evaluation and Alignment

1/6/2026 | Agentic Rubrics as Contextual Verifiers for SWE Agents
Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
Tags: Agents, Safety, Evaluation and Alignment

12/22/2025 | MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Authors: Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q. Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine
Tags: Reasoning, Safety, Evaluation and Alignment

12/18/2025 | MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Authors: Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu
Tags: Agents, Reasoning, Safety, Evaluation and Alignment

12/17/2025 | Audio MultiChallenge
Authors: Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He
Tags: Multimodal, Safety, Evaluation and Alignment

11/25/2025 | PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
Authors: Udari Madhushani Sehwag, Shayan Shabihi, Alex McAvoy, Vikash Sehwag, Yuancheng Xu, Dalton Towers, Furong Huang
Tags: Safety, Evaluation and Alignment

11/13/2025 | Professional Reasoning Bench
Authors: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, Yunzhong He
Tags: Safety, Evaluation and Alignment, Reasoning

11/10/2025 | ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Authors: Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton J. Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
Tags: Reasoning, Safety, Evaluation and Alignment

11/5/2025 | Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models
Authors: Boyi Wei, Zora Che, Nathaniel Li, Udari Madhushani Sehwag, Jasper Götting, Samira Nedungadi, Julian Michael, Summer Yue, Dan Hendrycks, Peter Henderson, Zifan Wang, Seth Donoughe, Mantas Mazeika
Tags: Safety, Evaluation and Alignment

10/28/2025 | Remote Labor Index: Measuring AI Automation of Remote Work
Authors: Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Sumana Basu, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Connor Smith, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernández Cardona, Annette Diamond
Tags: Agents, Safety, Evaluation and Alignment, Reasoning

10/20/2025 | REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Authors: Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
Tags: Reasoning, Agents, Safety, Evaluation and Alignment

10/15/2025 | Beyond Seeing: Evaluating Multimodal LLMs On Tool-enabled Image Perception, Transformation, and Reasoning
Authors: Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa
Tags: Safety, Evaluation and Alignment, Reasoning, Multimodal

10/8/2025 | Online Rubrics Elicitation: Dynamically Curating Evaluation Criteria via Pairwise Comparisons for LLM Post-Training
Authors: MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton J. Wang, Bing Liu, Yunzhong He, Afra Feyza Akyürek
Tags: Safety, Evaluation and Alignment, Post-Training

9/25/2025 | Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Tags: Post-Training, Science of Data

9/23/2025 | Progress over Points: Reframing LM Benchmarks Around Scientific Objectives
Authors: Alwin Jin, Sean M. Hendryx, Vaskar Nath
Tags: Safety, Evaluation and Alignment

9/19/2025 | SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Authors: Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
Tags: Agents, Safety, Evaluation and Alignment

9/11/2025 | TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
Authors: Rakshith Sharma Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernández Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing
Tags: Safety, Evaluation and Alignment

8/26/2025 | Reliable Weak-to-Strong Monitoring of LLM Agents
Authors: Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang
Tags: Safety, Evaluation and Alignment, Oversight

8/13/2025 | Search-Time Data Contamination
Authors: Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang
Tags: Safety, Evaluation and Alignment, Oversight

7/23/2025 | MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
Authors: Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing
Tags: Reasoning, Safety, Evaluation and Alignment

7/23/2025 | Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Authors: Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean M. Hendryx
Tags: Science of Data, Post-Training

7/21/2025 | WebGuard: Building a Generalizable Guardrail for Web Agents
Authors: Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su
Tags: Agents, Safety, Evaluation and Alignment

7/15/2025 | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Authors: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
Tags: Reasoning, Oversight, Safety, Evaluation and Alignment

6/28/2025 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Authors: Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Tags: Post-Training, Reasoning

6/18/2025 | FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Authors: Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael
Tags: Safety, Evaluation and Alignment

6/16/2025 | Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
Authors: Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, Sean M. Hendryx
Tags: Reasoning

6/13/2025 | Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
Authors: Jeff Da, Clinton J. Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean M. Hendryx
Tags: Agents, Post-Training, Reasoning

6/5/2025 | A Red Teaming Roadmap Towards System-Level Safety
Authors: Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow E. Primack, Julian Michael
Tags: Safety, Evaluation and Alignment

5/9/2025 | Assessing Robustness to Spurious Correlations in Post-Training Language Models
Authors: Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Sam Denton
Tags: Post-Training, Science of Data

3/14/2025 | Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
Authors: Will LeVine, Bijan Varjavand
Tags: Reasoning

3/8/2025 | Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
Authors: Benjamin Jensen, Ian Reynolds, Yasir Atalan, Michael Garcia, Austin Woo, Anthony Chen, Trevor Howarth
Tags: Safety, Evaluation and Alignment

3/5/2025 | The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Authors: Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini
Tags: Safety, Evaluation and Alignment

2/13/2025 | ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges
Authors: Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean M. Hendryx, Summer Yue, Dan Hendrycks
Tags: Reasoning, Safety, Evaluation and Alignment

2/11/2025 | J2: Jailbreaking to Jailbreak
Authors: Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang
Tags: Safety, Evaluation and Alignment

2/10/2025 | ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Authors: Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing
Tags: Safety, Evaluation and Alignment

1/29/2025 | MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Authors: Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Hernández Cardona, Dean Lee, Jeremy Kritz, Willow E. Primack, Summer Yue, Chen Xing
Tags: Safety, Evaluation and Alignment, Reasoning

1/23/2025 | Humanity's Last Exam
Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Summer Yue, Alexandr Wang, Dan Hendrycks
Tags: Safety, Evaluation and Alignment, Reasoning

1/2/2025 | ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean M. Hendryx
Tags: Safety, Evaluation and Alignment, Reasoning, Oversight

10/11/2024 | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean M. Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
Tags: Safety, Evaluation and Alignment

9/29/2024 | Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Authors: Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton
Tags: Post-Training, Science of Data

9/27/2024 | Revisiting the Superficial Alignment Hypothesis
Authors: Mohit Raghavendra, Vaskar Nath, Sean M. Hendryx
Tags: Post-Training

9/5/2024 | Planning In Natural Language Improves LLM Search For Code Generation
Authors: Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, Hugh Zhang
Tags: Post-Training

8/30/2024 | Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Authors: Spencer Whitehead, Jacob Phillips, Sean M. Hendryx
Tags: Safety, Evaluation and Alignment, Multimodal, Science of Data

8/27/2024 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Authors: Nathaniel Li, Ziwen Han, Ian Steneker, Willow E. Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue
Tags: Safety, Evaluation and Alignment

7/18/2024 | Learning Goal-Conditioned Representations for Language Reward Models
Authors: Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean M. Hendryx
Tags: Post-Training

5/1/2024 | A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean M. Hendryx, Russell Kaplan, Michele (Mike) Lunati, Summer Yue
Tags: Safety, Evaluation and Alignment

3/5/2024 | The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
Tags: Safety, Evaluation and Alignment, Post-Training

1/22/2024 | Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy
Authors: Will LeVine, Benjamin Pikus, Jacob Phillips, Berk Norman, Fernando Amat Gil, Sean M. Hendryx
Tags: Computer Vision

11/21/2023 | A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift
Authors: Will LeVine, Benjamin Pikus, Anthony Chen, Sean M. Hendryx
Tags: Post-Training

10/5/2023 | A Holistic Approach For Test And Evaluation Of Large Language Models
Authors: Dylan Slack, Jean Wang, Denis Semenenko, Kate Park, Sean M. Hendryx
Tags: Safety, Evaluation and Alignment

10/4/2023 | On the Performance of Multimodal Language Models
Authors: Utsav Garg, Erhan Bas
Tags: Multimodal, Post-Training

4/28/2023 | Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs
Authors: George Pu, Anirudh Jain, Jihan Yin, Russell Kaplan
Tags: Post-Training

4/11/2023 | Detecting and Preventing Hallucinations in Large Vision Language Models
Authors: Anisha Gunjal, Jihan Yin, Erhan Bas
Tags: Computer Vision

3/11/2023 | Enabling Calibration In The Zero-shot Inference Of Large Vision-Language Models
Authors: Will LeVine, Benjamin Pikus, Pranav Raja, Fernando Amat Gil
Tags: Computer Vision

1/29/2023 | Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing
Authors: Yatong Bai, Brendon G. Anderson, Aerin Kim, Somayeh Sojoudi
Tags: Safety, Evaluation and Alignment

3/7/2022 | GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction
Authors: Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga
Tags: Computer Vision

11/16/2021 | CAR: Cityscapes Attributes Recognition, A Multi-category Attributes Dataset for Autonomous Vehicles
Authors: Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga
Tags: Computer Vision

11/7/2021 | Natural Adversarial Objects
Authors: Felix Lau, Nishant Subramani, Sasha Harrison, Aerin Kim, Elliot Branson, Rosanne Liu
Tags: Computer Vision

10/11/2021 | DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates
Authors: John Pougué-Biyong, Valentina Semenova, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, J. Doyne Farmer
Tags: Safety, Evaluation and Alignment

7/31/2021 | On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models
Authors: Zeyad Emam, Andrew Kondrich, Sasha Harrison, Felix Lau, Yushi Wang, Aerin Kim, Elliot Branson
Tags: Computer Vision, Science of Data

4/20/2021 | Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset
Authors: Matthew Groh, Caleb Harris, Luis Soenksen, Felix Lau, Rachel Han, Aerin Kim, Arash Koochek, Omar Badri
Tags: Computer Vision

11/27/2020 | A Survey of Deep Learning Approaches for OCR and Document Understanding
Authors: Nishant Subramani, Alexandre Matton, Malcolm Greaves, Adrian Lam
Tags: Computer Vision

67 papers found

Copyright 2026 Scale Inc. All rights reserved.
