Scale Labs

Papers

Research papers and publications from Scale Labs covering AI evaluation, safety, benchmarking, and frontier model analysis.

4/13/2026 | HiL-BENCH (Human-in-Loop Benchmark)
Authors: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Hernández, Nandan Marwaha, Yannis Yiming He, Charles Wang, Fernando Carabedo, Alessa Castillo, Bing Liu
Tags: Research

3/12/2026 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
Authors: David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q. Knight
Tags: Safety

2/26/2026 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
Tags: Safety

2/25/2026 | VeRO: An Evaluation Harness for Agents to Optimize Agents
Authors: Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Sam Denton
Tags: Agents, Post-Training, Evaluation and Alignment

2/12/2026 | LHAW: Controllable Underspecification for Long-Horizon Tasks
Authors: George Pu, Michael S. Lee, Udari Madhushani Sehwag, David Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Sam Denton
Tags: Agents, Safety, Evaluation and Alignment

1/15/2026 | SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences?
Authors: Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernández Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernández Montoya, Bing Liu
Tags: Safety, Evaluation and Alignment

1/6/2026 | Agentic Rubrics as Contextual Verifiers for SWE Agents
Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
Tags: Agents, Safety, Evaluation and Alignment

12/22/2025 | MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Authors: Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q. Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine
Tags: Reasoning, Safety, Evaluation and Alignment

12/18/2025 | MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Authors: Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu
Tags: Agents, Reasoning, Safety, Evaluation and Alignment

12/17/2025 | Audio MultiChallenge
Authors: Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He
Tags: Multimodal, Safety, Evaluation and Alignment

11/25/2025 | PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach
Authors: Udari Madhushani Sehwag, Shayan Shabihi, Alex McAvoy, Vikash Sehwag, Yuancheng Xu, Dalton Towers, Furong Huang
Tags: Safety, Evaluation and Alignment

11/13/2025 | Professional Reasoning Bench
Authors: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, Yunzhong He
Tags: Safety, Evaluation and Alignment, Reasoning

11/10/2025 | ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Authors: Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton J. Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
Tags: Reasoning, Safety, Evaluation and Alignment

11/5/2025 | Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models
Authors: Boyi Wei, Zora Che, Nathaniel Li, Udari Madhushani Sehwag, Jasper Götting, Samira Nedungadi, Julian Michael, Summer Yue, Dan Hendrycks, Peter Henderson, Zifan Wang, Seth Donoughe, Mantas Mazeika
Tags: Safety, Evaluation and Alignment

10/28/2025 | Remote Labor Index: Measuring AI Automation of Remote Work
Authors: Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Sumana Basu, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Connor Smith, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernández Cardona, Annette Diamond
Tags: Agents, Safety, Evaluation and Alignment, Reasoning

10/20/2025 | REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Authors: Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
Tags: Reasoning, Agents, Safety, Evaluation and Alignment

10/15/2025 | Beyond Seeing: Evaluating Multimodal LLMs On Tool-enabled Image Perception, Transformation, and Reasoning
Authors: Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa
Tags: Safety, Evaluation and Alignment, Reasoning, Multimodal

10/8/2025 | Online Rubrics Elicitation: Dynamically Curating Evaluation Criteria via Pairwise Comparisons for LLM Post-Training
Authors: MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton J. Wang, Bing Liu, Yunzhong He, Afra Feyza Akyürek
Tags: Safety, Evaluation and Alignment, Post-Training

9/25/2025 | Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Tags: Post-Training, Science of Data

9/23/2025 | Progress over Points: Reframing LM Benchmarks Around Scientific Objectives
Authors: Alwin Jin, Sean M. Hendryx, Vaskar Nath
Tags: Safety, Evaluation and Alignment

9/19/2025 | SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Authors: Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler
Tags: Agents, Safety, Evaluation and Alignment

9/11/2025 | TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
Authors: Rakshith Sharma Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernández Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing
Tags: Safety, Evaluation and Alignment

8/26/2025 | Reliable Weak-to-Strong Monitoring of LLM Agents
Authors: Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang
Tags: Safety, Evaluation and Alignment, Oversight

8/13/2025 | Search-Time Data Contamination
Authors: Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang
Tags: Safety, Evaluation and Alignment, Oversight

7/23/2025 | MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
Authors: Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing
Tags: Reasoning, Safety, Evaluation and Alignment

7/23/2025 | Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Authors: Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean M. Hendryx
Tags: Science of Data, Post-Training

7/21/2025 | WebGuard: Building a Generalizable Guardrail for Web Agents
Authors: Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su
Tags: Agents, Safety, Evaluation and Alignment

7/15/2025 | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Authors: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
Tags: Reasoning, Oversight, Safety, Evaluation and Alignment

6/28/2025 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Authors: Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Tags: Post-Training, Reasoning

6/18/2025 | FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Authors: Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael
Tags: Safety, Evaluation and Alignment

6/16/2025 | Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
Authors: Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, Sean M. Hendryx
Tags: Reasoning

6/13/2025 | Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
Authors: Jeff Da, Clinton J. Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean M. Hendryx
Tags: Agents, Post-Training, Reasoning

6/5/2025 | A Red Teaming Roadmap Towards System-Level Safety
Authors: Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow E. Primack, Julian Michael
Tags: Safety, Evaluation and Alignment

5/9/2025 | Assessing Robustness to Spurious Correlations in Post-Training Language Models
Authors: Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Sam Denton
Tags: Post-Training, Science of Data

3/14/2025 | Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
Authors: Will LeVine, Bijan Varjavand
Tags: Reasoning

3/8/2025 | Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
Authors: Benjamin Jensen, Ian Reynolds, Yasir Atalan, Michael Garcia, Austin Woo, Anthony Chen, Trevor Howarth
Tags: Safety, Evaluation and Alignment

3/5/2025 | The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Authors: Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini
Tags: Safety, Evaluation and Alignment

2/13/2025 | ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges
Authors: Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean M. Hendryx, Summer Yue, Dan Hendrycks
Tags: Reasoning, Safety, Evaluation and Alignment

2/11/2025 | J2: Jailbreaking to Jailbreak
Authors: Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang
Tags: Safety, Evaluation and Alignment

2/10/2025 | ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Authors: Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing
Tags: Safety, Evaluation and Alignment

1/29/2025 | MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Authors: Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Hernández Cardona, Dean Lee, Jeremy Kritz, Willow E. Primack, Summer Yue, Chen Xing
Tags: Safety, Evaluation and Alignment, Reasoning

1/23/2025 | Humanity's Last Exam
Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Summer Yue, Alexandr Wang, Dan Hendrycks
Tags: Safety, Evaluation and Alignment, Reasoning

1/2/2025 | ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean M. Hendryx
Tags: Safety, Evaluation and Alignment, Reasoning, Oversight

10/11/2024 | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean M. Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
Tags: Safety, Evaluation and Alignment

9/29/2024 | Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Authors: Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton
Tags: Post-Training, Science of Data

9/27/2024 | Revisiting the Superficial Alignment Hypothesis
Authors: Mohit Raghavendra, Vaskar Nath, Sean M. Hendryx
Tags: Post-Training

9/5/2024 | Planning In Natural Language Improves LLM Search For Code Generation
Authors: Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, Hugh Zhang
Tags: Post-Training

8/30/2024 | Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Authors: Spencer Whitehead, Jacob Phillips, Sean M. Hendryx
Tags: Safety, Evaluation and Alignment, Multimodal, Science of Data

8/27/2024 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Authors: Nathaniel Li, Ziwen Han, Ian Steneker, Willow E. Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue
Tags: Safety, Evaluation and Alignment

7/18/2024 | Learning Goal-Conditioned Representations for Language Reward Models
Authors: Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean M. Hendryx
Tags: Post-Training

5/1/2024 | A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean M. Hendryx, Russell Kaplan, Michele (Mike) Lunati, Summer Yue
Tags: Safety, Evaluation and Alignment

3/5/2024 | The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
Tags: Safety, Evaluation and Alignment, Post-Training

1/22/2024 | Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy
Authors: Will LeVine, Benjamin Pikus, Jacob Phillips, Berk Norman, Fernando Amat Gil, Sean M. Hendryx
Tags: Computer Vision

11/21/2023 | A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift
Authors: Will LeVine, Benjamin Pikus, Anthony Chen, Sean M. Hendryx
Tags: Post-Training

10/5/2023 | A Holistic Approach For Test And Evaluation Of Large Language Models
Authors: Dylan Slack, Jean Wang, Denis Semenenko, Kate Park, Sean M. Hendryx
Tags: Safety, Evaluation and Alignment

10/4/2023 | On the Performance of Multimodal Language Models
Authors: Utsav Garg, Erhan Bas
Tags: Multimodal, Post-Training

4/28/2023 | Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs
Authors: George Pu, Anirudh Jain, Jihan Yin, Russell Kaplan
Tags: Post-Training

4/11/2023 | Detecting and Preventing Hallucinations in Large Vision Language Models
Authors: Anisha Gunjal, Jihan Yin, Erhan Bas
Tags: Computer Vision

3/11/2023 | Enabling Calibration In The Zero-shot Inference Of Large Vision-Language Models
Authors: Will LeVine, Benjamin Pikus, Pranav Raja, Fernando Amat Gil
Tags: Computer Vision

1/29/2023 | Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing
Authors: Yatong Bai, Brendon G. Anderson, Aerin Kim, Somayeh Sojoudi
Tags: Safety, Evaluation and Alignment

3/7/2022 | GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction
Authors: Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga
Tags: Computer Vision

11/16/2021 | CAR: Cityscapes Attributes Recognition, A Multi-category Attributes Dataset for Autonomous Vehicles
Authors: Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga
Tags: Computer Vision

11/7/2021 | Natural Adversarial Objects
Authors: Felix Lau, Nishant Subramani, Sasha Harrison, Aerin Kim, Elliot Branson, Rosanne Liu
Tags: Computer Vision

10/11/2021 | DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates
Authors: John Pougué-Biyong, Valentina Semenova, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, J. Doyne Farmer
Tags: Safety, Evaluation and Alignment

7/31/2021 | On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models
Authors: Zeyad Emam, Andrew Kondrich, Sasha Harrison, Felix Lau, Yushi Wang, Aerin Kim, Elliot Branson
Tags: Computer Vision, Science of Data

4/20/2021 | Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset
Authors: Matthew Groh, Caleb Harris, Luis Soenksen, Felix Lau, Rachel Han, Aerin Kim, Arash Koochek, Omar Badri
Tags: Computer Vision

11/27/2020 | A Survey of Deep Learning Approaches for OCR and Document Understanding
Authors: Nishant Subramani, Alexandre Matton, Malcolm Greaves, Adrian Lam
Tags: Computer Vision

67 papers found

Copyright 2026 Scale Inc. All rights reserved.
