Date Title Category Authors

02.25.2026

VeRO: An Evaluation Harness for Agents to Optimize Agents

Agents, Post-Training, Evaluation and Alignment

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan (Emily) Xue, Sam Denton

02.12.2026

LHAW: Controllable Underspecification for Long-Horizon Tasks

Agents, Safety, Evaluation and Alignment

George Pu* , Michael S. Lee* , Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, and Samuel Marc Denton *Indicates equal contribution

01.15.2026

SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences?

Safety, Evaluation and Alignment

Udari Madhushani Sehwag1, Elaine Lau1†, Haniyeh Ehsani Oskouie2,5, Shayan Shabihi3, Erich Liang4,5, Andrea Toledo1, Guillermo Mangialardi1, Sergio Fonrouge1, Ed-Yeremai Hernández Cardona1, Paula Vergara1, Utkarsh Tyagi1, Chen Bo Calvin Zhang1, Pavi Bhatter1, Nicholas Johnson1, Furong Huang3, Ernesto Gabriel Hernández Montoya1, and Bing Liu1 1Scale AI, 2University of California, Los Angeles, 3University of Maryland, 4Princeton University, 5Human Frontier Collective, Scale AI †Work done while at Scale AI

01.06.2026

Agentic Rubrics as Contextual Verifiers for SWE Agents

Agents, Safety, Evaluation and Alignment

Mohit Raghavendra*, Anisha Gunjal*, Bing Liu, Yunzhong He *Equal contribution.

12.22.2025

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Reasoning, Safety, Evaluation and Alignment

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

12.18.2025

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Agents, Reasoning, Safety, Evaluation and Alignment

Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu

12.17.2025

Audio MultiChallenge

Multimodal, Safety, Evaluation and Alignment

Advait Gosai*, Tyler Vuong*, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He *Indicates equal contribution

11.25.2025

PropensityBench

Safety, Evaluation and Alignment

Udari Madhushani Sehwag1* , Shayan Shabihi2* , Alex McAvoy3 , Vikash Sehwag4 , Yuancheng Xu5, Dalton Towers6 , Furong Huang2 1Scale AI, 2University of Maryland, College Park, 3University of North Carolina at Chapel Hill, 4Google DeepMind, 5Netflix, 6University of Texas at Austin * Equal Contributions

11.13.2025

Professional Reasoning Benchmark

Safety, Evaluation and Alignment, Reasoning

Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, and Yunzhong He

11.10.2025

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Reasoning, Safety, Evaluation and Alignment

Manasi Sharma1, Chen Bo Calvin Zhang1, Chaithanya Bandi1, Clinton Wang†, Ankit Aich1, Huy Nghiem2, Tahseen Rabbani3, Ye Htet4, Brian Jang1 , Sumana Basu5 , Aishwarya Balwani1, Denis Peskoff6 , Marcos Ayestaran1 , Sean M. Hendryx†, Brad Kenstler1, Bing Liu1 1Scale AI, 2University of Maryland, 3University of Chicago, 4Washington University, St. Louis, 5McGill University, 6University of California, Berkeley †Work conducted while at Scale AI

11.05.2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Safety, Evaluation and Alignment

Boyi Wei1, 2∗† , Zora Che1, 3∗† , Nathaniel Li1†, Udari Madhushani Sehwag1 , Jasper Götting4 , Samira Nedungadi4 , Julian Michael1†, Summer Yue1†, Dan Hendrycks5 , Peter Henderson2 , Zifan Wang1†, Seth Donoughe4 , Mantas Mazeika5 1Scale AI, 2Princeton University, 3University of Maryland, 4SecureBio, 5Center for AI Safety ∗ Equal Contributions, † Work done while at Scale AI

10.28.2025

Remote Labor Index: Measuring AI Automation of Remote Work

Agents, Safety, Evaluation and Alignment, Reasoning

Mantas Mazeika*1, Alice Gatti*1, Cristina Menghini*†, Udari Madhushani Sehwag*2, Shivam Singhal*†, Yury Orlovskiy*1, Steven Basart1, Manasi Sharma2, Denis Peskoff2, Elaine Lau2, Sumana Basu2, Jaehyuk Lim1, Lachlan Carroll1, Alice Blair1, Vinaya Sivakumar1, Brad Kenstler2, Yuntao Ma†, Julian Michael†, Xiaoke Li1, Oliver Ingebretsen1, Aditya Mehta1, Jean Mottola1, John Teichmann‡, Kevin Yu‡, Zaina Shaik‡, Adam Khoja1, Richard Ren1, Jason Hausenloy1, Long Phan1, Connor Smith1, Ye Htet2, Ankit Aich2, Tahseen Rabbani2, Vivswan Shah†, Andriy Novykov1, Felix Binder†, Kirill Chugunov2, Luis Ramirez2, Matias Geralnik2, Hernán Mesura2, Dean Lee2, Ed-Yeremai Hernandez Cardona2, Annette Diamond2, Summer Yue**†, Alexandr Wang**†, Bing Liu**2, Ernesto Hernandez**2, Dan Hendrycks**1 1Center for AI Safety, 2Scale AI *Equal contribution **Senior authors †Work done while at Scale AI ‡Work done while at CAIS

10.20.2025

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Reasoning, Agents, Safety, Evaluation and Alignment

Zafir Stojanovski1∗, Oliver Stanley1,2∗, Joe Sharratt1∗, Richard Jones1∗, Abdulhakeem Adefioye1, Jean Kaddour3†, Andreas Köpf1† 1Open-Thought, 2Scale AI, 3University College London

10.15.2025

Beyond Seeing: Evaluating Multimodal LLMs On Tool-enabled Image Perception, Transformation, and Reasoning

Safety, Evaluation and Alignment, Reasoning, Multimodal

Xingang Guo1,2 , Utkarsh Tyagi1 , Advait Gosai1 , Paula Vergara1 , Ernesto Gabriel Hernandez Montoya1 , Chen Bo Calvin Zhang1 , Bin Hu2 , Yunzhong He1 , Bing Liu1 , Rakshith Sharma Srinivasa1 1Scale AI, 2University of Illinois at Urbana-Champaign

10.08.2025

Online Rubrics Elicitation from Pairwise Comparisons

Safety, Evaluation and Alignment, Post-Training

MohammadHossein Rezaei1,2,∗, Robert Vacareanu1, Zihao Wang1, Clinton Wang1, Bing Liu1, Yunzhong He1 , and Afra Feyza Akyürek1 1Scale AI, 2University of Arizona *Work done during internship at Scale AI

09.25.2025

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Post-Training, Science of Data

Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin

09.23.2025

Progress over Points: Reframing LM Benchmarks Around Scientific Objectives

Safety, Evaluation and Alignment

Alwin Jin, Sean M. Hendryx, Vaskar Nath

09.19.2025

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Agents, Safety, Evaluation and Alignment

Xiang Deng*, Jeff Da*, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler *Co-first author and equal contributions.

09.11.2025

TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

Safety, Evaluation and Alignment

Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing

08.26.2025

Reliable Weak-to-Strong Monitoring of LLM Agents

Safety, Evaluation and Alignment, Oversight

Neil Kale1, 2, †, Chen Bo Calvin Zhang1, *, Kevin Zhu1, 3, †, *, Ankit Aich1 , Paula Rodriguez1 , Scale Red Team1 , Christina Q. Knight1 , and Zifan Wang1 1Scale AI, 2Carnegie Mellon University, 3Massachusetts Institute of Technology * Equal Contributions, †Work done during internship at Scale AI

08.13.2025

Search-Time Data Contamination

Safety, Evaluation and Alignment, Oversight

Ziwen Han, Meher Mankikar, Julian Michael, and Zifan Wang

07.23.2025

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Reasoning, Safety, Evaluation and Alignment

Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing

07.23.2025

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Science of Data, Post-Training

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean Hendryx

07.21.2025

WebGuard: Building a Generalizable Guardrail for Web Agents

Agents, Safety, Evaluation and Alignment

Boyuan Zheng1, Zeyi Liao1, Scott Salisbury1, Zeyuan Liu1, Michael Lin1, Qinyuan Zheng1, Zifan Wang2, Xiang Deng2, Dawn Song3, Huan Sun1, Yu Su1 1The Ohio State University 2Scale AI 3University of California, Berkeley

07.15.2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Reasoning, Oversight, Safety, Evaluation and Alignment

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik

06.28.2025

Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Post-Training, Reasoning

Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael

06.18.2025

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

Safety, Evaluation and Alignment

Christina Q. Knight∗, Kaustubh Deshpande⋄, Ved Sirdeshmukh⋄, Meher Mankikar, Scale Red Team, SEAL Research Team, and Julian Michael ∗ Project Lead, ⋄ Equal Contribution

06.16.2025

Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models

Reasoning

Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Barhate, Sean Hendryx

06.13.2025

Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Agents, Post-Training, Reasoning

Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean Hendryx

06.05.2025

A Red Teaming Roadmap Towards System-Level Safety

Safety, Evaluation and Alignment

Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow E. Primack, Julian Michael

05.09.2025

Assessing Robustness to Spurious Correlations in Post-Training Language Models

Post-Training, Science of Data

Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Samuel Denton

03.14.2025

Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking

Reasoning

Will LeVine, Bijan Varjavand

03.08.2025

Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models

Safety, Evaluation and Alignment

Benjamin Jensen, Ian Reynolds, Yasir Atalan, Michael Garcia, Austin Woo, Anthony Chen, Trevor Howarth

03.05.2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Safety, Evaluation and Alignment

Richard Ren*1, Arunim Agarwal*1, Mantas Mazeika*1, Cristina Menghini*2, Robert Vacareanu2, Brad Kenstler2, Mick Yang1, Isabelle Barrass1, Alice Gatti1, Xuwang Yin1, Eduardo Trevino2, Matias Geralnik2, Adam Khoja1, Dean Lee2, Summer Yue2, Dan Hendrycks1 1Center for AI Safety, 2Scale AI *Equal contribution.

02.13.2025

ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges

Reasoning, Safety, Evaluation and Alignment

Clinton J. Wang1, Dean Lee1 , Cristina Menghini1, Johannes Mols1, Jack Doughty1, Adam Khoja2, Jayson Lynch3, Sean Hendryx1, Summer Yue1, Dan Hendrycks2 1Scale AI, 2Center for AI Safety, 3MIT

02.11.2025

J2: Jailbreaking to Jailbreak

Safety, Evaluation and Alignment

Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang

02.10.2025

ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Safety, Evaluation and Alignment

Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing

01.29.2025

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Safety, Evaluation and Alignment, Reasoning

Ved Sirdeshmukh*, Kaustubh Deshpande*, Johannes Mols*, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, Chen Xing *Indicates Equal Contribution

01.23.2025

Humanity's Last Exam

Safety, Evaluation and Alignment, Reasoning

Long Phan∗1 , Alice Gatti∗1 , Ziwen Han∗2 , Nathaniel Li∗1 , Josephina Hu2 , Hugh Zhang‡, Sean Shi2, Michael Choi2, Anish Agrawal2, Arnav Chopra2, Adam Khoja1, Ryan Kim†, Richard Ren1, Jason Hausenloy1, Oliver Zhang1 , Mantas Mazeika1 , Summer Yue∗∗2 , Alexandr Wang∗∗2 , Dan Hendrycks∗∗1 1 Center for AI Safety, 2 Scale AI ∗Co-first Authors. ∗∗ Senior Authors. † Work conducted while at Center for AI Safety. ‡ Work conducted while at Scale AI. Refer to PDF for full list of Dataset Contributors.

01.02.2025

ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

Safety, Evaluation and Alignment, Reasoning, Oversight

Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx

10.11.2024

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

Safety, Evaluation and Alignment

Priyanshu Kumar 1, Elaine Lau 3, Saranya Vijayakumar 1, Tu Trinh 3, Scale Red Team 3, Elaine Chang 3, Vaughn Robinson 3, Sean Hendryx 3, Shuyan Zhou 1, Matt Fredrikson 1, 2, Summer Yue 3, Zifan Wang 3 1 Carnegie Mellon University, 2 GraySwan AI, 3 Scale AI

09.29.2024

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Post-Training, Science of Data

Yung-Chieh Chan∗, George Pu∗, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton ∗Denotes equal contribution. Work was done while Yung-Chieh was interning at Scale AI.

09.27.2024

Revisiting the Superficial Alignment Hypothesis

Post-Training

Mohit Raghavendra°1 , Vaskar Nath2 , Sean Hendryx2 1Georgia Institute of Technology, 2Scale AI, °Work conducted while at Scale AI

09.05.2024

Planning In Natural Language Improves LLM Search For Code Generation

Post-Training

Evan Wang 1, 2, Federico Cassano ° 3, 4, Catherine Wu °, Yunfeng Bai 1, Will Song 1, Vaskar Nath 1, Ziwen Han 1, Sean Hendryx 1, Summer Yue 1, Hugh Zhang 1 1 Scale AI, 2 California Institute of Technology, 3 Northeastern University, 4 Cursor AI, ° Work conducted while at Scale AI

08.30.2024

Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

Safety, Evaluation and Alignment, Multimodal, Science of Data

Spencer Whitehead, Jacob Phillips, Sean Hendryx

08.27.2024

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Safety, Evaluation and Alignment

Nathaniel Li 1, 2, Ziwen Han 1, Ian Steneker 1, Willow Primack 1, Riley Goodside 1, Hugh Zhang 1, Zifan Wang 1, Cristina Menghini 1, Summer Yue 1 1 Scale AI , 2 UC Berkeley

07.18.2024

Learning Goal-Conditioned Representations for Language Reward Models

Post-Training

Vaskar Nath∗†, Dylan Slack∗, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead‡, Sean Hendryx‡ ∗Equal contribution †Corresponding author: [email protected] ‡Equal senior authorship

05.01.2024

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Safety, Evaluation and Alignment

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati†, Summer Yue†

03.05.2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Safety, Evaluation and Alignment, Post-Training

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

01.22.2024

Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy

Computer Vision

Will LeVine, Benjamin Pikus, Jacob Phillips, Berk Norman, Fernando Amat Gil, Sean Hendryx

11.21.2023

A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift

Post-Training

Will LeVine, Benjamin Pikus, Anthony Chen, Sean Hendryx

10.05.2023

A Holistic Approach For Test And Evaluation Of Large Language Models

Safety, Evaluation and Alignment

Dylan Slack*, Jean Wang*, Denis Semenenko*, Kate Park, Sean Hendryx *Equal Contribution

10.04.2023

On the Performance of Multimodal Language Models

Multimodal, Post-Training

Utsav Garg, Erhan Bas

04.28.2023

Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs

Post-Training

George Pu, Anirudh Jain, Jihan Yin, Russell Kaplan

04.11.2023

Detecting and Preventing Hallucinations in Large Vision Language Models

Computer Vision

Anisha Gunjal*, Jihan Yin*, Erhan Bas† *These authors contributed equally. †Work done at Scale AI.

03.11.2023

Enabling Calibration In The Zero-shot Inference Of Large Vision-Language Models

Computer Vision

Will Levine† , Benjamin Pikus† , Pranav Raja & Fernando Amat Gil † denotes equal contribution

01.29.2023

Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing

Safety, Evaluation and Alignment

Yatong Bai, Brendon G. Anderson, Aerin Kim, Somayeh Sojoudi

03.07.2022

GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction

Computer Vision

Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga

11.16.2021

CAR – Cityscapes Attributes Recognition A Multi-category Attributes Dataset for Autonomous Vehicles

Computer Vision

Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga

11.07.2021

Natural Adversarial Objects

Computer Vision

Felix Lau, Nishant Subramani, Sasha Harrison, Aerin Kim, Elliot Branson, Rosanne Liu

10.11.2021

DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates

Safety, Evaluation and Alignment

John Pougué-Biyong*, Valentina Semenova*, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, J. Doyne Farmer *Equal contribution

07.31.2021

On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models

Computer Vision, Science of Data

Zeyad Emam1 2, Andrew Kondrich 1, Sasha Harrison 1, Felix Lau 1, Yushi Wang 1, Aerin Kim 1, Elliot Branson 1

04.20.2021

Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset

Computer Vision

Matthew Groh, Caleb Harris, Luis Soenksen, Felix Lau, Rachel Han, Aerin Kim, Arash Koochek, Omar Badri

11.27.2020

A Survey of Deep Learning Approaches for OCR and Document Understanding

Computer Vision

Nishant Subramani, Alexandre Matton, Malcolm Greaves, Adrian Lam
