Remote Labor Index - Scale Labs

Remote Labor Index: Measuring AI Automation of Remote Work

Mantas Mazeika*1, Alice Gatti*1, Cristina Menghini*†, Udari Madhushani Sehwag*2, Shivam Singhal*†, Yury Orlovskiy*1, Steven Basart1, Manasi Sharma2, Denis Peskoff2, Elaine Lau2, Sumana Basu2, Jaehyuk Lim1, Lachlan Carroll1, Alice Blair1, Vinaya Sivakumar1, Brad Kenstler2, Yuntao Ma†, Julian Michael†, Xiaoke Li1, Oliver Ingebretsen1, Aditya Mehta1, Jean Mottola1, John Teichmann‡, Kevin Yu‡, Zaina Shaik‡, Adam Khoja1, Richard Ren1, Jason Hausenloy1, Long Phan1, Connor Smith1, Ye Htet2, Ankit Aich2, Tahseen Rabbani2, Vivswan Shah†, Andriy Novykov1, Felix Binder†, Kirill Chugunov2, Luis Ramirez2, Matias Geralnik2, Hernán Mesura2, Dean Lee2, Ed-Yeremai Hernandez Cardona2, Annette Diamond2, Summer Yue**†, Alexandr Wang**†, Bing Liu**2, Ernesto Hernandez**2, Dan Hendrycks**1

1 Center for AI Safety, 2 Scale AI
* Equal contribution, ** Senior authors
† Work done while at Scale AI, ‡ Work done while at CAIS

Scale AI and the Center for AI Safety are proud to introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable remote-work tasks designed to evaluate end-to-end agent performance in practical settings.

The potential for AIs to automate human labor is a topic of significant interest and concern. While AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, it remains unclear how these gains translate into real economic value and actual automation. To address this gap, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable remote-work tasks designed to evaluate end-to-end agent performance in practical settings. Across evaluated frontier AI agent frameworks, performance sits near the floor, with a maximum automation rate of 2.5% on RLI tasks. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking progress and enabling stakeholders to proactively navigate the onset and risks of AI-driven labor automation.

Check out the leaderboard: https://scale.com/leaderboard/rli
