Scale AI and the Center for AI Safety are proud to introduce the Remote Labor Index (RLI), a benchmark of real-world, economically valuable remote-work tasks.
The potential for AIs to automate human labor is a topic of significant interest and concern. While AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, it remains unclear how these gains translate into real economic value and actual automation. To address this gap, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable remote-work tasks designed to evaluate end-to-end agent performance in practical settings. Across evaluated frontier AI agent frameworks, performance sits near the floor, with a maximum automation rate of 2.5% on RLI tasks. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking progress and enabling stakeholders to proactively navigate the onset and risks of AI-driven labor automation.
Check out the leaderboard: https://scale.com/leaderboard/rli