Scale Labs
Evaluation and Alignment | Agents | 5/7/2026

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

Mohit Raghavendra, Soham Dan, Miguel Romero Calvo, Yannis Yiming He, Johannes Baptist Mols, Gautam Anand, Cole McCollum, Edgar Arakelyan, Vijay Bharadwaj, Andrew Park, Jeff Da, Mohammad Hossein Rezaei, Bing Liu, Brad Kenstler, Yunzhong He

A benchmark suite evaluating coding agents on three professional software engineering workflows beyond bug-fixing: Codebase Q&A, Test Writing, and Refactoring.

We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment, so it measures not only functional correctness but also software engineering quality: test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. Even so, they consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.
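To make the idea of combining programmatic checks with rubric-based assessment concrete, here is a minimal sketch in Python. It is not the SWE Atlas implementation: the `RubricItem` structure, the weighted-average scoring, the 0.8 threshold, and the rule that a task is resolved only when both signals pass are all assumptions chosen for illustration, and the actual aggregation may differ.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float
    satisfied: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Return the weighted fraction of satisfied rubric items, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    return sum(item.weight for item in items if item.satisfied) / total


def evaluate_task(checks_passed: list[bool], items: list[RubricItem],
                  rubric_threshold: float = 0.8) -> dict:
    """Combine programmatic checks with a rubric score.

    Assumption: a task counts as resolved only if every programmatic check
    passes AND the rubric score meets the threshold.
    """
    programmatic_ok = all(checks_passed)
    score = rubric_score(items)
    return {
        "programmatic_ok": programmatic_ok,
        "rubric_score": score,
        "resolved": programmatic_ok and score >= rubric_threshold,
    }


# Illustrative, made-up inputs: both programmatic checks pass,
# but one rubric item is not satisfied.
result = evaluate_task(
    checks_passed=[True, True],
    items=[
        RubricItem("Covers the empty-input edge case", 2.0, True),
        RubricItem("No dead code left behind", 1.0, False),
        RubricItem("Reuses the existing helper", 1.0, True),
    ],
)
print(result)
```

In this made-up example the programmatic checks pass but the rubric score falls below the assumed threshold, so the task is not counted as resolved. That is the kind of case where a rubric captures an engineering-quality gap that functional tests alone would miss.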
