SWE Atlas - Codebase QnA
Evaluating deep code comprehension and reasoning
Overview
SWE Atlas is a benchmark for evaluating AI coding agents across a spectrum of professional software engineering tasks. Rather than measuring a single skill in isolation, SWE Atlas consists of three leaderboards that target distinct and complementary capabilities:
Codebase QnA - Understand complex codebases through runtime analysis and multi-file reasoning
Test Writing - Write meaningful tests that exercise real functionality to increase code coverage
Refactoring - Restructure code to improve readability & maintainability while preserving behavior
We are releasing results for Codebase QnA, the first benchmark in the SWE Atlas suite, with additional results for Test Writing and Refactoring to come.
Codebase QnA
Codebase QnA consists of 124 tasks that target the most upstream capability in software engineering: the deep code comprehension that precedes any code change.
The agent is given access to a well-known, public repository inside a Docker container and must answer a deeply technical set of questions about how the system works. Questions require Agentic Reasoning by design - they require running the software, tracing execution across multiple files, and synthesizing findings. Simple codebase exploration is insufficient to solve these.
The benchmark consists of tasks drawn from 11 production repositories across 4 programming languages – Go, Python, C, and TypeScript. Top models achieve a 30% pass rate (at the strictest rubric threshold of 1.0), indicating substantial room for improvement.
See the full dataset here.
Methodology
Each QA task is constructed through a multi-stage human-in-the-loop pipeline.
Repository Selection. Repositories are selected from SWE-Bench Pro, and represent real-world software complexity: mail servers, terminal emulators, object storage systems, observability platforms, secret scanners, etc. These are large, actively maintained open-source codebases with non-trivial architectures. They are also contamination-resistant, using strong copyleft licenses (e.g., GPL).
Environment Construction. For each repository, engineers build a reproducible Docker image pinned to a specific commit, with all dependencies pre-installed, such that the software can be built, run, and tested.
Question Authoring. Professional software engineers and technical experts with significant coding and agentic experience write problem statements that require multi-step reasoning across the codebase. Experts first spend significant time familiarizing themselves with each repository's functionality, implementation, and edge cases before authoring questions. Questions are written in natural language and intentionally underspecified to challenge agents' autonomous exploration capabilities.
We emphasize that task prompts be written in natural language, to emulate how engineers interact with coding agents like Claude Code or Cursor, and that they are, by nature, underspecified. Agents must autonomously explore the codebase, set up the application, run it with real data, understand the flow, and prepare a detailed response.
Tasks fall into the following categories (Example prompts shortened for visual clarity):
Architecture & system design (35%): Questions about how a system's components are structured and how they interact.
Example: "When maddy is configured with a small max_tries and pointed at a non-responsive SMTP destination that times out, what exact sequence of connection attempts occurs? How long does each timeout take, and what log entries mark each retry attempt versus the final bounce decision?"
Root-cause analysis (30%): Questions that present confusing or seemingly broken behavior and ask the agent to determine why.
Example: "I'm sending probes to different gateways with Scapy's sr1, but responses from one gateway get matched to probes sent to another. I've verified with tcpdump that the packets on the wire are correct - so why is Scapy pairing them wrong?"
Code Onboarding (23%): Questions a new engineer would ask when getting oriented in an unfamiliar codebase.
Example: "I am joining a team that relies heavily on Scapy for custom packet manipulation and I need to understand what the runtime environment actually looks like. When you construct a basic ICMP ping packet, what's the actual structure of that packet object? I'm trying to understand whether Scapy creates a single composite object or something else - what layer types are involved, and how are they related?"
Security (9%): Questions about security properties, attack surfaces, or vulnerability patterns.
Example: "If someone submits a TruffleHog detector configuration pointing to an internal address or a cloud metadata endpoint, what security boundaries actually exist?"
API & library integration (3%): Questions about how to use a library's interfaces and understand their runtime behavior.
Example: "If I create an IP/TCP packet without setting checksums and call bytes(), then tweak a payload byte and call bytes() again - do I get fresh checksums, or does Scapy hand back the same cached bytes from before?"
We illustrate an example task from the grafana/k6 repository, drawn from the Root-cause analysis category:
I'm debugging some strange HTTP timing metrics in my k6 load test that I think might be a measurement bug or race condition. When I make multiple sequential requests to the same endpoint, the first request shows reasonable values for connecting time and TLS handshaking, but subsequent requests show exactly 0 for both of these metrics even though I can see network activity happening. What's weirder is that sometimes the "blocked" timing shows massive values like 500ms when other times it's near zero for identical requests.
I also noticed that on one of our Windows test machines, occasionally ALL the timing metrics return 0 for random requests, which definitely seems like a race in the measurement code. The strangest part is when I look at the internal tracer state, sometimes a connection is flagged as "not reused" but the connect timestamps show the same value as the got connection timestamp, which should be impossible if a real TCP handshake occurred. I printed the tracer hook calls and saw that ConnectStart and ConnectDone sometimes get called multiple times for a single request, which looks like a double counting bug.
At this point I'm not sure if the whole HTTP tracer component is fundamentally broken or if I'm just misunderstanding something. I need to know if I can trust these timing values at all for my performance analysis, or if I should report some of these issues upstream. Adding investigation scripts is fine, but remove them after you're done.
Rubric Construction. For each question, human experts define a structured rubric of evaluation criteria (average: 12.3 per task). Each criterion is a specific, verifiable factual claim that a correct answer must contain. Rubrics follow standard design principles: specific (little or no room for interpretation), atomic (testing one distinct aspect), and self-contained (gradable without external knowledge).
Here is a subset of the rubric items for the prompt shared above:
- Explains that zero connecting/TLS time on subsequent requests indicates connection reuse (connection pooling, no handshake needed for reused connections)
- Explains that variable "blocked" timing reflects connection pool wait time (waiting for available connection, pool contention)
- States that the Windows issue is not a race condition that needs fixing (OS timer limitation causes identical timestamps)
- Explains that multiple ConnectStart/ConnectDone calls are due to dual-stack/Happy Eyeballs dialing (IPv4/IPv6 simultaneous connection attempts)
- Explains that the Reused=false with matching timestamps scenario is a documented Go stdlib HTTP/2 quirk (abandoned connections get pooled and reused)
Quality Assurance. Tasks undergo a rigorous multi-stage pipeline to ensure high quality. Throughout the process, experts are supplemented with LLM-based agentic evaluators that monitor data quality in the same task environment the expert is working in. Each task is then human-reviewed twice more later in the pipeline, as well as by Scale's quality assurance team.
Post-creation, all tasks undergo a human consensus review of their rubrics. Three experts review each rubric and flag items that inaccurately evaluate the question's requirements or are overly prescriptive. We retain only the rubric items that a majority of experts vote to keep. Tasks are then manually re-reviewed, and those with insufficient evaluation coverage are filtered out.
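The majority-vote retention rule is simple to state precisely. The sketch below is an illustration of the rule as described, not the benchmark's actual tooling; the function name and data layout are hypothetical:

```python
def retain_rubric_items(votes_per_item):
    """Keep a rubric item only if a majority of expert reviewers vote to keep it.

    votes_per_item: one inner list per rubric item, one boolean per expert
    (True = keep). With three reviewers per item, an item survives when at
    least two of the three vote to keep it.
    """
    return [
        votes
        for votes in votes_per_item
        if sum(votes) > len(votes) / 2  # strict majority
    ]

# Hypothetical example: three experts review three rubric items.
votes = [
    [True, True, False],   # kept    (2 of 3 vote keep)
    [True, False, False],  # dropped (1 of 3)
    [True, True, True],    # kept    (3 of 3)
]
print(len(retain_rubric_items(votes)))  # 2 items retained
```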
Eval Metric: Task Resolve Rate
During evaluation, the agent operates inside a sandboxed Docker container with the target repository mounted. The agent has access to standard shell tools (bash, grep, find, etc.) and can build and run the software. The agent explores the repository, runs experiments, and produces a final answer.
Evaluation is performed by an LLM judge (Claude Opus 4.5) that scores the agent's answer against each rubric criterion independently. Each criterion receives a binary score (met or not met) and is then aggregated.
The primary metric is the Task Resolve Rate: the percentage of tasks for which the agent's answer is comprehensive, i.e. it passes every item in the task-specific rubric and scores 1.0.
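The aggregation is deliberately strict: a single missed criterion fails the whole task. A minimal sketch of the metric, assuming binary per-criterion judge scores (illustrative only, not the benchmark's actual harness code):

```python
def task_resolved(criterion_scores):
    """A task counts as resolved only when every rubric criterion is met."""
    return all(criterion_scores)

def resolve_rate(tasks):
    """Percentage of tasks whose answers pass all of their rubric items."""
    resolved = sum(task_resolved(scores) for scores in tasks)
    return 100.0 * resolved / len(tasks)

# Hypothetical judge outputs for three tasks (1 = criterion met, 0 = not met).
tasks = [
    [1, 1, 1],     # all criteria met -> resolved
    [1, 0, 1, 1],  # one criterion missed -> the whole task fails
    [1, 1],        # resolved
]
print(resolve_rate(tasks))  # two of three tasks resolved, i.e. ~66.7
```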
Agents are instructed to avoid modifying source code files and to clean up any temporary scripts created. A programmatic check automatically fails any task that contains code changes.
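One plausible shape for such a programmatic check (hypothetical; the benchmark's actual implementation is not specified) is to parse `git status --porcelain` output from the task repository and fail the run if the working tree is not pristine:

```python
def has_code_changes(porcelain_output):
    """Return True if `git status --porcelain` output shows any change:
    modified/deleted tracked files, or untracked leftovers (e.g. temporary
    investigation scripts the agent failed to clean up).

    Each porcelain line is a two-character status code plus a path,
    e.g. ' M src/main.go' for a modified file or '?? probe.sh' for untracked.
    """
    for line in porcelain_output.splitlines():
        if line.strip():
            return True  # any status line means the tree is not clean
    return False

# Hypothetical outputs: a clean tree passes, a dirty one auto-fails the task.
print(has_code_changes(""))                              # False -> check passes
print(has_code_changes(" M lib/tracer.go\n?? probe.sh"))  # True  -> auto-fail
```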
Results
We ran a suite of frontier closed and open coding models on the dataset using the Mini-SWE-Agent harness, which allows the agent to run bash commands, inside a sandboxed container on Modal spun up through Harbor. For the top frontier models, we also ran them on their native scaffolds, Claude Code and Codex CLI.
We observe that even the strongest frontier models (which report >80% on SWE-Bench) score around 35% on the benchmark, highlighting the challenging nature of these tasks and the remaining capability gap in deep codebase comprehension. GPT-5.4 xHigh, GPT-5.3 Codex, and Claude Opus 4.6 top the leaderboard, while GLM-5 was the leading open model on the benchmark.
We observed improved performance on native scaffolds for frontier models, indicating that coding models are most effective when used in combination with the tools they are trained on.
Beyond the raw resolve rate, we see several interesting patterns in the behavior of frontier models.

The top 3 models have much longer solution trajectories, with the top model, GPT-5.4, running hundreds of commands.
GPT-5.4 xHigh searches and explores the codebase aggressively, performing more than twice as many operations (file operations, searches, and code executions) as Opus 4.6.

Top models consistently produce longer answers (>1200 words). There is a positive correlation between a model’s answer length and its resolution rate.
| Model name | Thinking settings | Temperature | Max input tokens |
|---|---|---|---|
| Claude Sonnet 4.6 | High | 1.0 | 1,000,000 |
| Claude Opus 4.6 | High | 1.0 | 1,000,000 |
| GPT-5.4 (xHigh) | xHigh | 1.0 | 400,000 |
| GPT-5.3 Codex | xHigh | 1.0 | 400,000 |
| Gemini 3.1 Pro (Preview) | High | 1.0 | 1,000,000 |
| Gemini 3 Flash (Preview) | High | 1.0 | 1,000,000 |
| MiniMax M2.5 | Default (High) | 1.0 | 200,000 |
| Kimi K2.5 | Default (High) | 1.0 | 200,000 |
| GLM-5 | Default (High) | 0.7 | 128,000 |
Performance Comparison
| Model (scaffold) | Resolve Rate (%) |
|---|---|
| GPT-5.4 xHigh (Codex) | 40.80 ± 5.10 |
| GPT-5.4 xHigh (Mini-SWE-Agent) | 36.30 ± 4.90 |
| Opus 4.6 (Claude Code) | 33.30 ± 5.00 |
| GPT-5.3 (Codex) | 32.60 ± 4.90 |
| Sonnet 4.6 (Claude Code) | 31.20 ± 5.00 |
| Opus 4.6 (Mini-SWE-Agent) | 30.00 ± 4.90 |
| Muse Spark | 24.20 ± 4.60 |
| GLM-5 (Mini-SWE-Agent) | 20.50 ± 4.50 |
| Gemini 3.1 Pro (Mini-SWE-Agent) | 13.50 ± 3.90 |
| Kimi K2.5 (Mini-SWE-Agent) | 13.10 ± 4.10 |
| MiniMax M2.5 (Mini-SWE-Agent) | 10.30 ± 3.50 |
| Gemini 3 Flash (Mini-SWE-Agent) | 8.20 ± 3.30 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
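The Rank (UB) definition above can be computed directly from each model's confidence interval. The sketch below uses (mean, half-width) pairs in the format of the leaderboard scores; the function itself is illustrative, not the benchmark's actual code:

```python
def rank_ub(scores, i):
    """Rank (UB) for model i: 1 + the number of models whose lower CI bound
    exceeds model i's upper CI bound.

    scores: list of (mean, half_width) pairs, e.g. (40.8, 5.1) for 40.80±5.10.
    """
    mean_i, hw_i = scores[i]
    upper_i = mean_i + hw_i
    return 1 + sum(
        1
        for j, (mean_j, hw_j) in enumerate(scores)
        if j != i and mean_j - hw_j > upper_i  # model j is clearly better
    )

# Three scores in the leaderboard's (mean, ± half-width) format.
scores = [(40.8, 5.1), (36.3, 4.9), (13.5, 3.9)]
print(rank_ub(scores, 0))  # 1: no lower bound exceeds 45.9
print(rank_ub(scores, 2))  # 3: both other lower bounds exceed 17.4
```

Using the upper bound this way is conservative: a model is only out-ranked by models that beat it even under the least favorable reading of the confidence intervals.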
^ These models were additionally evaluated on their native scaffolds through Harbor.
* These models use a slightly different tool bundle (registry, defaults, search, edit_replace, submit) instead of the standard SWE-Agent tools used by all other models (registry, edit_anthropic, review_on_submit_m), because we observed significant performance degradation with the latter. In addition, the Gemini API was given a 900s timeout instead of 300s because API stability issues led to failed runs.