Scale Labs

testing the limits of AI.

Benchmarks for frontier, agentic, and safety capabilities

Benchmarks: 20+

Including benchmarks on agentic coding, frontier reasoning, and safety alignment.

Models evaluated: 100+

From leading AI labs including OpenAI, Anthropic, Google, Meta, and open-source contributors.
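Each leaderboard below reports a score as a mean with a ± uncertainty value. The exact procedure is not stated on this page; the sketch below is one plausible reading only, assuming the interval is a 95% bootstrap confidence interval over per-task scores (an assumption, not the published methodology).

```python
# Hedged sketch: one way a "score ± interval" could be computed from per-task results.
# Assumes a 95% bootstrap CI; the leaderboard's actual method may differ.
import random
import statistics

def score_with_interval(per_task_scores: list[float], n_boot: int = 10_000, seed: int = 0):
    """Return (mean, half_width) so the result can be displayed as mean ± half_width."""
    rng = random.Random(seed)
    mean = statistics.fmean(per_task_scores)
    boot_means = sorted(
        statistics.fmean(rng.choices(per_task_scores, k=len(per_task_scores)))
        for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot)]
    return mean, max(mean - lo, hi - mean)

# Example: pass/fail outcomes (scored 0 or 100) on a hypothetical 50-task benchmark.
outcomes = [100.0] * 18 + [0.0] * 32
mean, half = score_with_interval(outcomes)
print(f"{mean:.2f}±{half:.2f}")
```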

SWE Atlas - Codebase QnA

Evaluating deep code comprehension and reasoning

1. gpt-5.4-codex (xHigh) (Codex CLI): 35.48±8.70
1. claude-opus-4.6 Thinking (Claude Code Harness)^: 31.50±8.62
1. gpt-5.2-2025-12-11 (High) (SWE-Agent): 29.03±8.53

View Full Ranking →

MCP Atlas

Evaluating real-world tool use through the Model Context Protocol (MCP)

1. claude-opus-4-5-20251101: 62.30±1.76
1. gpt-5.2-2025-12-11 (NEW): 60.57±1.62
3. gemini-3-flash-preview (NEW): 57.40±1.48

View Full Ranking →

SWE-Bench Pro (Public Dataset)

Evaluating long-horizon software engineering tasks in public open source repositories

1. claude-opus-4-5-20251101: 45.89±3.60
1. claude-4-5-Sonnet: 43.60±3.60
1. gemini-3-pro-preview: 43.30±3.60

View Full Ranking →

SWE-Bench Pro (Private Dataset)

Evaluating long-horizon software engineering tasks in commercial-grade private repositories

1. gpt-5.2-2025-12-11 (NEW): 23.81±5.09
1. claude-opus-4-5-20251101 (NEW): 23.44±5.07
1. gemini-3-pro-preview (NEW): 17.95±4.78

View Full Ranking →

SciPredict

Forecasting scientific experiment outcomes

1. gemini-3-pro-preview: 25.27±1.92
1. claude-opus-4-5-20251101: 23.05±0.51
1. claude-opus-4-1-20250805: 22.22±1.48

View Full Ranking →

Humanity's Last Exam

Challenging LLMs at the frontier of human knowledge

1. 37.52±1.90
1. 34.44±1.86
2. 31.64±1.82

View Full Ranking →

Humanity's Last Exam (Text Only)

Challenging LLMs at the frontier of human knowledge

1. 37.72±2.04
1. 36.24±2.03
2. 33.32±1.99

View Full Ranking →

AudioMultiChallenge

Evaluating spoken dialogue systems in multi-turn interaction

1. gemini-3-pro-preview (Thinking)*: 54.65±4.57
1. gemini-2.5-pro (Thinking)*: 46.90±4.58
2. gemini-2.5-flash (Thinking)*: 40.04±4.50

View Full Ranking →

AudioMultiChallenge - Audio Output

Evaluating spoken dialogue systems in multi-turn interaction

1. gpt-realtime-1.5 (NEW): 34.73±4.38
2. Qwen3-Omni-30B-A3B-Instruct: 24.34±3.95
2. gpt-4o-audio-preview-2025-06-03: 23.23±3.88

View Full Ranking →

AudioMultiChallenge - Text Output

Evaluating spoken dialogue systems in multi-turn interaction

1. gemini-3-pro-preview (Thinking): 54.65±4.57
1. gemini-2.5-pro (Thinking): 46.90±4.58
2. gemini-2.5-flash (Thinking): 40.04±4.50

View Full Ranking →

Professional Reasoning Benchmark - Finance

Evaluating Professional Reasoning in Finance

1. claude-opus-4-6 (Non-Thinking) (NEW): 53.28±0.18
2. gpt-5: 51.32±0.17
2. gpt-5-pro: 51.06±0.59

View Full Ranking →

Professional Reasoning Benchmark - Legal

Evaluating Professional Reasoning in Legal Practice

1. claude-opus-4-6 (Non-Thinking) (NEW): 52.27±0.66
2. gpt-5-pro: 49.89±0.36
2. o3-pro: 49.67±0.50

View Full Ranking →

Remote Labor Index (RLI)

Evaluating AI agents' ability to perform real-world, economically valuable remote work

1. claude-opus-4-6 (CoWork) (NEW): 4.17±0.00
2. claude-opus-4-5-20251101-thinking: 3.75±0.00
3. Manus_1.6 (Max) (NEW): 2.92±0.00

View Full Ranking →

PropensityBench

Simulating real-world pressure to choose between safe and harmful behavior

1. o3-2025-04-16: 10.50±0.60
2. claude-sonnet-4-20250514: 12.20±0.20
3. o4-mini-2025-04-16: 15.80±0.40

View Full Ranking →

VisualToolBench (VTB)

Evaluating how LLMs can dynamically interact with and reason about visual information

1. gemini-3-pro-preview (NEW): 26.85±0.54
1. gpt-5-2025-08-07-thinking: 18.68±0.25
2. gpt-5-2025-08-07: 16.96±0.06

View Full Ranking →

MultiNRC

Multilingual Native Reasoning Evaluation Benchmark for LLMs

1. 65.20±1.24
2. 58.96±2.97
2. 57.06±2.99

View Full Ranking →

MultiChallenge

Assessing models across diverse, interdisciplinary challenges

1. gemini-3-pro-preview: 65.67±2.20
1. gpt-5.1-2025-11-13-thinking: 63.41±2.11
1. gpt-5-thinking: 63.19±1.63

View Full Ranking →

Fortress

Frontier Risk Evaluation for National Security and Public Safety

1. 8.24±1.93
1. 9.63±2.11
2. 12.80±2.36

View Full Ranking →

MASK

Evaluating model honesty when pressured to lie

1. 96.28±0.41
1. 96.13±0.57
1. Claude Sonnet 4 (Thinking): 95.33±2.29

View Full Ranking →

EnigmaEval

Evaluating model performance on complex, multi-step reasoning tasks

1. 18.75±2.22
1. 18.24±2.20
3. 13.09±1.92

View Full Ranking →

VISTA

Vision-Language Understanding benchmark for multimodal models

1. Gemini 2.5 Pro Experimental (March 2025): 54.65±1.46
1. gemini-2.5-pro-preview-06-05: 54.63±0.55
2. gpt-5-pro-2025-10-06: 52.39±1.07

View Full Ranking →

TutorBench

Evaluating model performance on common tutoring tasks for high school and AP-level subjects

1. gemini-2.5-pro-preview-06-05: 55.65±1.11
1. gpt-5-2025-08-07: 55.33±1.02
1. o3-pro-2025-06-10: 54.62±1.02

View Full Ranking →

Frontier AI Model Evaluations & Benchmarks

We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, while continuously evaluating the latest frontier models.

Scaling with Human Expertise

Humans design complex evaluations and define precise criteria to assess models, while LLMs scale those evaluations, ensuring efficiency and alignment with human judgment.
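As an illustration of this human-criteria, LLM-graded pattern, here is a minimal sketch of a rubric-based judging loop. The rubric fields and the `call_judge_model` function are hypothetical placeholders, not Scale's actual evaluation harness.

```python
# Illustrative sketch only: grade one model response against human-written criteria
# using an LLM judge. `call_judge_model` is a hypothetical stand-in for a real LLM API.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # human-written, e.g. "Cites the correct function in the repo"
    weight: float      # relative importance assigned by the human evaluation designer

def call_judge_model(prompt: str) -> str:
    """Hypothetical judge call; replace with a real LLM API in practice."""
    raise NotImplementedError

def grade_response(task_prompt: str, model_response: str, rubric: list[Criterion]) -> float:
    """Return a 0-100 score: weighted fraction of criteria the judge marks as satisfied."""
    earned, total = 0.0, 0.0
    for criterion in rubric:
        judge_prompt = (
            "You are grading a model response against one criterion.\n"
            f"Task: {task_prompt}\n"
            f"Response: {model_response}\n"
            f"Criterion: {criterion.description}\n"
            "Answer strictly YES or NO."
        )
        verdict = call_judge_model(judge_prompt).strip().upper()
        earned += criterion.weight if verdict.startswith("YES") else 0.0
        total += criterion.weight
    return 100.0 * earned / total if total else 0.0
```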

Robust Datasets for Reliable AI Benchmarks

Our leaderboards are built on carefully curated evaluation sets, combining private datasets to prevent overfitting and open-source datasets for broad benchmarking and comparability.

Evaluate Your Model

If you'd like to add your model to this leaderboard or a future version, please contact [email protected]. To ensure leaderboard integrity, a model can be featured only the first time its organization encounters the prompts.


Copyright 2026 Scale Inc. All rights reserved.
