Scale Labs

Showdown Leaderboard - LLMs

Real people. Real conversations. Real rankings.

Showdown ranks AI models based on how they perform in real-world use -- not synthetic tests or lab settings. Votes are blind, optional, and organic, so rankings reflect authentic preferences.

Prompts compared

Real conversation prompts compared across models through pairwise votes.
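As a rough illustration of how pairwise votes can be turned into per-pair battle counts and win rates, here is a minimal sketch. The vote records and model names are hypothetical placeholders, not Showdown data, and the sketch assumes every vote names a single winner (no ties). The page does not describe its actual aggregation pipeline.

```python
from collections import defaultdict

# Hypothetical vote records as (winner, loser); NOT real Showdown data.
VOTES = [
    ("model_a", "model_b"),
    ("model_a", "model_b"),
    ("model_b", "model_a"),
    ("model_a", "model_c"),
    ("model_c", "model_b"),
]

def tally(votes):
    """Count wins per ordered pair and battles per unordered pair."""
    wins = defaultdict(int)     # (winner, loser) -> number of wins
    battles = defaultdict(int)  # frozenset({a, b}) -> number of battles
    for winner, loser in votes:
        wins[(winner, loser)] += 1
        battles[frozenset((winner, loser))] += 1
    return wins, battles

def win_rate(wins, battles, a, b):
    """Fraction of a-vs-b battles won by a, or None if they never met."""
    n = battles.get(frozenset((a, b)), 0)
    if n == 0:
        return None
    return wins.get((a, b), 0) / n

wins, battles = tally(VOTES)
print(win_rate(wins, battles, "model_a", "model_b"))
```

Returning `None` for pairs that never met keeps "no data" distinct from a genuine 0% win rate.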

Active users

From 80+ countries and 70+ languages, spanning all backgrounds and professions.

Copyright 2026 Scale Inc. All rights reserved.

Leaderboard - LLMs

* This model's API does not consistently return Markdown-formatted responses. Since raw outputs are used in head-to-head comparisons, this may affect its ranking.

[Figure: Performance Comparison Across Language Models. Panels: Win Rate vs. Each Model; Battle Count vs. Each Model; Confidence Intervals; Average Win Rate; Prompt Distribution.]

Style Control

Rank  Model                            Count  Score    CI (-/+)
1     gemini-3-pro-preview             2,287  1051.89  -10.41 / +12.99
1     gemini-3-flash                   2,247  1047.08   -9.94 / +10.94
3     qwen3-omni                         886  1000.00  -14.96 / +16.99
3     gpt-4o-audio-preview-2025-06-03  2,726   997.86  -10.65 /  +9.71
5     voxtral-small-24b-2507             775   918.49  -17.35 / +13.20
5     gemma3n                            939   892.85  -10.86 / +18.03
7     gpt-realtime                     2,800   854.75   -8.34 / +10.79
8     phi-4-multimodal-instruct          680   730.03  -18.12 / +18.05
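
The tied ranks in the rows above (1, 1, 3, 3, 5, 5, 7, 8) are consistent with one common convention: a model's rank is one plus the number of models whose lower interval bound lies above that model's upper bound. The page itself does not state how ranks are derived, so the sketch below is an assumption that happens to reproduce the published ranks, not a description of the site's method. Scores and interval offsets are copied from the leaderboard rows; the function name `interval_ranks` is illustrative.

```python
# (model, score, minus, plus) copied from the leaderboard rows above.
ROWS = [
    ("gemini-3-pro-preview", 1051.89, 10.41, 12.99),
    ("gemini-3-flash", 1047.08, 9.94, 10.94),
    ("qwen3-omni", 1000.00, 14.96, 16.99),
    ("gpt-4o-audio-preview-2025-06-03", 997.86, 10.65, 9.71),
    ("voxtral-small-24b-2507", 918.49, 17.35, 13.20),
    ("gemma3n", 892.85, 10.86, 18.03),
    ("gpt-realtime", 854.75, 8.34, 10.79),
    ("phi-4-multimodal-instruct", 730.03, 18.12, 18.05),
]

def interval_ranks(rows):
    """Rank = 1 + number of models whose lower bound exceeds this upper bound.

    This rule is an assumption; the source page does not document it.
    """
    bounds = {name: (score - minus, score + plus)
              for name, score, minus, plus in rows}
    ranks = {}
    for name, (_, upper) in bounds.items():
        clearly_better = sum(
            1 for other, (lower, _) in bounds.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks

ranks = interval_ranks(ROWS)
for name, *_ in ROWS:
    print(ranks[name], name)
```

Under this rule, two models share a rank whenever their intervals overlap and the same set of models sits clearly above both, which is why gemini-3-pro-preview and gemini-3-flash both appear at rank 1.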

[Figure: Voice Model Performance Comparison (Overall; the category filter applies to rankings only). Panels: Win Rate vs. Each Model; Battle Count vs. Each Model; Confidence Intervals; Average Win Rate; Prompt Distribution.]