Scale Labs
Research · March 20, 2026

Voice Showdown: An In-the-Wild Preference Arena for Voice AI

By Advait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki

TL;DR

  • Voice Showdown is the first global preference arena for voice AI, evaluating 11 frontier models through blind comparisons embedded in real user conversations across 60+ languages and diverse acoustic environments.
  • Gemini 3 Pro and Flash lead Dictate (speech-in, text-out), while Gemini 2.5 Flash Audio leads S2S baseline rankings.
  • Multilingual performance is a key differentiator: top models lead across a diverse set of languages, while models such as GPT Realtime 1.5 frequently mismatch response languages due to upstream understanding errors on short, noisy, in-the-wild prompts.
  • User feedback reveals distinct model weaknesses: Qwen 3 Omni’s losses stem almost entirely from speech generation; GPT Realtime 1.5’s are dominated by audio understanding (51% of its failure reports); and Grok Voice’s are split evenly across understanding, content quality, and speech output.
  • Battles served on early turns are dominated by audio-understanding losses, while battles deeper in conversation (Turn 11+) shift toward content-quality failures. Short utterances surface comprehension gaps and longer prompts expose reasoning limitations.

1. Introduction

We introduce Voice Showdown, an industry-first preference arena for audio-native AI models. Rankings are determined by real users on ChatLab, Scale’s model-agnostic chat platform. The platform spans 60+ languages and diverse acoustic features including accents, dialects, background noise, mid-utterance repairs and varying recording environments. Unlike static benchmarks or controlled user studies, Voice Showdown captures how people actually interact with audio-native AI within their personal daily conversations.

Voice Showdown currently consists of two leaderboards:

  • Dictate (Speech In, Text Out): Users speak a prompt and receive two text responses side by side.
  • S2S (Speech-to-Speech): Users speak and listen to two spoken responses, then indicate why they disliked the losing model across three diagnostic axes.

We evaluate 11 frontier voice models across 52 model-voice pairs as of March 18, 2026.

UI for ChatLab and Voice Showdown

2. Methodology

Our methodology largely extends that of the Text Showdown, with modality-specific additions described below.

2.1 Data Collection

Our data collection follows an in-situ approach where users converse freely with voice models on ChatLab for their day-to-day model usage. On average, we serve SxS battles on < 5% of all voice prompts on ChatLab, ensuring they reflect genuine user queries or tasks users are actively working through.

We additionally ensure that SxS battles are not served on short utterances (under 5 seconds) and filter out low-complexity filler prompts such as greetings and acknowledgements (“okay”, “got it”, “thanks”).
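As a rough sketch of these eligibility filters: the 5-second cutoff and the filler examples come from the text above, while the helper name, regex, and argument shapes are illustrative assumptions.

```python
# Sketch of the battle-eligibility gate described above. The 5-second
# threshold and filler examples are from the text; the function name,
# regex, and fields are illustrative assumptions, not the production code.
import re

MIN_DURATION_S = 5.0
FILLER = re.compile(r"^(okay|ok|got it|thanks|thank you|yes|no|hi|hello)[.!]?$",
                    re.IGNORECASE)

def eligible_for_battle(duration_s: float, transcript: str) -> bool:
    """Return True if a voice prompt may be served a side-by-side battle."""
    if duration_s < MIN_DURATION_S:
        return False          # too short to meaningfully compare models
    if FILLER.match(transcript.strip()):
        return False          # low-complexity filler or acknowledgement
    return True
```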

Voting. Each battle yields a rating h ∈ {0, 0.5, 1} indicating preference for m₁, a tie (Both Good or Both Bad), or preference for m₂. For S2S, users must listen to at least 3 seconds from both models before voting.

After voting, S2S users indicate why they disliked the losing model via a multi-select from three categories: Model misheard what I said (Understanding), Model response was insufficient (Content Quality), and Model sounded worse (Speech Generation). Both Dictate and S2S users may also provide a free-text justification for their preferences. This supplementary feedback does not enter our Elo computation, but is valuable for identifying loss areas unique to each model.

Data Profile

The distribution of prompts in Voice Showdown reflects ChatLab’s global user base, with 60+ languages represented across 6 continents. English accounts for 65% of battles, with over a third in non-English languages including Spanish, Arabic, Japanese, Portuguese, Hindi, and French.

Our diverse set of data enables us to draw insights on large-scale multilingual model capability, which can be further specified to regional locales.

Furthermore, since battles occur in-situ within users’ daily conversations, they naturally span a wide range of conversational turns, from single turn prompts that gauge first impressions to extended multi-turn dialogues where models must maintain coherence over many exchanges.

Over half of battles occur beyond Turn 1, with some occurring past 20 turns over multiple chat sessions, giving us a unique window into how models hold up in sustained, multi-turn dialogue (see Section 3.3).

Voice Showdown prompts are primarily conversational. Chitchat (40%) and Open QA (33%) account for roughly three-quarters of all prompts, followed by Brainstorming (8%). Technical tasks each constitute less than 3%, reflecting the organic nature of voice interaction compared to text, which skews more heavily toward Coding and Reasoning.

This further emphasizes the need for a human preference-based arena for voice AI models, as most interactions do not include requests with objective, verifiable answers.

Median prompt duration is 11 seconds across both leaderboards, with 75% of prompts under 20 seconds and a long tail past 40 seconds which typically consists of detailed, scenario-based or context-heavy requests.

2.2 Battle Sampling

Each SxS battle in Voice Showdown consists of two models: the user’s currently selected model (denoted as in-flow) and a sampled opponent. This design simulates a model-switch scenario, testing whether an alternative model produces a better response to the same prompt within the user’s organic conversation.

We follow the same active sampling strategy as Text Showdown, where candidate pairs are first enumerated and sampling probabilities are updated dynamically to prioritize under-evaluated matchups with high win-rate uncertainty.
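One way to realize the active sampling idea above is to weight each matchup by the Bernoulli variance of its empirical win rate, discounted by how often it has already been served. This is an illustrative sketch, not the exact Text Showdown strategy; all names here are assumptions.

```python
# Illustrative uncertainty-weighted matchup sampling. Weight grows with
# win-rate uncertainty and shrinks with battle count, so rare, uncertain
# matchups are served more often. Names and constants are assumptions.
import math
import random

def matchup_weight(wins: int, battles: int) -> float:
    if battles == 0:
        return 1.0                        # maximally under-evaluated
    p = wins / battles                    # empirical win rate
    variance = p * (1 - p) + 1e-3         # win-rate uncertainty (+ floor)
    return variance / math.sqrt(battles)  # discount well-sampled pairs

def sample_matchup(stats: dict) -> tuple:
    """stats: (model_a, model_b) -> (wins_for_a, total_battles)."""
    pairs = list(stats)
    weights = [matchup_weight(w, n) for (w, n) in stats.values()]
    return random.choices(pairs, weights=weights, k=1)[0]
```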

Ensuring blind voting. During S2S conversations, a user can identify their in-flow model by its voice, breaking battle anonymity. To prevent this, battle candidates are defined as model-voice pairs (m, v) and the candidate pool excludes the user’s currently selected voice, ensuring the in-flow model is always heard with a new voice:

C = \{ ((m_{\text{in}}, v_1), (m, v_2)) : v_1 \in V(m_{\text{in}}) \setminus \{v_{\text{in}}\},\ m \in M_S \setminus \{m_{\text{in}}\},\ v_2 \in V(m) \}

Controlling for voice gender. Output voice gender can influence preference independently of model quality. Since this is determined by which voice is sampled rather than the user’s prompt, we control for it programmatically in S2S battles by restricting candidate battle pairs to same-gender pairings via a predefined mapping G : V → {masc., fem.}:

C = \{ ((m_{\text{in}}, v_1), (m, v_2)) : v_1 \in V(m_{\text{in}}) \setminus \{v_{\text{in}}\},\ m \in M_S \setminus \{m_{\text{in}}\},\ v_2 \in V(m),\ G(v_1) = G(v_2) \}
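The two constraints above (re-voicing the in-flow model and matching voice gender) can be combined in a single enumeration. The sketch below is hypothetical: the model names, voice catalog V(m), and gender map G are made-up stand-ins.

```python
# Hypothetical sketch of the S2S candidate enumeration: the in-flow model
# is re-voiced (never its current voice) and opponents must match voice
# gender. VOICES and GENDER are illustrative stand-ins for V(m) and G(v).
from itertools import product

VOICES = {
    "model_a": ["ava", "noah"],
    "model_b": ["mia", "leo"],
}
GENDER = {"ava": "fem", "mia": "fem", "noah": "masc", "leo": "masc"}

def candidates(m_in: str, v_in: str) -> list:
    in_voices = [v for v in VOICES[m_in] if v != v_in]   # v1 in V(m_in) \ {v_in}
    opponents = [m for m in VOICES if m != m_in]         # m in M_S \ {m_in}
    return [
        ((m_in, v1), (m, v2))
        for v1, m in product(in_voices, opponents)
        for v2 in VOICES[m]
        if GENDER[v1] == GENDER[v2]                      # G(v1) == G(v2)
    ]
```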

2.3 Ranking

Ranking follows the Text Showdown Bradley-Terry framework, where the probability that model m₁ is preferred over m₂ is:

P(m_1 \succ m_2) = \sigma(\beta_{m_1} - \beta_{m_2})

where β_m is a learned strength coefficient for model m and σ is the logistic function.

For S2S, since each model supports a variety of output voices, we adapt the MLE formula to project each model-voice pair onto its base model, so all voices contribute to a single β_m:

\hat{\beta} = \arg\min_{\beta} \frac{1}{N} \sum_k \ell\left( h^{(k)},\ \sigma\left( \beta_{\mathrm{model}(m_1^{(k)})} - \beta_{\mathrm{model}(m_2^{(k)})} \right) \right)
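A minimal sketch of this voice-projected fit: every model-voice pair maps to its base model's β before the logistic loss, so all voices share one coefficient. Plain gradient descent on the log loss stands in for the production MLE solver, and the convention h = 1 for an m₁ win is an assumption here.

```python
# Minimal voice-projected Bradley-Terry fit (illustrative, not the
# production solver). Assumes h = 1 means m1 was preferred.
import math

def fit_bt(battles, models, lr=0.05, steps=2000):
    """battles: list of ((model1, voice1), (model2, voice2), h)."""
    beta = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for (m1, _v1), (m2, _v2), h in battles:
            # Voices project onto their base model: only beta[m1], beta[m2] exist.
            p = 1.0 / (1.0 + math.exp(-(beta[m1] - beta[m2])))  # sigma(b1 - b2)
            grad[m1] += p - h          # d(log-loss)/d(beta_m1)
            grad[m2] += h - p
        for m in models:
            beta[m] -= lr * grad[m]
    return beta
```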

A model’s Elo therefore reflects aggregate capability across its full voice catalog. Per-voice analysis is conducted separately to curate tailored insights for model providers.

We designate Qwen 3 Omni as the anchor model (β = 1000) for both leaderboards, as it is a competitive open-weight model that supports both Dictate and S2S configurations.

2.4 Controls

User preferences in pairwise comparisons can be influenced by factors unrelated to model quality, such as response length, load time or formatting. We address load times by holding both responses until the first token has been received from each model, so both begin streaming at max(TTFT_A, TTFT_B). Voice gender is controlled programmatically via battle sampling as described in Section 2.2.
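The load-time control can be sketched as buffering each model's stream and releasing both only once the first token of each has arrived, i.e. at max(TTFT_A, TTFT_B). The function names below are illustrative assumptions.

```python
# Sketch of the latency control: neither response is shown until the
# first token of BOTH has arrived, so streaming starts at
# max(TTFT_A, TTFT_B). Names are illustrative.
import asyncio

async def first_token(stream):
    """Await the first chunk of a model's response stream."""
    it = stream.__aiter__()
    return it, await it.__anext__()

async def serve_battle(stream_a, stream_b):
    # Both first tokens must arrive before either response is released.
    (it_a, tok_a), (it_b, tok_b) = await asyncio.gather(
        first_token(stream_a), first_token(stream_b)
    )
    return (tok_a, it_a), (tok_b, it_b)   # now stream both to the client
```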

Baseline Controls

We use the augmented Bradley-Terry model from Text Showdown for Baseline and Style controls, which jointly estimates model strength coefficients β and control parameters γ for confound features φ.

We control for side (position of Response A vs. B) and in_flow (which response comes from the user’s in-flow model) in the Baseline rankings. In Dictate, γ_side = −0.01 and γ_in_flow = 0.01, whereas in S2S, γ_side = 0.05 and γ_in_flow ≈ 0. The side coefficient in S2S reflects a slight first-response bias most likely due to Response A’s audio being always played first, giving it a marginal advantage which is controlled for by default. The near-zero in_flow coefficients confirm users show no systematic preference for their current model, validating our voice-switching strategy (Section 2.2).

Style Controls

We additionally control for verbosity (token count difference between responses) and formatting richness (markdown rendering difference) as features for our Style controlled rankings. We observe that users prefer longer, more detailed responses in both Dictate and S2S. Markdown is the strongest confound in Dictate (γ = 0.31), where rich formatting strongly influences preference. In S2S it is more modest (γ = 0.07), as users primarily evaluate the audio but still consult the transcript displayed alongside each response.
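In the augmented model, the win probability becomes σ(β₁ − β₂ + γ·φ), where φ holds per-battle confound differences (verbosity delta, markdown delta) and γ absorbs their effect so β stays content-driven. A hedged sketch, with illustrative γ values:

```python
# Sketch of the augmented Bradley-Terry win probability with style
# controls. gamma values and feature order are illustrative assumptions.
import math

def win_prob(beta1, beta2, gamma, phi):
    """sigma(beta1 - beta2 + gamma . phi); phi holds confound deltas."""
    z = beta1 - beta2 + sum(g * f for g, f in zip(gamma, phi))
    return 1.0 / (1.0 + math.exp(-z))

# Equal-strength models, but response 1 is longer and markdown-richer:
# the confound term, not beta, drives the predicted preference.
p = win_prob(0.0, 0.0, gamma=[0.2, 0.31], phi=[1.5, 1.0])
```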

Our leaderboard shows both baseline and style-controlled rankings through a toggle for full visibility on models’ performance.

3. Results

Our initial release evaluates 11 frontier voice models as of March 20, 2026 across Dictate and S2S (52 model-voice pairs), covering the most capable API and open-source models that we self-host.

3.1 Rankings

Rankings on Voice Showdown are real-time and updated once a day. The following rankings reflect scores as of March 18, 2026.

Dictate:

| Rank | Model | Elo | 95% CI | Elo (Style Control) | 95% CI |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 3 Pro | 1073 | [-10, +9] | 1044 (-29) | [-16, +14] |
| 1 | Gemini 3 Flash | 1068 | [-10, +12] | 1043 (-25) | [-15, +12] |
| 3 | GPT-4o Audio | 1019 | [-10, +10] | 1015 (-4) | [-11, +12] |
| 3 | Qwen 3 Omni | 1000 | [-16, +15] | 1000 | [-20, +15] |
| 5 | Voxtral Small | 925 | [-26, +24] | 941 (+16) | [-25, +20] |
| 5 | Gemma3n | 918 | [-18, +14] | 943 (+25) | [-18, +19] |
| 7 | GPT Realtime | 875 | [-12, +13] | 926 (+51) | [-11, +12] |
| 8 | Phi-4 Multimodal | 729 | [-24, +29] | 797 (+68) | [-27, +28] |

Gemini 3 Pro and Gemini 3 Flash are statistically tied at #1. Among the only frontier reasoning LLMs to support native audio input in their API (unlike GPT-5 or Claude), Gemini 3 Pro and Flash share the top at 1073 and 1068 Elo. GPT-4o Audio holds a tier of its own at 1019, and the open-source models Gemma3n, Voxtral Small, and Phi-4 Multimodal trail significantly.

Style control penalizes the Gemini models (-29 and -25 Elo), primarily due to their response verbosity. GPT-4o Audio is the cleanest signal on the leaderboard, moving just 4 Elo under style control, meaning its ranking is almost entirely content-driven. GPT Realtime benefits most from style control (+51 Elo), with markdown accounting for the majority of its gain, suggesting it produces competitive content but presents it more plainly than its peers.

S2S:

| Rank | Model | Elo | 95% CI | Elo (Style Control) | 95% CI |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini 2.5 Flash Audio | 1060 | [-12, +12] | 1075 (+15) | [-15, +14] |
| 1 | GPT-4o Audio | 1059 | [-15, +12] | 1102 (+43) | [-11, +13] |
| 3 | Grok Voice | 1024 | [-12, +12] | 1093 (+69) | [-13, +18] |
| 3 | Qwen 3 Omni | 1000 | [-14, +15] | 1000 | [-18, +21] |
| 5 | GPT Realtime | 962 | [-13, +10] | 1015 (+53) | [-11, +10] |
| 6 | GPT Realtime 1.5 | 920 | [-9, +15] | 973 (+53) | [-15, +11] |

Gemini 2.5 Flash Audio and GPT-4o Audio are statistically tied at #1 in the baseline rankings. After style control, GPT-4o Audio pulls ahead (1102 vs 1075), and Grok Voice jumps from #3 to a close #2 (1093). Unlike Dictate, where models gain or lose Elo depending on their formatting behavior, S2S models gain under style control relative to the anchor due to response length. Grok Voice benefits the most from style control, driven almost entirely by response length, meaning its raw #3 ranking undersells its actual performance quality. Gemini 2.5 Flash Audio’s verbosity is closest to the anchor’s (as with its Dictate counterparts), so it benefits the least from style control.

GPT Realtime models struggle and both rank below the anchor with their baseline Elos. While GPT Realtime 1.5 performs strongly on multi-turn conversation in Audio MultiChallenge and function calling, Voice Showdown’s in-the-wild data surfaces its struggles on multilingual prompts and short, acoustically challenging utterances. Audio understanding accounts for close to half of its losses, which can cascade into entirely unrelated responses and language-switching (Section 3.2) for a significant number of prompts.

User verbatims post-battle confirm this:

“I said I have an interview today with Quest Management and instead of answering, it gave me information about ‘Risk Management’.”

“GPT Realtime 1.5 thought I was speaking incoherently and recommended mental health assistance, while Qwen 3 Omni correctly identified I was speaking a Nigerian local language.”

As a result, GPT Realtime 1.5 loses roughly three out of four head-to-head battles against its predecessor, GPT Realtime. However, it improves on content-quality failures when it correctly deciphers user utterances and intent, suggesting the gap is specific to speech understanding rather than broad reasoning capability.

Our post-battle feedback for S2S allows for similar failure insights for each model we evaluate.

Qwen 3 Omni and Gemini 2.5 Flash Audio fail primarily on speech generation, whereas GPT Realtime 1.5 failures are dominated by audio understanding (51%). Grok Voice’s failures are evenly balanced.

3.2 Multilingual Performance

Multilingual capability, when it comes to both speech understanding and generation, is one of the clearest differentiators of model performance on Voice Showdown.

In Dictate, the Gemini 3 models lead across all languages. In S2S, GPT-4o Audio leads in most non-English languages, while GPT Realtime 1.5’s win rate falls below 50% in every non-English language.

Language Mismatch. A meaningful share of losses for certain models stems from language mismatch. GPT Realtime 1.5 responds in English on ~20% of non-English prompts, compared to ~10% for GPT Realtime and ~7% for Gemini 2.5 Flash Audio and GPT-4o Audio, with Grok Voice the most stable. The problem runs in both directions: models also respond in non-English on English prompts at elevated rates, sometimes carrying over context from earlier in the conversation after the user has switched languages, and sometimes mishearing the prompt entirely and generating unrelated content in the wrong language. Both patterns are most pronounced on short prompts in the bottom quartile of durations (under roughly 8 seconds), where limited acoustic context and higher variability in recording conditions, such as background noise, make language identification harder. This is not a tail-language problem: mismatch occurs frequently in well-supported languages including Hindi, Spanish, Arabic, and Turkish.
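The mismatch rates above can be computed by comparing the detected language of the prompt and the response per battle. A minimal sketch, where `detect_language` is a stand-in for any language-ID call and not part of the Showdown pipeline:

```python
# Illustrative language-mismatch rate per model. `detect_language` is a
# hypothetical stand-in for a language-ID function over transcripts.
from collections import defaultdict

def mismatch_rates(battles, detect_language):
    """battles: iterable of (model, prompt_text, response_text)."""
    counts = defaultdict(lambda: [0, 0])          # model -> [mismatches, total]
    for model, prompt, response in battles:
        counts[model][1] += 1
        if detect_language(prompt) != detect_language(response):
            counts[model][0] += 1
    return {m: miss / total for m, (miss, total) in counts.items()}
```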

Models present in both leaderboards (Qwen 3 Omni, GPT Realtime, GPT-4o Audio) show far fewer language-mismatch failures in Dictate (outputting text) than in S2S (outputting speech), despite both receiving native audio input. The gap points to insufficient multilingual coverage in S2S post-training.

3.3 Prompt Duration & Conversational Depth

S2S failure patterns shift with both the conversational depth at which a battle is served and the duration of the individual user prompt.

Conversational depth. On Turn 1 battles, Content Quality accounts for 23% of model failures, but by Turn 11+ it becomes the primary loss driver at 43%. Most models decline in win rate as conversations extend across multiple turns and sessions, while GPT Realtime variants marginally improve, which is consistent with their relative strength on longer context evaluations.

Prompt duration. Failure categories shift predictably with prompt audio duration. Short prompts under 10 seconds are dominated by audio understanding failures (38%), while long prompts over 40 seconds shift toward content quality (31%) as the leading failure mode.

3.4 Acoustic Signals

Our large-scale, in-the-wild data enables rich acoustic analysis beyond what typical user studies can surface. We find that lower quartile SNR audio, generally featuring background noise, reverberation, or distant microphone placement, is associated with ~12% more audio understanding failures in S2S, pointing to a post-training gap that more acoustically diverse training data could address. The top performing Gemini models are the most robust to these conditions, while GPT Realtime 1.5 shows the highest sensitivity, degrading roughly twice as much as the field average.

Furthermore, our raw corpus enables prompt-level characterization across dimensions like speaking rate, accent distribution, recording environment, paralinguistic features, and disfluency patterns. This allows for a nuanced analysis of gaps in existing models and helps guide tailored data collection and evaluation suites for targeted improvements.

4. Conclusion

Voice Showdown provides an in-the-wild evaluation of voice AI models grounded in real user workflows, across 11 models and 60+ languages. Beyond rankings, it surfaces granular insights on where and why models fall short, giving users a transparent basis for model selection and model providers targeted signal for improvement in real-world settings.

What’s Next

We plan to expand Voice Showdown to full-duplex evaluation, where interruptions, barge-ins, and overlapping speech emerge naturally and cannot be reduced to side-by-side preference judgments. This motivates a ranking methodology beyond pairwise comparisons, designed to capture the dynamics of concurrent, bidirectional speech.

Copyright 2026 Scale Inc. All rights reserved.

TermsPrivacy