Science of Data · Post-Training · 7/23/2025

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean M. Hendryx


Learn how rubric-based reward signals enable reinforcement learning in non-verifiable domains, extending RLVR beyond math and code benchmarks.

Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with Group Relative Policy Optimization (GRPO). Our best RaR method yields up to a 28% relative improvement on HealthBench-1k over simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller judge models to align more closely with human preferences and to sustain robust performance across model scales.
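To make the mechanism concrete, here is a minimal Python sketch of the two ingredients the abstract names: a checklist-style rubric scored criterion by criterion and aggregated into a scalar reward, and GRPO's group-relative advantage normalization. All identifiers (`Criterion`, `rubric_reward`, and so on) are illustrative assumptions, not the paper's API, and the keyword check stands in for the LLM judge that RaR actually uses.

```python
import statistics
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # human-readable check, e.g. "advises seeing a doctor"
    weight: float     # relative importance within the rubric
    keyword: str      # toy proxy for an LLM judge (see judge_satisfies)

def judge_satisfies(criterion: Criterion, response: str) -> bool:
    """Decide whether one rubric criterion is met.

    In RaR this is an LLM judge prompted with the criterion and the
    candidate response; a keyword check stands in so the sketch runs.
    """
    return criterion.keyword.lower() in response.lower()

def rubric_reward(rubric: list[Criterion], response: str) -> float:
    """Weighted fraction of satisfied criteria: a scalar reward in [0, 1]."""
    total = sum(c.weight for c in rubric)
    met = sum(c.weight for c in rubric if judge_satisfies(c, response))
    return met / total

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: z-score each sampled response's reward
    against the group of responses drawn for the same prompt."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    rubric = [
        Criterion("Recommends consulting a clinician", 2.0, "doctor"),
        Criterion("Flags emergency warning signs", 1.0, "emergency"),
    ]
    samples = [
        "See a doctor, and go to the emergency room if symptoms worsen.",
        "Drink water and rest.",
    ]
    rewards = [rubric_reward(rubric, s) for s in samples]
    print(rewards)                             # [1.0, 0.0]
    print(group_relative_advantages(rewards))  # ~[1.0, -1.0]
```

Two design points carry over from the paper's framing: per-criterion weights let essential checks dominate nice-to-haves, and normalizing rewards within each prompt's group of samples removes prompt-level difficulty from the learning signal, which is what lets a bounded rubric score drive stable on-policy updates.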

