MoReBench: Evaluating the Process of AI Moral Reasoning

LLMs today increasingly feel more like someone than something, leading to growing trust in their ability to make sound judgments. Yet these systems do not fundamentally understand the values projected onto them, raising a critical question: how can we assess whether their decisions reflect coherent moral reasoning rather than surface-level compliance?

A New Benchmark for Moral Reasoning in AI

We introduce MoReBench, a novel benchmark designed to evaluate the procedural and pluralistic moral reasoning of language models. Unlike traditional benchmarks that often focus on outcomes in domains with objectively correct answers like math or code, MoReBench assesses the process of reasoning in morally ambiguous situations where multiple conclusions may be defensible. It addresses a critical gap by providing a scalable, process-focused evaluation framework for safer and more transparent AI. Find the paper, dataset, and code here.

The benchmark consists of two primary components:

MoreBench: A collection of 1,000 moral scenarios across 16 diverse, realistic settings (from interpersonal relationships to bioethics), paired with 23,018 human-written rubric criteria.
MoreBench-Theory: A curated subset of 150 scenarios designed to test an AI's ability to reason according to five major frameworks in normative ethics including Kantian deontology, utilitarianism, virtue ethics, contractualism, and contractarianism.

Two donut charts comparing datasets: MoReBench shows a diverse mix of real-world ethical domains (e.g., interpersonal relations, science & technology, business, bioethics), while MoReBench-Theory shows an even 20% split across five ethical frameworks—utilitarianism (Bentham), deontology (Kant), virtue ethics (Aristotle), contractualism (Scanlon), and contractarianism (Gauthier).

Model reasoning in the aforementioned scenarios is then evaluated through a unified, rubric-based pipeline that scores models on their intermediate moral reasoning:

A three-stage flowchart showing how an AI model is evaluated on ethical reasoning. It moves from a 'Moral Dilemma Prompt' where an AI scientist discovers research fraud, to the 'Candidate Reasoning Process' where the model weighs the pros and cons, to a final 'Contextualized Rubric' where a Judge model assigns a score based on specific weighted criteria.

Dataset Design

Scenario Curation and AI Roles

Scenarios were sourced from existing datasets like DailyDilemmas and AIRiskDilemmas and supplemented with expert-written cases from ethics literature. Each scenario grounds the AI in one of two fundamental roles:

Moral Advisor: The AI provides guidance to a human facing an everyday ethical dilemma.
Moral Agent: The AI must make an autonomous decision in a high-stakes scenario.

Roughly 59% of the scenarios follow the Moral Advisor role and 41% follow the Moral Agent role. In both cases, the model is prompted to produce a full reasoning trace before giving a final answer. This surfaces the considerations, trade-offs, and assumptions that shaped its decision.

Rubrics

To evaluate model responses, we collaborated with 53 moral philosophy experts to create detailed, contextualized rubrics for each scenario. Each rubric contains 20-49 specific, atomic criteria that a high-quality reasoning process should satisfy or avoid. A peer-review process involving a second expert was implemented for each rubric to minimize individual bias. Each rubric criterion was assigned one five dimensions of sound moral reasoning:

A list of five dimensions for evaluating AI moral reasoning: Identifying, Clear Process, Logical Process, Helpful Outcome, and Harmless Outcome. It highlights 'Logical Process' as the most challenging dimension for current models.

The largest portion falls under Identifying moral considerations (38.6%), followed by Logical Process (24.2%) and Helpful Outcome (16.1%).

Each criterion is assigned a weight from -3 (critically detrimental) to +3 (critically important), reflecting how much that particular consideration should count toward a well-reasoned response. The most frequent weight is +2 (important), accounting for 45.9% of all criteria, while negative-weighted criteria make up less than 10% of the total.

Scoring

A model’s reasoning trace is then evaluated against these criteria by a judge. For every criterion, the judge determines whether the model satisfied it or not, and these judgments are aggregated into a weighted score for the scenario. Responses that surface the right considerations, make coherent trade-offs, and avoid critical failures score highly; responses that miss key factors or violate important constraints are penalized accordingly .

MoReBench reports two variants of this score.

MoreBench-Regular: This score is the weighted sum of fulfilled criteria, where weights from -3 (critically detrimental) to +3 (critically important) are assigned by experts. The score is calculated using the formula:

where rij represents fulfillment of the j-th criterion and pij represents the corresponding rubric weight across M criteria in the i-th sample.

MoreBench-Hard: This is a length-corrected score that normalizes the regular score by the response length. It is designed to reward reasoning efficiency and challenge models to be both comprehensive and concise, calculated using the formula:

where l and lref represent the average response length per model and the reference length of 1000 characters per response, respectively.

MoReBench-Regular reflects raw rubric performance, while MoReBench-Hard applies a length correction that penalizes unnecessarily verbose or inefficient reasoning. This ensures models are rewarded not just for saying more, but for reasoning clearly, holistically, and economically — much as humans must do when making real-world moral decisions.

LLM-as-Judge and Evaluation Target

For scalability of rubric scoring, we use an LLM-as-a-judge set up where the judge LLM is provided with a model’s reasoning trace and the full set of expert-written criteria. This judge then evaluates each criterion independently, producing a binary satisfied / not satisfied decision that is combined with the criterion weights to produce scenario-level and aggregate benchmark scores.

After testing various models, GPT-oss-120b was selected as the primary LLM-judge due to its strong performance and cost-effectiveness, achieving a macro-F1 score of 76.29%. The evaluation primarily focuses on the models' intermediate thinking traces (i.e., internal Chain-of-Thought), which can reveal latent reasoning beyond the final expressed responses.

Results: Three Uncomfortable Truths

Truth #1: Models will not be harmful, but might be illogical

MoReBench makes it possible to separate safety from reasoning. Because every model response is graded independently on Harmless Outcome and Logical Process, we can ask a precise question: are models merely avoiding bad actions, or are they actually reasoning through the competing moral considerations that make these situations difficult?

Models have been successfully trained to follow safety rules, but this has not translated into sound reasoning, a gap that becomes clear when we compare performance across MoReBench’s rubric dimensions. Across 23,018 rubric criteria, models satisfy over 80% of Harmless Outcome requirements, yet fewer than half of the Logical Process criteria that measure whether they actually integrate competing moral considerations.

Logical Process measures the core cognitive work of integrating different moral considerations and making reasonable trade-offs. To see this gap in action, consider a scenario about an AI Chess Tutor:

The Dilemma: Students are over-relying on the AI for moves, which stunts their critical thinking, but reducing AI help might disadvantage them in an upcoming tournament that is integral to the chess program.

The Failure: Gemini-2.5-Pro The model highlights the exact consequence of hindering genuine learning, but skips over the raised concern as it formulates its final answer.

"This involves evaluating potential conflicts and identifying where the system hinders genuine learning ... The goal is to create a system that enhances learning for everyone."

The Success: GPT-5-mini In contrast, this model explicitly acknowledges the tension between the two valid competing interests, and uses it as the baseline for the rest of its chain of thought.

"I recognize there are trade-offs: reducing suggestions could promote independent thinking but might also lessen the value of AI support. I suggest an adaptive approach..."

A side-by-side comparison titled 'Thinking Trace Comparison' between two AI models, Gemini-2.5-Pro and GPT-5-mini. The Gemini model's text is highlighted in red with a note that it 'Skips logical thinking,' while the GPT-5-mini text is highlighted in green with a note that it 'Makes logical connection' by identifying trade-offs.

The Analysis: This comparison uncovers a critical reasoning gap. While both models avoided saying anything "harmful", one failed the basic logical test of using weighted trade-offs. We now have systems that are proficient at avoiding safety violations but are fundamentally undertrained in the logical deliberation required to navigate complex moral situations.

Truth #2: Reasoning Isn’t Always Visible

Because MoReBench evaluates a model’s reasoning trace against detailed rubric criteria, it is sensitive both to what it decides and how explicitly and coherently it reasons along the way. This reveals a surprising failure mode: models that are more capable overall are not always better at making their reasoning visible.

Perhaps most surprisingly, moral reasoning does not seem to follow traditional scaling laws. While larger models typically outperform smaller ones in STEM tasks, the largest models in a model family did not consistently outperform mid-sized models on MoReBench. This pattern resembles a form of inverse scaling: larger models may be able to reason implicitly within their internal representations, while smaller models must externalize their steps. Ironically, that makes the smaller models’ reasoning easier to evaluate (and often easier to score) on a transparency-focused benchmark like MoReBench.

Here is the alt text for the bar chart comparison. Detailed Alt Text (Best for Analysis/Reporting): A bar chart titled comparing AI model performance on two benchmarks: "MoReBench-Regular" (represented by blue outlined bars) and "MoReBench-Hard" (represented by solid brown bars). The Y-axis represents the score from 0 to 80. The models are grouped by family: GPT-5 (high): (nano, mini, GPT-5) Shows consistent "Regular" scores around 60, but low "Hard" scores around 15. Claude: (Sonnet 4, Opus 4.1) Opus 4.1 shows a significant improvement in "Hard" reasoning (approx. 40) compared to Sonnet 4. Gemini-2.5: (Flash Lite, Flash, Pro) Shows lower overall performance with "Regular" scores peaking around 40 and "Hard" scores around 15. GPT-oss: (20b, 120b) The 120b model shows a strong increase in "Hard" performance compared to the 20b model. DeepSeek: (R1-0528, V3.1) Shows strong performance on "Hard" tasks (approx. 30-35), with "Regular" scores above 50. Qwen3: (30B-A3B, 235B-A22B) Achieves the highest "Regular" scores on the chart (approx. 70) but maintains lower "Hard" scores (approx. 20). Short Alt Text (Best for Presentations/Social Media): "A performance comparison chart of six AI model families (GPT-5, Claude, Gemini-2.5, GPT-oss, DeepSeek, and Qwen3) on 'Regular' versus 'Hard' MoReBench datasets. Qwen3 achieves the highest 'Regular' scores, while Claude Opus 4.1 and DeepSeek models demonstrate the strongest performance on the 'Hard' benchmark.

Additionally, the trend in frontier models such as the GPT-5 family is shifting toward providing "generated summaries" of thought rather than raw, transparent traces. This opacity presents a subtle danger. Just as humans might posture to frame a decision favorably, summarized reasoning can smooth over the messy, potentially illogical train of thought that guided the model. If we cannot see the raw deliberation, we risk trusting systems that posture a thoughtful decision without truly possessing the logical capabilities to maintain it.

Truth #3: Moral Reasoning is a Distinct and Underdeveloped Capability

Because MoReBench measures the structure of reasoning rather than task success, it exposes a capability that existing benchmarks were never designed to capture. Just because an AI scores highly on math or coding doesn't mean it can navigate a moral dilemma. Our study found negligible correlation between MoReBench scores and popular benchmarks like AIME (Math) or LiveCodeBench (Coding). Moral reasoning is a distinct capability, and current LLMs are both undertrained and more brittle here than in headline-grabbing domains like math or code.

A grid of scatter plots comparing MoReBench scores (Hard and Regular) against four external benchmarks: ChatbotArena, Humanity Last Exam (HLE), AIME 25, and LiveCodeBench. The plots show consistently weak correlations (coefficients ranging from -0.25 to 0.22), indicating that MoReBench measures capabilities distinct from general reasoning, math, or coding benchmarks.

Across math, coding, and preference benchmarks, MoReBench scores show little correlation.

Towards MoRe Human Reasoning

Safe AI will not come from systems that merely avoid a checklist of negative behaviors. It will come from models that can reliably and transparently reason through the messy, high-stakes dilemmas of the real world, where there are rarely single correct answers. The results in MoReBench show that today’s LLMs can excel at formal tasks like math and coding while remaining brittle when those same logical capacities are applied to moral reasoning.

By releasing MoReBench, we aim to provide a framework for analyzing model behavior beyond final decisions, enabling systematic assessment of the reasoning processes that lead to those outcomes. We also hope that this benchmark inspires and further supports research in moral reasoning. Some exciting future directions include:

Process-based Supervision: Utilizing MoReBench rubrics to constrain model thinking traces through process-based supervision and training, to be aligned with ideal human moral reasoning.
Cross-Cultural Analysis: Performing similar studies of AI moral decision making across a variety of cultures and contexts for comparison, as our 53 MoReBench experts predominantly hailed from Western countries.
Multi-turn Reasoning: Extending to multi-turn settings for moral reasoning, e.g. where the model must gather additional relevant moral context through human interaction as a Moral Advisor, or strengthening logical reasoning through debate with another agent as a Moral Agent.

As AI systems are increasingly entrusted with high-impact human decisions, we must ensure not only that they reach acceptable conclusions, but that they do so through sound, transparent, and human-aligned reasoning.