Latest Posts

Blog

News, insights, and updates from our team.

4 articles

How Profession Shapes LLM Usage: Insights from SEAL Showdown

11 min read

We analyze 580k+ production prompts and 100k+ preference battles from SEAL Showdown to study how profession shapes LLM usage. We find that professional background—independent of topic—predicts prompt difficulty, task type, and model preference, with domain experts asking harder in-domain questions and ranking models differently. These results motivate profession-aware evaluation of LLMs in expert workflows.

Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang

MoReBench: Evaluating the Process of AI Moral Reasoning

11 min read

MoReBench is a benchmark designed to evaluate the procedural moral reasoning of large language models. Using expert-authored rubrics across diverse ethical scenarios, it scores models on the structure and coherence of their reasoning rather than task outcomes. Our findings show that moral reasoning remains weakly correlated with established benchmarks and warrants targeted evaluation and training.

Brandon Handoko, Matthew Siegel, Mike Lee

Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections

8 min read

In our recent work, Imitation Learning for Multi-Turn LM Agents via On-Policy Expert Corrections, we expose the problem of covariate shift in software engineering (SWE) LM agents and propose a simple, practical fix that significantly improves training efficiency and agent robustness.

Niklas Lauffer

Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops

10 min read

We demonstrate that reinforcement learning can be used to fine-tune agents within realistic enterprise environments, leveraging task-specific feedback and structured rewards to substantially improve performance metrics compared to baseline models.

Jerry Chan, Vijay Kalmath, George Pu, Sam Denton