DrugDiscoveryBench

What DrugDiscoveryBench Measures

DrugDiscoveryBench evaluates how reliably frontier coding agents perform the multi-step, computational, and information-retrieval work of early-stage drug discovery. The 82 tasks span the early-discovery pipeline: target identification and validation, hit identification across patents, databases, and literature, hit-to-lead and structure-activity analysis, and lead optimization. The benchmark stops before candidate selection; DMPK, toxicology, formulation, and clinical work are out of scope.

Release Artifacts

DrugDiscoveryBench is released as a set of public artifacts so that results can be reproduced and extended.

Research paper: the full benchmark description, methodology, per-setting results, and failure analysis.
DrugDiscoveryBench environment and evaluation harness: the adapted Biomni build the agents run inside, the container definition, and the Harbor task format and LLM-judge grading setup.
Task set and rubrics: 82 tasks, each with its prompt, ground truth answer, and weighted rubric.

Reproducibility caveats, including temporal drift on date-stamped tasks, are detailed in the paper.

Key Takeaways

Even the best-performing models successfully completed only a little over half of the benchmark’s tasks. Pass rate (our primary metric) scales cleanly with test-time compute within each model family: GPT-5.5 in Codex climbs from 27.6 at low effort to 45.1 at xhigh, and Opus 4.8 in Claude Code climbs from 29.3 at low effort to 47.2 at max, each gaining roughly 18 points for four to five times the output tokens. One generation of frontier development is worth 8 to 20 points on this benchmark, comparable to the spread across the three current frontier models.

The frontier is tight. GPT-5.5 in mini-SWE-agent leads at 51.6, with Gemini 3.5 Flash (50.0 in Gemini CLI) and Opus 4.8 (47.2 in Claude Code) close behind. The bottom of the field sits roughly 25 points lower: MiniMax M3 at 22.8, Sonnet 4.6 at 24.0, Opus 4.6 at 27.7.

Harness choice can move scores within a model family. GPT-5.5 scores 6.5 points higher in the open-source mini-SWE-agent harness (51.6) than in its native Codex (45.1). The effect is smaller for other models (Opus 4.8 is essentially flat between Claude Code and mini-SWE-agent), but the GPT-5.5 swing shows the read-edit-run loop and context management are a real lever distinct from the underlying model.

What separates models is unguided high-level planning. The expert-method recovery experiment supplies agents with the step sequence and tools (never the answer); under that setup at least one agent solves 76 of 82 tasks, locating the gap in planning rather than execution.

When agents fail, it is usually because they drop a critical constraint mid-trajectory (for example, forgetting to filter for a specific disease) or misinterpret which data representation to use. See the paper for trajectory-level analysis.

Key Metrics

82 tasks across 7 capabilities and 4 lifecycle stages
920 total rubric criteria
226 Biomni functions across 22 domains, plus a 76-file data lake and 117 preinstalled scientific packages
16 model×harness settings on the leaderboard
Scores average three runs per setting (not best-of-n), each under a 120-minute agent timeout
64 of 82 tasks solved by at least one setting; 18 solved by none

How to Read the Leaderboard

Metrics

Each task is graded by an LLM judge against an expert-written weighted rubric, producing an outcome score from 0 to 100 (see Scoring and Judging for how the rubric is scored). We summarize agent performance two ways.

Pass rate is the share of tasks a setting solves, where a task counts as solved when its outcome score is 100. It treats each task as a clean success or failure, which reflects how a practitioner would actually use the result: a partially correct answer to a single-answer task is still a wrong answer. Non-completions (refusals, timeouts, errors, or missing answer files) score 0, so they count against the pass rate rather than being excluded.

Mean outcome score is the average outcome score across all tasks. Unlike pass rate, it credits partial progress, so a setting that gets most of a rubric right without fully solving the task still earns credit. We report it in the paper rather than on the leaderboard, as a finer-grained companion to the headline pass rate.
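The two summaries can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's harness code: the function names and the task-id-to-score dictionary layout are hypothetical, and averaging per-run pass rates across runs is an assumption about how the three-run average is taken.

```python
from statistics import mean

def pass_rate(outcome_scores, task_ids):
    """Percent of tasks whose outcome score is exactly 100.

    outcome_scores maps task id -> score in [0, 100]. A task with no entry
    is treated as a non-completion and scores 0 (it is not excluded).
    """
    solved = sum(1 for t in task_ids if outcome_scores.get(t, 0.0) == 100)
    return 100.0 * solved / len(task_ids)

def mean_outcome_score(outcome_scores, task_ids):
    """Average outcome score over all tasks, crediting partial progress."""
    return mean(outcome_scores.get(t, 0.0) for t in task_ids)

def averaged_pass_rate(runs, task_ids):
    """Assumption: the reported figure is the mean of per-run pass rates."""
    return mean(pass_rate(run, task_ids) for run in runs)

# Toy data (not benchmark results): t4 has no answer file, so it scores 0.
tasks = ["t1", "t2", "t3", "t4"]
run = {"t1": 100, "t2": 80, "t3": 100}
print(pass_rate(run, tasks))           # 2 of 4 tasks solved
print(mean_outcome_score(run, tasks))  # partial credit for t2 is counted
```

The toy run shows why the two metrics diverge: the 80-point task adds nothing to the pass rate but still lifts the mean outcome score.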

Measuring Settings

The leaderboard reports 16 settings; the paper tests more. Each closed-source frontier model appears in its native harness and in mini-SWE-agent (the universal fallback); open-source models appear in mini-SWE-agent only. Token cost is reported alongside pass rates as mean output tokens per task. Compute is a first-class axis.

Coverage Signatures

Coverage signatures differ by model. GPT-5.5 in Codex never fails to finish a task across all three effort settings. Opus 4.8 loses 5 to 6 tasks per setting to false-positive refusals on routine structural and patent analysis of already-published data.

Harness is listed per run (Codex, Claude Code, Gemini CLI, mini-SWE-agent). The read-edit-run loop and context management are part of what is being measured alongside the model itself, and the same model in different harnesses can score more than 6 points apart.

Dataset Design

Tasks and Answers

The 82 tasks are authored, reviewed, and graded by domain experts in pharmaceutical and biomedical research. Each is grounded in a real artifact that is either attached to the task or must be found online: a PDB structure, a granted patent, a research paper, or a database record. Each resolves to a single verifiable answer (an integer, a float, a SMILES or InChIKey string, a ranking, or a short table) at the end of a multi-step retrieve, parse, filter, and compute workflow.

A Representative Task

A representative target-identification task asks the agent to name the top-ranked protein drug target for basal cell carcinoma that has at least one pathogenic or likely-pathogenic variant for the disease in a clinical variant database, is localized to the cell membrane or secreted, is not annotated as a tumor suppressor, and wins a ranking by UniProt isoform count with ties broken by pathogenic-variant count. The output is constrained to a gene symbol and full protein name (ground truth: SMO, protein smoothened).

Two Taxonomies

Tasks are organized along two axes. The capability axis sorts the work into seven categories: structural reasoning, database screening, patent mining, target ID and genetics, cheminformatics, molecular biology, and SAR/affinity ranking. The lifecycle axis maps each task onto the early-discovery pipeline; hit-to-lead is densest at 34 tasks, lead optimization thinnest at 10. Eight additional tasks cover adjacent biomedical work: protein engineering, proteomics, and cancer genomics.

Workflow length is measured in expert-defined steps, not agent tool calls: 3 to 12 per task, mean 6.3, with a bimodal distribution. Compact lookup-and-compute tasks cluster at the low end; longer pipelines tail out, where a slip at any one dependent step sinks the whole result.

Authoring Pipeline

Every task passes through a curated authoring pipeline: expert authoring, second-expert review, BenchGuard automated robustness checks, a solvability assessment against the author's playbook, SME or senior-annotator fixes, and independent QC sign-off before release. Tasks that can't be made robust are dropped.

Evaluation Methodology

Environment

Agents work inside Scale-Biomni, an adapted Biomni build that exposes 226 functions across 22 domains alongside a 76-file data lake and 117 preinstalled scientific packages. Tools are made available as an ordinary Python library: the agent imports and chains functions natively, with no MCP wrapper between the model and the tool surface. General-purpose coding agents are evaluated as-is, with no bespoke biomedical scaffolding. The interface is general; the work is biomedical.

Task Presentation

Tasks are presented in Harbor format. The agent receives a task-specific prompt followed by a shared footer pointing at the on-container references, the data lake, and the requirement to write only the final answer to /workspace/answer.md. Native harnesses are used where available (Codex, Claude Code, Gemini CLI), with mini-SWE-agent as the cross-model fallback. The harness handles the read-edit-run loop, tool invocation, and context management.

Reasoning Effort

Reasoning effort is the second axis. Each frontier family is run at three settings (low, medium, ceiling), where ceiling is the model's highest available level: xhigh for GPT, max for Opus, high for Gemini. Every cell is three runs averaged under a 120-minute agent timeout. Runs that fail for transient reasons (e.g., network errors) are retried.

Expert-Method Recovery

A separate expert-method recovery experiment isolates planning from execution. Three frontier agents (Opus 4.8, GPT-5.5, Gemini 3.5 Flash) are given expert-authored playbooks alongside the original task prompt; the playbooks describe the sequence of steps and which tools to use, but never the answer. At least one agent solves 76 of the 82 tasks under this setup, locating the primary gap in unguided high-level planning rather than tool execution.
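The "at least one agent" count is a set union over per-agent results. The sketch below illustrates that bookkeeping only; the function name, the agent labels, and the task ids are hypothetical, and the data are toy values rather than benchmark results.

```python
def solved_by_any(solved_by_agent):
    """Return the set of task ids solved by at least one agent.

    solved_by_agent maps an agent label to the set of task ids it solved.
    """
    return set().union(*solved_by_agent.values())

# Toy example with three agents and five tasks (illustrative only).
all_tasks = {"t1", "t2", "t3", "t4", "t5"}
solved = {
    "agent_a": {"t1", "t2"},
    "agent_b": {"t2", "t3"},
    "agent_c": {"t1", "t4"},
}
covered = solved_by_any(solved)
unsolved = all_tasks - covered
print(len(covered), "solved by at least one agent;", sorted(unsolved), "solved by none")
```

Comparing this union with and without playbooks is what separates a planning gap from an execution gap: tasks that enter the union only when the step sequence is supplied were within the agents' execution ability all along.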

Scoring and Judging

An LLM judge (GPT-5.4) grades the agent's final answer.md against an expert-written rubric in a single pass. Because the rubric is a fixed list of atomic checks and most tasks resolve to a single value or short structured object, the judge's job is closer to checking discrete claims than to open-ended assessment, which keeps grading stable across reruns. Cross-checking GPT-5.4 against Gemini 3.5 Flash and Sonnet 4.6 as alternate judges shows near-perfect agreement, since the rubrics are terse and leave little room for interpretation.

Each task carries two rubric types. Outcome criteria verify the final answer; process criteria check the methodology path (did the agent retrieve the right structure, query at the right pH, apply the stated filter). Process criteria are collected and released for research purposes but do not enter the primary score.
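One way to picture how a weighted rubric yields a 0 to 100 outcome score is the sketch below. The exact aggregation formula is not spelled out here, so the weight-normalized fraction is an assumption, as are the `Criterion` fields and the example criteria. The only properties taken from the description above are that criteria carry weights and that process criteria are recorded but excluded from the score.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    weight: float
    kind: str   # "outcome" or "process" (hypothetical labels)
    met: bool   # the judge's verdict for this atomic check

def outcome_score(criteria):
    """Assumed formula: 100 * (weight of met outcome criteria) / (total outcome weight)."""
    outcome = [c for c in criteria if c.kind == "outcome"]
    total = sum(c.weight for c in outcome)
    if total == 0:
        raise ValueError("rubric has no weighted outcome criteria")
    earned = sum(c.weight for c in outcome if c.met)
    return 100.0 * earned / total

# Illustrative rubric: the process criterion is ignored by the score.
rubric = [
    Criterion("reports the correct gene symbol", 3.0, "outcome", True),
    Criterion("reports the full protein name", 1.0, "outcome", False),
    Criterion("applied the stated filter", 1.0, "process", False),
]
print(outcome_score(rubric))
```

Under this assumed formula a task is solved only when every outcome criterion is met, regardless of how the process criteria come out.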

Because answers are mostly short-form and the corresponding rubrics are terse, a task counts as solved only if the agent scores 100%. The judge's numeric tolerance is tight: it forgives last-digit rounding on computed floats but not on identifiers, accession IDs, integer counts, or categorical answers, which must match exactly.
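The tolerance rule can be approximated deterministically, which is useful for sanity-checking an answer before submission. The sketch below is not the LLM judge's actual logic: the function names are hypothetical, "last-digit rounding" is interpreted as a difference of at most one unit in the last decimal place of the ground-truth value, and exact matching is assumed to be case-sensitive after trimming whitespace.

```python
from decimal import Decimal, InvalidOperation

def float_matches(submitted: str, expected: str) -> bool:
    """Accept a computed float that differs from the ground truth by at most
    one unit in the ground truth's last decimal place (assumed reading of
    'last-digit rounding')."""
    try:
        sub, exp = Decimal(submitted), Decimal(expected)
    except InvalidOperation:
        return False
    if not (sub.is_finite() and exp.is_finite()):
        return False
    # For "7.43" the exponent is -2, so one unit in the last place is 0.01.
    unit = Decimal(1).scaleb(exp.as_tuple().exponent)
    return abs(sub - exp) <= unit

def exact_matches(submitted: str, expected: str) -> bool:
    """Identifiers, accession IDs, integer counts, and categorical answers
    get no tolerance beyond surrounding whitespace."""
    return submitted.strip() == expected.strip()

print(float_matches("7.44", "7.43"))  # within one unit in the last place
print(float_matches("7.45", "7.43"))  # two units off
print(exact_matches("SMO", "SMO"))
```

Working in `Decimal` on the string forms avoids binary floating-point noise when deciding what the "last digit" of the ground truth is.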

Performance Comparison

GPT-5.5 (mini-SWE-agent) xhigh: 51.60±4.30
Gemini 3.5 Flash (Gemini CLI) high: 50.00±2.40
Gemini 3.5 Flash (mini-SWE-agent) high: 48.80±2.10
Claude Opus 4.8 (mini-SWE-agent) max: 46.80±1.40
Claude Opus 4.8 (Claude Code) max: 45.10±4.38
GPT-5.5 (Codex) xhigh: 45.10±4.90
Gemini 3.1 Pro (Gemini CLI) high: 41.90±4.60
GLM 5.2 (mini-SWE-agent) xhigh: 36.20±3.70
Kimi K2.7 Code (mini-SWE-agent) xhigh: 35.30±2.10
DeepSeek V4 Pro (mini-SWE-agent) xhigh: 31.70±3.20
Claude Sonnet 4.6 (mini-SWE-agent) max: 31.30±0.70
Qwen 3.7 Max (mini-SWE-agent) xhigh: 29.30±2.50
GPT-5.2 (Codex) xhigh: 29.30±3.20
Claude Opus 4.6 (Claude Code) max: 27.70±6.70
Claude Sonnet 4.6 (Claude Code) max: 24.00±0.70
MiniMax M3 (mini-SWE-agent) xhigh: 22.80±4.60
