Generated by Codex with GPT-5
What happened
Anthropic’s official research blog published “Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench,” a post about building a benchmark for agentic scientific work that is harder to game than ordinary question answering and closer to the messy workflows used in computational biology.
The motivating problem is that many AI science benchmarks still resemble exams. They test knowledge, reasoning, or a bounded simulation, but real bioinformatics work involves reading papers, choosing tools, downloading reference data, writing analysis code, dealing with noisy measurements, and deciding which evidence is strong enough to trust. Anthropic argues that this makes scientific evaluation unusually awkward: there are often many defensible methods, researcher choices can change conclusions, and some of the most valuable questions are precisely the ones humans have not solved yet.
BioMysteryBench is Anthropic’s attempt to preserve the messiness without giving up objective grading. The benchmark contains 99 bioinformatics questions written by domain experts across data types such as whole-genome sequencing, RNA-seq, single-cell RNA-seq, ChIP-seq, methylation, metagenomics, Hi-C, proteomics, and metabolomics. Each question is built around an objectively verifiable property of a dataset or orthogonally validated metadata, rather than a subjective scientific conclusion. That lets a model be graded on its final answer while still having freedom to choose its own analysis route.
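To make the idea of grading only the final answer concrete, here is a minimal Python sketch. It is not Anthropic’s grader: the post, as summarized here, does not say how answers are matched, so exact matching after normalization is an assumption, and the names normalize_answer and grade_final_answer are hypothetical.

```python
import re
import unicodedata


def normalize_answer(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def grade_final_answer(model_answer: str, accepted_answers: list[str]) -> bool:
    """Return True if the final answer matches any accepted form.

    Only the conclusion is checked. The analysis route that produced it
    is never inspected, so unusual methods are not penalized.
    (Assumption: exact match after normalization is good enough.)
    """
    normalized = normalize_answer(model_answer)
    return any(normalized == normalize_answer(a) for a in accepted_answers)


if __name__ == "__main__":
    accepted = ["Escherichia coli", "E. coli"]
    print(grade_final_answer("  escherichia coli. ", accepted))
    print(grade_final_answer("Homo sapiens", accepted))
```

The design point is that the grader takes no trajectory as input at all. A real grader would likely need more tolerant matching (synonyms, identifiers, numeric ranges), which this sketch leaves out.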
The task environment matters as much as the questions. Claude is placed in a container with canonical bioinformatics tools, permission to install more tools through package managers, and access to scientific databases such as NCBI and Ensembl. That setup turns the eval into a constrained research workspace: the model has to plan, inspect data, run tools, write code, and combine evidence instead of merely recalling facts from pretraining.
Anthropic also built in human baselines. Up to five domain experts attempted each problem from scratch. If at least one expert solved the problem, it went into the human-solvable set; after quality control, the remaining human-difficult set contained 23 problems that experts could not solve but that still had validated signal in the data. This split is one of the post’s most useful design choices because it separates ordinary expert-level performance from the more speculative question of whether frontier models can find paths that humans miss.
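The split rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark’s actual pipeline: ProblemRecord, its fields, and the signal_validated quality-control flag are hypothetical stand-ins for whatever records Anthropic kept.

```python
from dataclasses import dataclass


@dataclass
class ProblemRecord:
    problem_id: str
    expert_results: list[bool]  # one entry per expert attempt (up to five)
    signal_validated: bool      # hypothetical flag: QC confirmed signal in the data


def split_by_human_baseline(records: list[ProblemRecord]):
    """Partition problems into human-solvable, human-difficult, and excluded."""
    human_solvable, human_difficult, excluded = [], [], []
    for record in records:
        if any(record.expert_results):
            # At least one expert solved it from scratch.
            human_solvable.append(record.problem_id)
        elif record.signal_validated:
            # No expert solved it, but the answer is still recoverable.
            human_difficult.append(record.problem_id)
        else:
            # Unsolved and unvalidated: assumed to be dropped by quality control.
            excluded.append(record.problem_id)
    return human_solvable, human_difficult, excluded


if __name__ == "__main__":
    records = [
        ProblemRecord("p1", [False, True, False], True),
        ProblemRecord("p2", [False, False, False, False, False], True),
        ProblemRecord("p3", [False, False], False),
    ]
    print(split_by_human_baseline(records))
```

The third bucket matters: an unsolved problem counts as human-difficult only if validation shows the signal is really there. Otherwise a failure could simply mean the question was unanswerable.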
Why it matters
The headline result is that Claude’s bioinformatics performance improves sharply across model generations, with recent models performing at or above expert baselines on many tasks. Anthropic says newer Claude models solved much of the human-solvable set and that Claude Sonnet 4.6 and stronger models solved meaningful fractions of the human-difficult set, with Claude Mythos Preview reaching a 30% solve rate there.
The more interesting engineering result is not the absolute score but the shape of the evaluation underneath it. Anthropic found that models often solved ordinary human-solvable problems reliably, while wins on human-difficult problems were much more brittle. In Claude Mythos Preview’s own analysis, human-solvable problems tended to be either solved consistently or not at all. On human-difficult problems, correct answers were more often one-off successes across repeated attempts. That distinction matters because production scientific assistance needs repeatable procedures, not just occasional impressive trajectories.
BioMysteryBench is useful because it makes that reliability gap visible. A simple accuracy number would make the benchmark look like another capability leaderboard. Repeated attempts per problem expose whether a model has a stable method, is drawing on latent knowledge, or is sometimes stumbling into a useful analysis path. For research-engineering teams, that suggests a general eval pattern: score final outcomes, but also measure per-task consistency so brittle agent behavior is not mistaken for dependable skill.
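One way to measure per-task consistency alongside the headline score is sketched below in Python. The labels “always,” “never,” and “sometimes,” the function names, and the example data are all assumptions made for illustration. They are not taken from the post and the numbers are not benchmark results.

```python
from collections import Counter


def classify_consistency(attempts: list[bool]) -> str:
    """Label one task by how stable its outcomes are across repeated attempts."""
    if not attempts:
        raise ValueError("need at least one attempt per task")
    solved = sum(attempts)
    if solved == len(attempts):
        return "always"
    if solved == 0:
        return "never"
    return "sometimes"  # brittle: solved on some attempts, missed on others


def summarize(results: dict[str, list[bool]]) -> dict:
    """Report the mean solve rate together with the consistency breakdown."""
    per_task_rate = {task: sum(a) / len(a) for task, a in results.items()}
    labels = Counter(classify_consistency(a) for a in results.values())
    return {
        # Average of per-task solve rates (the usual headline number).
        "mean_solve_rate": sum(per_task_rate.values()) / len(per_task_rate),
        # Fraction of tasks solved at least once; flatters brittle behavior.
        "solved_at_least_once": sum(any(a) for a in results.values()) / len(results),
        "labels": dict(labels),
        "per_task_rate": per_task_rate,
    }


if __name__ == "__main__":
    # Made-up outcomes for three tasks, four attempts each.
    example = {
        "task_a": [True, True, True, True],
        "task_b": [False, False, False, False],
        "task_c": [True, False, False, False],
    }
    print(summarize(example))
```

Reporting the label counts next to the mean keeps a one-off success like task_c from being read as a dependable skill. How many attempts per task are enough to trust the labels is a separate question this sketch does not address.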
The post also shows why tool-rich scientific agents are difficult to evaluate with purely human-authored rubrics. Anthropic observed that Claude sometimes used strategies different from human experts, including pattern recognition from broad biological knowledge and layered evidence from multiple methods. A method-based grader would risk marking those trajectories wrong because they do not match a human solution path. A conclusion-based grader can tolerate creative routes, but only if the benchmark author has created objective answers and validation notebooks to prove the signal is really present.
That design has broader implications beyond biology. Many valuable agent tasks sit in the same zone: there are multiple valid paths, noisy intermediate signals, and a final answer that can be checked more easily than it can be found. Security investigation, data analysis, incident response, and codebase migration all have versions of this structure. BioMysteryBench is interesting because it operationalizes the “verify easier than solve” principle in a real scientific domain rather than a toy environment.
There is also a caution embedded in the result. A model that can outperform a small panel of experts on some hard tasks is not automatically a reliable scientist. It may have vast prior knowledge and enough tool skill to find shortcuts humans miss, but it can still be inconsistent, over-reliant on prior assumptions, or sensitive to the path it takes through an analysis. The benchmark’s human-difficult set is valuable precisely because it reveals both sides: frontier models are starting to open new scientific workflows, but their strongest-looking wins may be the ones most in need of replication and audit.
Takeaway
Anthropic’s post is a strong example of modern AI evaluation moving from static tests toward agent workspaces. The important idea is not only that Claude can solve bioinformatics tasks, but that the benchmark was designed around objective data properties, open-ended methods, tool use, expert baselines, and repeated attempts, which together make it harder to confuse lucky success with durable capability.
For engineering teams building or evaluating agents, the broader lesson is to make evals resemble the work being delegated. If the product expects an agent to explore data, install tools, write code, consult external resources, and synthesize uncertain evidence, the eval should exercise those same muscles. The next useful capability metrics will not just ask whether a model can answer a question. They will ask whether it can repeatedly navigate a real workspace, choose defensible methods, and leave behind an answer that other experts can verify.