New Benchmark Evaluates AI Agents in Preclinical Drug Discovery

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman · June 18, 2026

Summary

Researchers introduce TxBench-PP, a new benchmark designed to evaluate AI agents' performance in small-molecule preclinical pharmacology. The benchmark tests agents' ability to draw accurate conclusions from real-world assay data, revealing that current AI systems do not reliably recover preclinical pharmacology decisions.

TherapeuticsBench Preclinical Pharmacology (TxBench-PP) assesses AI agents on small-molecule preclinical pharmacology. It aims to give AI in drug discovery a verifiable evaluation framework, testing whether agents can interpret real-world assay data and make sound program decisions rather than rely on memorized information. TxBench-PP comprises 100 evaluations spanning program stages, assay types, and task structures, including mechanism-of-action reasoning, compound-target engagement, and safety assessment. Each evaluation presents a realistic workflow scenario; the agent must inspect files in a coding environment and return a structured answer.

Initial evaluations covered 16 model configurations involving 11 different AI models. No system reliably recovered preclinical pharmacology decisions. The top-performing configuration, Claude Opus 4.8 / Pi, passed only 59.3% of endpoint attempts, leaving substantial room for improvement on complex drug discovery tasks.
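To make the "share of endpoint attempts passed" figure concrete, here is a minimal Python sketch of how such a rate might be computed, and how it differs from a stricter consistency measure. The record layout (configuration, evaluation ID, pass flag), the configuration names, and the outcomes are all hypothetical; the summary does not describe TxBench-PP's actual result format or scoring code.

```python
from collections import defaultdict

# Hypothetical attempt records: (configuration, evaluation_id, passed).
# Names and outcomes are illustrative only, not TxBench-PP data.
attempts = [
    ("config-a", "eval-01", True),
    ("config-a", "eval-01", False),
    ("config-a", "eval-02", True),
    ("config-a", "eval-02", True),
    ("config-b", "eval-01", False),
    ("config-b", "eval-01", False),
    ("config-b", "eval-02", True),
    ("config-b", "eval-02", False),
]


def attempt_pass_rate(records):
    """Fraction of individual attempts that passed, per configuration."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for config, _eval_id, ok in records:
        total[config] += 1
        passed[config] += int(ok)
    return {config: passed[config] / total[config] for config in total}


def consistent_pass_rate(records):
    """Fraction of evaluations where every attempt passed, per configuration."""
    outcomes_by_eval = defaultdict(list)
    for config, eval_id, ok in records:
        outcomes_by_eval[(config, eval_id)].append(ok)
    solved = defaultdict(int)
    total = defaultdict(int)
    for (config, _eval_id), outcomes in outcomes_by_eval.items():
        total[config] += 1
        solved[config] += int(all(outcomes))
    return {config: solved[config] / total[config] for config in total}


print(attempt_pass_rate(attempts))
print(consistent_pass_rate(attempts))
```

The second function is an assumption about what "reliable" could mean in practice: a configuration can pass most individual attempts while still failing to solve the same evaluation every time it is run.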

Why it matters

This benchmark provides a crucial tool for pharmaceutical companies and AI developers to rigorously test and improve AI agents for drug discovery, potentially accelerating the development of new therapeutics. Professionals can use these findings to understand the current limitations of AI in preclinical research and guide future AI integration strategies.

How to implement this in your domain

  1. Integrate TxBench-PP into AI development pipelines for drug discovery to validate model performance.
  2. Focus AI research efforts on improving reasoning capabilities for complex pharmacological data interpretation.
  3. Collaborate with AI researchers to develop more robust AI agents capable of reliable preclinical decision-making.
  4. Utilize the benchmark's structure to identify specific weaknesses in current AI models related to drug discovery tasks.
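The last step, using the benchmark's structure to locate weaknesses, could be sketched as a per-task-type breakdown. The snippet below is a rough illustration in Python: the task-type labels echo the task structures named in the summary, but the record format, the function name `weakest_task_types`, and the pass/fail values are invented for the example and are not taken from TxBench-PP.

```python
from collections import Counter

# Hypothetical results: (task_type, passed). Outcomes are made up.
results = [
    ("mechanism-of-action", True),
    ("mechanism-of-action", False),
    ("target-engagement", True),
    ("target-engagement", True),
    ("safety-assessment", False),
    ("safety-assessment", False),
]


def weakest_task_types(records, limit=2):
    """Return the `limit` task types with the lowest pass rate, worst first."""
    totals = Counter(task_type for task_type, _ in records)
    passes = Counter(task_type for task_type, ok in records if ok)
    rates = {task_type: passes[task_type] / totals[task_type] for task_type in totals}
    # Sort by pass rate, then by name so ties are ordered deterministically.
    return sorted(rates.items(), key=lambda item: (item[1], item[0]))[:limit]


for task_type, rate in weakest_task_types(results):
    print(f"{task_type}: {rate:.0%} passed")
```

The same grouping could be keyed on program stage or assay type instead, since the benchmark is described as varying along those dimensions as well.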

Who benefits

Pharmaceuticals, Biotechnology, Healthcare, AI Research

Key takeaways

  • TxBench-PP is a new benchmark for evaluating AI agents in small-molecule preclinical pharmacology.
  • Current AI systems do not reliably recover preclinical pharmacology decisions; the top configuration passed only 59.3% of endpoint attempts.
  • The benchmark focuses on real-world data interpretation rather than memorized facts.
  • It highlights the need for significant advancements in AI reasoning for drug discovery.

Original post by Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

"arXiv:2606.19245v1 Announce Type: new Abstract: Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce Ther…"
