New Benchmark Evaluates LLM Agents with Scientific Simulators

Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi, Maziar Raissi · July 2, 2026

Summary

Researchers introduce PHREEQC-MCQ-200, a benchmark with 200 multiple-choice questions for evaluating tool-augmented LLM agents on aqueous-geochemistry simulations. The study reveals that while simulator access improves accuracy, it can also lead to regressions, highlighting the need for comprehensive diagnostic evaluation beyond average accuracy.

Large language model (LLM) agents are increasingly connected to scientific software, but it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. To address this, the authors developed a diagnostic benchmark, PHREEQC-MCQ-200. It contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios. To answer them, an agent must interact with a scientific simulator: construct inputs, execute simulations, interpret outputs, and provide an answer.

The evaluation across several LLM families shows that simulator access significantly boosts overall accuracy, confirming that grounded execution is essential for many scientific computation tasks. However, tool-augmented agents sometimes fail on items that the same models answered correctly without tools. These hidden regressions are invisible in average accuracy alone.

The protocol for accessing simulator outputs also matters. A structured "table-of-contents" interface can reduce token costs and maintain or improve accuracy for advanced models, but it can degrade performance for mid-tier models that struggle to navigate complex structured outputs.

The paper advocates a more holistic evaluation of scientific agents, covering item-level retention, output-access sensitivity, trajectory failures, and the point at which the computational chain breaks.
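Item-level retention can be made concrete with a small calculation. The sketch below is one possible formulation, not the paper's own metric definition: it assumes each run is recorded as a mapping from item id to a correct/incorrect flag, and the names `RetentionReport` and `retention_report` are hypothetical. The sample data is invented purely to show the shape of the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionReport:
    accuracy_without_tools: float
    accuracy_with_tools: float
    retention: float   # share of no-tool-correct items that stay correct with tools
    regressions: list  # item ids correct without tools but wrong with tools
    gains: list        # item ids wrong without tools but correct with tools


def retention_report(no_tool: dict, with_tool: dict) -> RetentionReport:
    """Compare per-item correctness (item id -> bool) across two runs."""
    if no_tool.keys() != with_tool.keys():
        raise ValueError("both runs must cover the same items")
    if not no_tool:
        raise ValueError("no items to compare")

    items = sorted(no_tool)
    base_correct = [i for i in items if no_tool[i]]
    regressions = [i for i in base_correct if not with_tool[i]]
    gains = [i for i in items if not no_tool[i] and with_tool[i]]

    # Retention is defined here as 1.0 when nothing was correct to begin with.
    retention = 1.0 if not base_correct else 1 - len(regressions) / len(base_correct)

    return RetentionReport(
        accuracy_without_tools=len(base_correct) / len(items),
        accuracy_with_tools=sum(bool(with_tool[i]) for i in items) / len(items),
        retention=retention,
        regressions=regressions,
        gains=gains,
    )


if __name__ == "__main__":
    # Invented toy data: accuracy rises from 0.5 to 0.75, yet q2 regresses.
    without = {"q1": True, "q2": True, "q3": False, "q4": False}
    with_sim = {"q1": True, "q2": False, "q3": True, "q4": True}
    print(retention_report(without, with_sim))
```

In the toy run, aggregate accuracy improves while retention is only 0.5, which is the kind of regression that an average alone would hide.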

Why it matters

For professionals developing or deploying AI agents in scientific or technical domains, this research provides crucial insights into the complexities of tool integration. It highlights the need for rigorous evaluation methods that go beyond simple accuracy to ensure reliability and prevent unexpected failures.

How to implement this in your domain

1. Adopt comprehensive evaluation metrics beyond aggregate accuracy when developing tool-augmented LLM agents, including item-level retention and failure analysis.
2. Design and test different output-access protocols for scientific tools, considering the capabilities of the LLM models being used.
3. Implement diagnostic logging and tracing within agent trajectories to identify specific points of failure in the computation chain.
4. Prioritize robust error handling and validation mechanisms for LLM agents interacting with external scientific software.
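For the second step, it may help to see what two output-access protocols could look like side by side. The sketch below is an assumption-laden illustration, not the benchmark's interface: it presumes the simulator output has already been split into named sections upstream, and the class `SectionedOutput` and its methods are hypothetical names. It does not call PHREEQC or parse its real output format.

```python
class SectionedOutput:
    """Simulator output held as named sections (splitting is assumed done upstream)."""

    def __init__(self, sections: dict):
        self._sections = dict(sections)

    def full_dump(self) -> str:
        """Protocol A: hand the agent everything at once."""
        return "\n\n".join(
            f"## {name}\n{body}" for name, body in self._sections.items()
        )

    def table_of_contents(self) -> str:
        """Protocol B, step 1: show only section names and sizes."""
        return "\n".join(
            f"{name} ({len(body)} chars)" for name, body in self._sections.items()
        )

    def read_section(self, name: str) -> str:
        """Protocol B, step 2: return one section the agent asked for."""
        try:
            return self._sections[name]
        except KeyError:
            available = ", ".join(self._sections)
            raise KeyError(
                f"unknown section {name!r}; available: {available}"
            ) from None


if __name__ == "__main__":
    # Placeholder section names and bodies, invented for illustration.
    out = SectionedOutput({
        "solution_composition": "x" * 4000,
        "saturation_indices": "y" * 300,
    })
    toc_cost = len(out.table_of_contents()) + len(out.read_section("saturation_indices"))
    print("full dump chars:", len(out.full_dump()))
    print("toc + one section chars:", toc_cost)
```

The table-of-contents path exposes fewer characters when the agent picks the right section, but it adds a navigation step that can go wrong, which is consistent with the reported degradation for mid-tier models. Running both protocols on the same items is one way to measure that sensitivity.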

Who benefits

Scientific Research, Chemical Engineering, Environmental Science, AI/ML Development, Pharmaceuticals

Key takeaways

Tool-augmented LLM agents improve scientific computation accuracy but can introduce new failure modes.
Comprehensive evaluation metrics, beyond average accuracy, are crucial for assessing agent reliability.
The design of output-access protocols significantly impacts agent performance, especially for mid-tier models.
Understanding where the computational chain breaks is vital for debugging and improving scientific agents.
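One way to see where the chain breaks is to record a pass/fail flag per stage for every trajectory and tally the first failing stage. The sketch below assumes the four stages named in the summary (construct inputs, execute, interpret outputs, answer) and a hypothetical trace format of stage name to boolean; how each flag is determined is left to the evaluator and is not specified by the source.

```python
from collections import Counter
from typing import Iterable, Optional

# Stage order follows the workflow described in the summary; the identifiers are illustrative.
STAGES = ("construct_input", "execute", "interpret_output", "answer")


def first_break(trace: dict) -> Optional[str]:
    """Return the first stage not marked as passed, or None if all passed.

    A stage missing from the trace counts as a break, since it was never reached.
    """
    for stage in STAGES:
        if not trace.get(stage, False):
            return stage
    return None


def break_histogram(traces: Iterable[dict]) -> Counter:
    """Count, over all trajectories, the stage where each one first broke."""
    return Counter(b for b in map(first_break, traces) if b is not None)


if __name__ == "__main__":
    # Invented example trajectories.
    traces = [
        {"construct_input": True, "execute": True, "interpret_output": True, "answer": True},
        {"construct_input": True, "execute": False},
        {"construct_input": True, "execute": True, "interpret_output": False, "answer": False},
        {"construct_input": True, "execute": True, "interpret_output": False, "answer": True},
    ]
    print(break_histogram(traces))
```

A histogram like this separates input-construction errors from interpretation errors, which call for different fixes.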

Original post by Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi, Maziar Raissi

"arXiv:2607.00436v1 Announce Type: new Abstract: Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a bench…"
