SciRisk-Bench Evaluates AI4Science Safety Across Disciplines and Risk Dimensions

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng · June 18, 2026

Summary

This paper introduces SciRisk-Bench, a new benchmark designed to evaluate the safety of Large Language Models (LLMs) in AI for Science (AI4Science) workflows. It assesses models across 7 scientific disciplines, 31 subdisciplines, and 10 explicit risk dimensions, providing fine-grained diagnostics for potential unsafe behaviors.

Large Language Models are increasingly used in scientific workflows (AI4Science), in tasks such as scientific question answering and laboratory planning, which makes their safety and their ability to recognize risks critical. Existing safety datasets often lack specificity about the underlying risk dimensions. SciRisk-Bench instead defines 10 explicit risk dimensions and covers 7 scientific disciplines and 31 subdisciplines. Initial evaluations of both mainstream and science-oriented LLMs on the benchmark yield detailed diagnostics that pinpoint where these models may still behave unsafely in high-stakes scientific contexts.

Why it matters

Professionals in scientific research, development, and regulatory roles need to know whether the AI tools they rely on are safe and reliable. SciRisk-Bench gives them a way to identify, by discipline and risk dimension, where LLMs behave unsafely in scientific applications, so that those risks can be mitigated before they cause errors or unintended consequences in research and development.

How to implement this in your domain

  1. Utilize SciRisk-Bench to rigorously evaluate the safety and risk awareness of LLMs used in scientific applications.
  2. Prioritize LLM development that explicitly addresses identified risk dimensions across various scientific disciplines.
  3. Integrate safety benchmarks into the deployment pipeline for AI4Science tools to prevent unsafe model behaviors.
  4. Develop training curricula for scientists and engineers on how to safely interact with and deploy AI in research.
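
As a rough illustration of step 3, a deployment gate could aggregate judged model responses by discipline and risk dimension and block release when any cell exceeds an unsafe-rate threshold. The sketch below is a minimal example under stated assumptions: the record fields, the discipline and risk-dimension labels, the function names, and the 5% threshold are all hypothetical and are not taken from SciRisk-Bench or its paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One judged model response. All field names are hypothetical."""
    discipline: str
    risk_dimension: str
    unsafe: bool


def unsafe_rates(records):
    """Return {(discipline, risk_dimension): fraction of responses judged unsafe}."""
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for r in records:
        key = (r.discipline, r.risk_dimension)
        totals[key] += 1
        unsafe[key] += int(r.unsafe)
    return {key: unsafe[key] / totals[key] for key in totals}


def failing_cells(records, max_unsafe_rate=0.05):
    """List cells whose unsafe rate exceeds the assumed threshold, worst first."""
    rates = unsafe_rates(records)
    failing = [(key, rate) for key, rate in rates.items() if rate > max_unsafe_rate]
    return sorted(failing, key=lambda item: item[1], reverse=True)


def deployment_gate(records, max_unsafe_rate=0.05):
    """Return True only if every cell is at or below the threshold."""
    return not failing_cells(records, max_unsafe_rate)


# Toy data for illustration only; these are not results from the paper,
# and the labels are placeholders rather than the benchmark's categories.
records = [
    EvalRecord("chemistry", "hazardous_synthesis", True),
    EvalRecord("chemistry", "hazardous_synthesis", False),
    EvalRecord("biology", "biosafety", False),
    EvalRecord("biology", "biosafety", False),
]

if __name__ == "__main__":
    for (discipline, dimension), rate in failing_cells(records):
        print(f"FAIL {discipline} / {dimension}: unsafe rate {rate:.0%}")
    print("deploy" if deployment_gate(records) else "block")
```

Gating on the worst cell rather than on an overall average is one possible design choice: it keeps a weakness in a single discipline or risk dimension from being hidden by good performance elsewhere, which is the kind of fine-grained view the benchmark is meant to provide.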

Who benefits

Pharmaceuticals, Biotechnology, Academia, Chemical, Environmental Science

Key takeaways

  • SciRisk-Bench is a new benchmark for evaluating LLM safety in AI4Science workflows.
  • It assesses models across 7 disciplines, 31 subdisciplines, and 10 explicit risk dimensions.
  • The benchmark provides fine-grained diagnostics for identifying unsafe LLM behaviors in scientific contexts.
  • Ensuring AI safety in science is critical for preventing errors and unintended consequences in research.

Original post by Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

"arXiv:2606.18936v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an ur…"
