SafeClawBench Evaluates Tool-Using LLM Agent Security with Staged Harm Metrics

Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang · June 18, 2026

Summary

This paper introduces SafeClawBench, a benchmark designed to evaluate the security of tool-using language model agents by separating semantic attack acceptance, audit-visible harm evidence, and actual sandbox-observed tool/state harm. It provides 600 adversarial tasks across six attack families.

Tool-using language model agents raise security problems that go beyond unsafe text generation: because they interact with external systems, they can disclose sensitive data, modify databases, or trigger harmful code. Existing security evaluations often collapse these risks into a single attack success rate, which hides the stage at which a failure occurs.

SafeClawBench is a staged benchmark built to assess the security of these agents. It contains 600 controlled adversarial tasks covering six attack families, including various forms of prompt injection, memory manipulation, and ambiguity-driven unsafe inference. Results are reported separately for three endpoints: whether the agent semantically accepts an attack, whether there is audit-visible evidence of harm, and whether actual harm is observed in a sandbox environment (e.g., tool execution or state changes). The endpoints capture different failure modes: sandbox harm occurs in a significant number of cases where the semantic check is passed, so textual compliance alone is not a sufficient measure of agent safety.
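To make the three-endpoint idea concrete, here is a minimal Python sketch of how per-task outcomes might be aggregated into separate rates instead of one attack success rate. The `TaskOutcome` record, its field names, the `staged_rates` helper, and the four sample rows are all illustrative assumptions; they are not taken from the paper or from any SafeClawBench code, and the sample data is made up purely to exercise the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    semantic_accept: bool  # the agent's response goes along with the attack
    audit_evidence: bool   # logs or traces show evidence of harm
    sandbox_harm: bool     # the sandbox observed a harmful tool call or state change


def staged_rates(outcomes):
    """Report each endpoint separately rather than as a single success rate."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to score")
    return {
        "semantic_accept_rate": sum(o.semantic_accept for o in outcomes) / n,
        "audit_evidence_rate": sum(o.audit_evidence for o in outcomes) / n,
        "sandbox_harm_rate": sum(o.sandbox_harm for o in outcomes) / n,
        # Harm that a purely semantic check would have missed.
        "harm_without_semantic_accept": sum(
            o.sandbox_harm and not o.semantic_accept for o in outcomes
        ) / n,
    }


# Made-up sample data, not benchmark results.
outcomes = [
    TaskOutcome("t1", semantic_accept=True, audit_evidence=True, sandbox_harm=True),
    TaskOutcome("t2", semantic_accept=False, audit_evidence=False, sandbox_harm=True),
    TaskOutcome("t3", semantic_accept=False, audit_evidence=False, sandbox_harm=False),
    TaskOutcome("t4", semantic_accept=True, audit_evidence=False, sandbox_harm=False),
]
rates = staged_rates(outcomes)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

The `harm_without_semantic_accept` figure is the one a single aggregated rate would hide: task `t2` passes the semantic check yet still produces sandbox harm.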

Why it matters

For teams that develop, deploy, or audit AI agents connected to real-world systems, SafeClawBench offers a framework for locating where a security failure occurs. It moves evaluation beyond textual analysis to assess the operational harm that shows up in tool calls and state changes.

How to implement this in your domain

  1. Adopt SafeClawBench as a standard for evaluating the security posture of tool-using LLM agents in development.
  2. Design agent architectures with explicit mechanisms to log and audit tool interactions and state changes.
  3. Implement multi-layered security policies that address semantic understanding, audit evidence, and sandbox execution.
  4. Train and test agents against diverse adversarial tasks, focusing on the distinct failure modes identified by SafeClawBench.
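The logging and auditing step above can be sketched as a thin wrapper around tool calls that records each call and the state keys it changed. This is a toy in-memory example under stated assumptions: `AuditedSandbox`, the two tool functions, and the `protected` key set are hypothetical names invented for illustration, not part of SafeClawBench or any agent framework.

```python
import copy


class AuditedSandbox:
    """Toy in-memory sandbox (hypothetical): logs every tool call and its state diff."""

    def __init__(self, initial_state):
        self.state = copy.deepcopy(initial_state)
        self.audit_log = []

    def call(self, tool_name, tool_fn, **kwargs):
        # Snapshot the state so the effect of this call can be diffed afterwards.
        before = copy.deepcopy(self.state)
        result = tool_fn(self.state, **kwargs)
        changed = sorted(
            key
            for key in set(before) | set(self.state)
            if before.get(key) != self.state.get(key)
        )
        self.audit_log.append(
            {"tool": tool_name, "args": kwargs, "changed_keys": changed}
        )
        return result


def write_memory(state, key, value):
    """Illustrative tool that mutates state."""
    state[key] = value
    return "ok"


def read_record(state, key):
    """Illustrative read-only tool."""
    return state.get(key)


sandbox = AuditedSandbox({"notes": "", "db_rows": 3})
sandbox.call("read_record", read_record, key="db_rows")
sandbox.call("write_memory", write_memory, key="notes", value="injected instruction")

# Flag any call that modified a key the policy treats as protected (assumed policy).
protected = {"notes"}
violations = [e for e in sandbox.audit_log if protected & set(e["changed_keys"])]
print([v["tool"] for v in violations])
```

Diffing state around each call keeps the harm check independent of what the agent says in its text output, which is the separation the staged endpoints rely on.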

Who benefits

AI Development, Cybersecurity, Robotics, Automation, Cloud Services

Key takeaways

  • Tool-using LLM agents pose complex security risks beyond unsafe text.
  • SafeClawBench evaluates agent security by separating semantic attack acceptance, audit-visible harm evidence, and sandbox-observed harm.
  • It provides 600 adversarial tasks across six attack families.
  • Sandbox harm can occur even when semantic checks pass, requiring comprehensive evaluation.

Original post by Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang

"arXiv:2606.18356v1 Announce Type: cross Abstract: Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Exis…"
