Agent Safety Requires Action Alignment, Not Just Refusal

Shawn Li, Yue Zhao · June 30, 2026

Summary

This paper argues that current content safety methods (refusal) are inadequate for LLM agents, as agentic harm lies in the misalignment between granted and exercised authority, not just output content. It proposes "action alignment" enforced at the action boundary as the correct approach, demonstrating that refusal training can reduce capability without ensuring safety.

This paper re-evaluates how safety is approached for large language model (LLM) agents. It argues that current practice, which largely imports content safety methods such as refusal training from chatbot contexts, is fundamentally flawed. The authors contend that agentic harm is distinct: it does not stem from the content of the model's output but from a misalignment between the authority a user grants an agent and the authority the agent actually exercises through its actions (e.g., calling tools, moving money). The paper offers three lines of evidence: defense-trained models often learn superficial patterns instead of true intent; such training can make multi-step agents collapse prematurely while leaving them vulnerable; and even undefended frontier models can exceed granted authority in normal use. The paper therefore concludes that action safety cannot be embedded solely in the model's weights. Instead, it must be enforced externally at the action boundary under the principle of "least privilege" and evaluated as "action alignment," a relational property that depends on deployment context, rather than as a simple refusal score.
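
As a rough illustration of enforcement at the action boundary, the sketch below checks each tool call an agent proposes against an explicit grant before anything runs. The names `Grant`, `Action`, and `authorize`, and the example tools, are hypothetical and do not come from the paper; the only point is that the check lives outside the model and compares exercised authority with granted authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A tool call the agent proposes (hypothetical shape, not from the paper)."""
    tool: str
    amount: float = 0.0  # e.g. money moved; 0 for non-monetary tools


@dataclass(frozen=True)
class Grant:
    """Authority the user granted for this task (hypothetical shape)."""
    allowed_tools: frozenset
    max_amount: float = 0.0


def authorize(grant: Grant, action: Action) -> bool:
    """Deny by default: allow only actions that stay within the grant."""
    if action.tool not in grant.allowed_tools:
        return False
    return action.amount <= grant.max_amount


grant = Grant(
    allowed_tools=frozenset({"read_invoice", "pay_invoice"}),
    max_amount=100.0,
)
proposed = [
    Action("read_invoice"),
    Action("pay_invoice", amount=80.0),
    Action("pay_invoice", amount=500.0),  # exceeds the granted limit
    Action("delete_records"),             # tool was never granted
]

# The gate runs outside the model, on the actions it proposes.
decisions = [authorize(grant, a) for a in proposed]
print(decisions)
```

Under these assumptions, the last two actions are blocked regardless of how the model's output was phrased, which is the sense in which the check is relational: the same action could be acceptable under a broader grant.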

Why it matters

For professionals developing, deploying, or governing AI agents, this paper reframes agent safety: it urges a move from content-based refusal to a system-level approach that aligns an agent's actions with user intent and granted authority.

How to implement this in your domain

  1. Re-evaluate your AI agent safety frameworks, shifting focus from content refusal to action alignment and least privilege.
  2. Implement external guardrails and authorization layers at the action boundary for all agentic operations.
  3. Develop robust monitoring and auditing systems to track the alignment between user-granted authority and agent actions.
  4. Educate your teams on the distinction between content safety and action safety in the context of AI agents.
  5. Design user interfaces that clearly communicate the scope of an agent's authority and allow for granular control over its actions.
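
The monitoring and auditing step could start from something as small as the following sketch, which compares the tools an agent actually called with the tools the user granted. The function name `audit`, the report fields, and the tool names are assumptions made for illustration; the paper does not prescribe this format.

```python
from collections import Counter


def audit(granted_tools, executed_tools):
    """Compare exercised authority (executed tools) with granted authority."""
    granted = set(granted_tools)
    used = Counter(executed_tools)
    return {
        # Calls to tools the user never authorized.
        "out_of_scope": sorted(t for t in used if t not in granted),
        # Grants that were never needed, a hint that scope could be narrower.
        "unused_grants": sorted(granted - set(used)),
        # How often each tool was called.
        "calls": dict(used),
    }


report = audit(
    granted_tools=["read_calendar", "send_email", "delete_event"],
    executed_tools=["read_calendar", "send_email", "send_email", "transfer_funds"],
)
print(report["out_of_scope"])   # tools used without a grant
print(report["unused_grants"])  # grants that went unused
```

Reporting unused grants alongside out-of-scope calls is a design choice in this sketch: it supports least privilege by showing where granted authority could be reduced, not only where it was exceeded.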

Who benefits

AI Ethics, Cybersecurity, Software Development, Regulatory Bodies, Financial Services

Key takeaways

  • Agent safety is fundamentally different from chatbot content safety.
  • Harm in agents stems from misalignment of authority, not just output content.
  • Refusal training can reduce capability without ensuring agent safety.
  • Action safety requires "least privilege" enforcement at the action boundary and "action alignment" evaluation.
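
To make the contrast between a refusal score and an action-alignment evaluation concrete, here is a toy sketch. The `Episode` record, both metric definitions, and the sample episodes are illustrative assumptions, not the paper's benchmark or data: an episode counts as aligned here only if the authorized task was completed and every executed tool was within the grant.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run (hypothetical record, not the paper's format)."""
    granted: set     # tools the user authorized
    executed: list   # tools the agent actually called
    completed: bool  # whether the authorized task was finished
    refused: bool    # whether the agent refused the request


def refusal_rate(episodes):
    """Fraction of episodes in which the agent refused."""
    return sum(e.refused for e in episodes) / len(episodes)


def action_alignment_rate(episodes):
    """Fraction of episodes completed with every action inside the grant."""
    aligned = [e.completed and set(e.executed) <= e.granted for e in episodes]
    return sum(aligned) / len(aligned)


episodes = [
    Episode({"search", "book_flight"}, ["search", "book_flight"], True, False),
    Episode({"search"}, [], False, True),  # refused a task that was authorized
    Episode({"search"}, ["search", "charge_card"], True, False),  # exceeded grant
    Episode({"send_email"}, ["send_email"], True, False),
]

print(refusal_rate(episodes), action_alignment_rate(episodes))
```

In this made-up data, the refusal in the second episode costs capability without preventing anything harmful, and the third episode exceeds its grant without any refusal being triggered, so the two numbers measure different things.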

Original post by Shawn Li, Yue Zhao

"arXiv:2606.28739v1 Announce Type: new Abstract: Large language models increasingly act as agents: they call tools, move money, delete records, and send messages on a user's behalf. To keep them safe, practitioners imported the chatbot-era recipe (train the model to refuse unsafe…"

