Comparing Methods for Steering LLM Refusal Behavior

Elisabetta Rocchetti, Alfio Ferrara · June 15, 2026

Summary

This paper compares two intervention methods, Difference-in-Means (DiM) and Iterative Nullspace Projection (INLP), for steering refusal behavior in safety fine-tuned chat models. It finds that INLP counterfactual flipping is competitive with DiM directional ablation in suppressing refusal, while nullspace projection is weaker, and explores the geometric differences in their effects on activation space.

This research investigates methods for controlling refusal, the behavior in which a safety-tuned large language model (LLM) declines to answer certain prompts. Building on Arditi et al. (2024), who showed that refusal is mediated by a single linear direction in the residual stream, the study compares two intervention techniques: Difference-in-Means (DiM) and Iterative Nullspace Projection (INLP). The authors apply DiM-based interventions (activation addition and directional ablation) and INLP-derived interventions (nullspace projection and counterfactual flipping) to five open-weight chat models. INLP's counterfactual flipping performs comparably to DiM's directional ablation in suppressing refusal, while INLP's nullspace projection is less effective. The interventions also differ geometrically: nullspace projection tends to collapse activations into the region between the harmful and harmless clusters, whereas counterfactual flipping moves them into the opposite cluster. This suggests that LLMs may encode the absence of a concept differently from its opposite, which points to future research on model interpretability and control.
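To make the two DiM-based interventions concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the activations are synthetic Gaussian clusters rather than a real model's residual stream, the width `d` and the function names are hypothetical, and the formulas follow the common description of directional ablation (subtract the component along the direction) and activation addition (add the direction).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical residual-stream width, chosen only for illustration

# Synthetic stand-ins for activations collected on harmless and harmful prompts.
# Assumption: the harmful cluster is shifted along a single axis.
shift = np.zeros(d)
shift[0] = 4.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + shift


def dim_direction(harmful_acts, harmless_acts):
    """Difference-in-means vector: mean(harmful) - mean(harmless)."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)


def directional_ablation(x, r):
    """Remove the component of each activation along r."""
    r_hat = r / np.linalg.norm(r)
    coeff = np.asarray(x @ r_hat)[..., None]  # works for one vector or a batch
    return x - coeff * r_hat


def activation_addition(x, r, alpha=1.0):
    """Shift each activation by alpha times r."""
    return x + alpha * r


r = dim_direction(harmful, harmless)
r_hat = r / np.linalg.norm(r)

ablated = directional_ablation(harmful, r)   # aims to suppress the "refusal" component
steered = activation_addition(harmless, r)   # aims to induce it

print("mean projection, harmful :", (harmful @ r_hat).mean())
print("mean projection, ablated :", (ablated @ r_hat).mean())
print("mean projection, harmless:", (harmless @ r_hat).mean())
print("mean projection, steered :", (steered @ r_hat).mean())
```

In a real model, functions like these would be applied to residual-stream activations during the forward pass; that wiring is model-specific and is not shown here.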

Why it matters

Controlling LLM behavior, especially refusal in safety-critical applications, is paramount for reliable AI deployment. Professionals working on AI safety, alignment, and fine-tuning can use these insights to develop more precise and tunable methods for steering model responses.

How to implement this in your domain

  1. Experiment with INLP counterfactual flipping to adjust refusal behavior in safety-critical LLMs.
  2. Develop tools to visualize and analyze activation spaces to understand intervention effects.
  3. Integrate these steering techniques into LLM deployment pipelines for dynamic control over model responses.
  4. Research the implications of encoding "absence" versus "opposite" concepts for model interpretability.
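As one possible starting point for the first two steps, the sketch below compares an INLP-style nullspace projection with a counterfactual flip on synthetic data and reports where the edited activations land along the probe direction. Several things here are assumptions rather than the paper's method: the least-squares probe is a dependency-free stand-in for the linear classifiers INLP usually trains, the flip is modeled as a mirror reflection across the probe's decision boundary, and all names and sizes are hypothetical. The two clusters are also placed symmetrically around the origin by construction, so the projected points landing between them is partly a property of this toy setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 300  # hypothetical sizes, chosen only for illustration

# Synthetic clusters placed symmetrically around the origin (a design choice).
offset = np.zeros(d)
offset[0] = 2.0
harmless = rng.normal(size=(n, d)) - offset
harmful = rng.normal(size=(n, d)) + offset
X = np.vstack([harmless, harmful])
y = np.concatenate([-np.ones(n), np.ones(n)])  # -1 harmless, +1 harmful


def fit_probe(X, y):
    """Least-squares linear probe; returns weight vector w and bias b."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]


def nullspace_projection(X, y, n_iters=1):
    """INLP-style loop: fit a probe, project out its direction, repeat."""
    P = np.eye(X.shape[1])
    for _ in range(n_iters):
        w, _ = fit_probe(X @ P, y)
        w_hat = w / np.linalg.norm(w)
        P = P @ (np.eye(len(w_hat)) - np.outer(w_hat, w_hat))
    return P  # apply to row vectors as X @ P


def counterfactual_flip(X, w, b):
    """Assumed form: reflect each row of X across the hyperplane w.x + b = 0."""
    signed = (X @ w + b) / (w @ w)
    return X - 2.0 * signed[:, None] * w


w, b = fit_probe(X, y)
P = nullspace_projection(X, y, n_iters=1)


def score(A):
    """Signed distance of each row from the probe's decision boundary."""
    return (A @ w + b) / np.linalg.norm(w)


s_harmless = score(harmless).mean()
s_harmful = score(harmful).mean()
s_projected = score(harmful @ P).mean()
s_flipped = score(counterfactual_flip(harmful, w, b)).mean()

print("harmless cluster :", s_harmless)
print("harmful cluster  :", s_harmful)
print("harmful projected:", s_projected)
print("harmful flipped  :", s_flipped)
```

The same `score` summary can be computed on activations from a real model to check whether an intervention leaves points between the two clusters or moves them into the opposite one.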

Who benefits

AI Safety, AI Engineering, Content Moderation, Cybersecurity, Conversational AI

Key takeaways

  • INLP counterfactual flipping is effective for suppressing LLM refusal behavior.
  • DiM directional ablation also performs well in steering refusal.
  • Different interventions affect activation space geometrically in distinct ways.
  • Understanding these differences can lead to more tunable and precise LLM control.

Original post by Elisabetta Rocchetti, Alfio Ferrara

"arXiv:2606.13720v1 Announce Type: new Abstract: Arditi et al. (2024) has shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare…"

