Activation Patching Flaws Revealed: Hidden Interaction Effects Impact Interpretability

Sankaran Vaidyanathan, David Arbour, Aaron Mueller, Scott Niekum, David Jensen · June 29, 2026

Summary

This paper reveals that activation patching, a key mechanistic interpretability tool, suffers from hidden interaction effects (INT) that distort causal attribution. These INTs, which measure how a component's effect depends on others, can make components invisible or artificially inflated, explaining faithfulness score instability.

A new research paper highlights a significant limitation in activation patching, a widely used technique for understanding the internal workings of AI models, particularly large language models. The study re-derives the activation patching estimand from causal mediation analysis and discovers that the Natural Indirect Effect (NIE), which is supposed to isolate the causal responsibility of individual model components, also implicitly includes "interaction effects" (INT). These INTs quantify how much a component's causal influence is contingent on the state of other components within the model.

The presence of these hidden interaction effects means that activation patching may not accurately reflect the true causal importance of individual components. The paper demonstrates that components whose importance is conditional on others can either be overlooked entirely or have their significance artificially exaggerated. This phenomenon also accounts for the previously observed instability in faithfulness scores, which are metrics used to evaluate the reliability of interpretability methods.

While attempts to eliminate INTs through estimator adjustments or changes in the unit of analysis prove problematic, the research suggests that INTs are not merely a nuisance. Instead, their magnitude and sign can serve as a diagnostic tool, indicating when causal conclusions are prompt-dependent or when a greedy, NIE-based ranking of components might miss crucial mechanisms that only combinatorial search could uncover. INTs scale with the distance between clean and patched activations and are negligible only when the model is locally affine.
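To make the last point concrete, here is a minimal sketch on a toy two-component function. It uses the standard two-way interaction contrast (the effect of patching component `a` when `b` is clean, minus the same effect when `b` is corrupted) as a stand-in; the paper's exact INT estimand may be defined differently. The function names `interaction_contrast`, `affine`, and `multiplicative` are hypothetical and chosen only for illustration.

```python
def interaction_contrast(f, a_clean, a_corrupt, b_clean, b_corrupt):
    """Two-way interaction contrast (illustrative assumption, not the paper's estimator).

    Measures how much the effect of switching `a` from corrupted to clean
    changes depending on whether `b` is clean or corrupted.
    """
    effect_when_b_clean = f(a_clean, b_clean) - f(a_corrupt, b_clean)
    effect_when_b_corrupt = f(a_clean, b_corrupt) - f(a_corrupt, b_corrupt)
    return effect_when_b_clean - effect_when_b_corrupt


def affine(a, b):
    # Jointly affine toy model: the effect of `a` never depends on `b`.
    return 2.0 * a + 3.0 * b + 1.0


def multiplicative(a, b):
    # Toy model where `a` only matters in proportion to `b`.
    return a * b


# Affine model: the contrast vanishes regardless of the patch distance.
affine_int = interaction_contrast(affine, 1.0, 0.0, 1.0, 0.0)

# Multiplicative model: the contrast grows with the clean/corrupt distance.
near_int = interaction_contrast(multiplicative, 1.0, 0.9, 1.0, 0.9)
far_int = interaction_contrast(multiplicative, 1.0, 0.0, 1.0, 0.0)

print("affine:", affine_int)
print("multiplicative, nearby patch:", near_int)
print("multiplicative, distant patch:", far_int)
```

For the multiplicative toy model the contrast equals the product of the two activation gaps, so a patch that moves activations ten times further produces a contrast one hundred times larger. For the affine model it is zero at any distance.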

Why it matters

For professionals relying on mechanistic interpretability to understand and debug AI models, this research exposes a fundamental flaw in a primary tool, necessitating a more nuanced approach to interpreting model behavior and ensuring reliability.

How to implement this in your domain

  1. Re-evaluate existing interpretability studies that heavily rely on activation patching, considering the potential for hidden interaction effects.
  2. Incorporate diagnostics for interaction effects (INT) when performing activation patching to understand context dependency.
  3. Explore alternative or complementary interpretability methods that are less susceptible to interaction effects.
  4. Develop new interpretability techniques that explicitly account for or model higher-order interactions between model components.
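One way the diagnostic step might look in practice is sketched below on a toy model with three named components, where `a` and `b` act together like an AND gate and `c` contributes independently. Everything here is an assumption for illustration: `toy_model`, `patch_effect`, and `pairwise_interactions` are hypothetical helpers, the activations are scalars, and the pairwise contrast (joint effect minus the sum of single effects) is a generic interaction measure rather than the paper's estimator.

```python
from itertools import combinations

# Hypothetical scalar "activations" for a clean run and a corrupted run.
CLEAN = {"a": 1.0, "b": 1.0, "c": 1.0}
CORRUPT = {"a": 0.0, "b": 0.0, "c": 0.0}


def toy_model(acts):
    # `a` and `b` only matter jointly; `c` has a small independent effect.
    return acts["a"] * acts["b"] + 0.2 * acts["c"]


def patch_effect(base, source, names):
    """Change in output when `names` are overwritten in `base` with values from `source`."""
    patched = dict(base)
    for name in names:
        patched[name] = source[name]
    return toy_model(patched) - toy_model(base)


def pairwise_interactions(base, source):
    """Single-component effects plus joint-minus-sum contrasts for every pair."""
    singles = {name: patch_effect(base, source, [name]) for name in base}
    interactions = {}
    for x, y in combinations(sorted(base), 2):
        joint = patch_effect(base, source, [x, y])
        interactions[(x, y)] = joint - singles[x] - singles[y]
    return singles, interactions


# Patching clean activations into the corrupted run.
denoise_singles, denoise_int = pairwise_interactions(CORRUPT, CLEAN)
# Patching corrupted activations into the clean run.
noise_singles, noise_int = pairwise_interactions(CLEAN, CORRUPT)

print("clean-into-corrupt singles:", denoise_singles)
print("clean-into-corrupt pair contrasts:", denoise_int)
print("corrupt-into-clean singles:", noise_singles)
print("corrupt-into-clean pair contrasts:", noise_int)
```

In this toy setup, patching clean values into the corrupted run gives `a` and `b` zero individual effect, so a greedy single-component ranking would surface only `c`, while the `(a, b)` pair contrast is large. In the other direction, `a` and `b` each appear to account for the whole joint effect, so their individual scores overstate it when summed. A large pair contrast is the signal that a combinatorial search over that pair may be worth the cost.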

Who benefits

AI/ML Development · Cybersecurity · Healthcare (AI diagnostics) · Autonomous Systems · Regulatory Compliance

Key takeaways

  • Activation patching's Natural Indirect Effect includes hidden interaction effects (INT).
  • INTs distort causal attribution, making components appear invisible or inflated.
  • These effects explain the instability of faithfulness scores in interpretability.
  • INTs can be a diagnostic tool for prompt-dependent causal conclusions.

Original post by Sankaran Vaidyanathan, David Arbour, Aaron Mueller, Scott Niekum, David Jensen

"arXiv:2606.27510v1. Abstract: Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the…"

