SAGE Improves LLM Unlearning by Preserving Retained Knowledge

Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang · June 18, 2026

Summary

This paper introduces SAGE, a post-hoc method to sanitize unlearning updates in large language models. It aims to reduce the trade-off between removing undesirable knowledge and retaining essential capabilities.

Large Language Model (LLM) unlearning aims to remove specific, unwanted information or behaviors from a model while preserving its desired knowledge and abilities. Current techniques face an inherent trade-off: stronger forgetting usually comes at the cost of degraded retained capabilities. The authors report that the bias in retention activation serves as a quantifiable measure of the damage an unlearning method inflicts on a model's retained knowledge, independent of the specific unlearning process used. This insight enables a post-hoc approach to restoring retention performance.

The proposed method, SAGE (Spectral Activation-GEometry Sanitization), is a source-agnostic correction applied to the final unlearning update vector. It analyzes real module inputs from a small proxy, identifies the dominant activation geometries, and then optimizes the update to suppress components that align with high-energy retained directions, while leaving the original unlearning method's forgetting mechanism intact. According to the authors, SAGE consistently alleviates the retain-forget trade-off across unlearning methods, model scales, and benchmarks.
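To make the idea of suppressing update components along high-energy retained directions concrete, here is a minimal NumPy sketch. It is not the authors' algorithm: the summary describes an optimization, whereas this sketch swaps in a plain hard projection. It assumes a single linear module whose unlearning update is a matrix `delta_w`, a matrix `retain_inputs` of module inputs collected on retained data, and a cutoff `k`. All of these names, and the function `sanitize_update`, are hypothetical.

```python
import numpy as np

def sanitize_update(delta_w, retain_inputs, k):
    """Project delta_w away from the top-k input directions of the retained activations.

    Assumption: a hard projection stands in for the optimization the paper describes.
    delta_w: (d_out, d_in) weight update; retain_inputs: (n, d_in) module inputs.
    """
    # Right singular vectors of the activation matrix are input-space directions,
    # ordered by energy (squared singular value).
    _, _, vt = np.linalg.svd(retain_inputs, full_matrices=False)
    v_k = vt[:k].T  # (d_in, k), orthonormal columns
    # Subtract the part of the update that acts on those directions.
    return delta_w - (delta_w @ v_k) @ v_k.T

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 200

# Hypothetical retained activations: mostly inside a 3-dimensional subspace, plus small noise.
basis = np.linalg.qr(rng.normal(size=(d_in, 3)))[0]
retain_inputs = 5.0 * rng.normal(size=(n, 3)) @ basis.T + 0.01 * rng.normal(size=(n, d_in))

delta_w = rng.normal(size=(d_out, d_in))  # stand-in for an unlearning update
clean = sanitize_update(delta_w, retain_inputs, k=3)

# Size of the output change on retained inputs, before and after sanitization.
before = np.linalg.norm(retain_inputs @ delta_w.T)
after = np.linalg.norm(retain_inputs @ clean.T)
print(f"output shift on retained inputs: {before:.2f} -> {after:.2f}")
```

In this toy setting the sanitized update barely moves the module's outputs on retained inputs, while its components outside the top-k subspace are left unchanged. Whether forgetting survives in a real model depends on how much of the forget signal lies outside that subspace, which this sketch does not measure.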

Why it matters

Professionals developing or deploying LLMs need robust unlearning mechanisms to comply with data privacy regulations, remove biases, or update models without compromising their core functionality. SAGE offers a practical way to enhance the effectiveness of existing unlearning methods, making models safer and more compliant.

How to implement this in your domain

  1. Evaluate current LLM unlearning pipelines for retention degradation using activation bias metrics.
  2. Integrate SAGE as a post-hoc step to sanitize final unlearning update vectors in existing unlearning workflows.
  3. Test the improved unlearning method with SAGE on various model scales and benchmarks to validate performance.
  4. Develop internal guidelines for applying post-hoc sanitization to ensure model integrity and compliance.
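For the first step, one possible way to put a number on retention degradation is sketched below. The summary does not give the paper's exact definition of activation bias, so this uses an assumed proxy: the mean squared change in a linear module's output on retained inputs caused by an update. The function name `retention_activation_bias` and the toy data are hypothetical.

```python
import numpy as np

def retention_activation_bias(delta_w, retain_inputs):
    """Mean squared output change of a linear module on retained inputs.

    Assumed proxy metric, not the paper's definition.
    delta_w: (d_out, d_in) weight update; retain_inputs: (n, d_in) module inputs.
    """
    shift = retain_inputs @ delta_w.T  # (n, d_out): output change per sample
    return float(np.mean(np.sum(shift ** 2, axis=1)))

# Toy retained inputs that only use the first input coordinate.
retain_inputs = np.array([[2.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0]])

update_a = np.array([[1.0, 0.0, 0.0]])  # acts on the coordinate retained data uses
update_b = np.array([[0.0, 1.0, 0.0]])  # acts on a coordinate retained data never uses

bias_a = retention_activation_bias(update_a, retain_inputs)
bias_b = retention_activation_bias(update_b, retain_inputs)
print(bias_a, bias_b)  # 2.5 0.0
```

Under this proxy, an update that overlaps with directions the retained data actually uses scores high, and one that avoids them scores zero. Comparing the score before and after a sanitization step gives a rough check for steps 2 and 3.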

Who benefits

AI Development, Data Privacy, Compliance, Cloud Services

Key takeaways

  • LLM unlearning faces a trade-off between forgetting unwanted knowledge and retaining desired capabilities.
  • SAGE is a new post-hoc method that improves retention performance after unlearning.
  • It works by sanitizing the final update vector based on spectral activation geometry.
  • SAGE can be applied to various unlearning methods without re-running the original process.

Original post by Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

"arXiv:2606.18309v1 Announce Type: cross Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found tha…"
