SAGE Improves LLM Unlearning by Preserving Retained Knowledge

Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang · June 18, 2026

Summary

SAGE (Spectral Activation-GEometry Sanitization) is a novel post-hoc method for machine unlearning in Large Language Models (LLMs) that sanitizes the final unlearning update vector to mitigate the trade-off between forgetting undesirable knowledge and preserving retained capabilities. It uses retain activation bias to quantify damage and applies a source-agnostic correction to restore retention performance.

Machine unlearning in Large Language Models (LLMs) aims to remove specific, undesirable knowledge or behaviors without compromising the model's general capabilities and retained knowledge. The central difficulty is a trade-off: methods designed to forget often degrade the model's ability to retain other important information. This paper introduces SAGE, a post-hoc sanitization approach that addresses this trade-off without rerunning the original unlearning process.

SAGE uses the "retain activation bias" to quantify the damage an unlearning method inflicts on retained capabilities. It collects real module inputs from a small "retain proxy" and extracts their dominant activation geometry. It then solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the forgetting mechanism of the original unlearning method. Across various unlearning methods, model scales, and benchmarks, SAGE consistently improves the retain-forget trade-off, highlighting the potential of post-hoc vector sanitization.
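The pipeline described above (collect retain inputs, extract their dominant directions, solve a closed-form objective anchored to the original update) can be illustrated with a small NumPy sketch. This is not the paper's exact objective, which this summary does not state. It assumes a single linear module, a ridge-style penalty on the output shift over retain inputs, and a hypothetical strength parameter `lam`; the function name `sanitize_update` is also an illustration only.

```python
import numpy as np


def sanitize_update(delta_w, retain_inputs, lam=1.0):
    """Damp the parts of a linear-layer update that act on high-energy retain directions.

    delta_w:       (d_out, d_in) weight update produced by some unlearning method.
    retain_inputs: (n, d_in) inputs to the same module, collected on a retain proxy.
    lam:           hypothetical penalty strength; lam=0 returns the update unchanged.
    """
    n = retain_inputs.shape[0]
    # Second-moment matrix C of the retain inputs: its eigenvectors are the
    # activation directions, its eigenvalues their energy.
    second_moment = retain_inputs.T @ retain_inputs / n
    energy, directions = np.linalg.eigh(second_moment)
    energy = np.clip(energy, 0.0, None)  # guard against tiny negative round-off

    # Assumed objective: minimize ||D - delta_w||_F^2 + lam * tr(D C D^T).
    # Its closed-form minimizer is D = delta_w (I + lam * C)^{-1},
    # computed here in the eigenbasis of C.
    shrink = 1.0 / (1.0 + lam * energy)
    return ((delta_w @ directions) * shrink) @ directions.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_retain = rng.normal(size=(256, 16))
    x_retain[:, 0] *= 10.0  # one dominant retain direction
    delta_w = rng.normal(size=(8, 16))
    cleaned = sanitize_update(delta_w, x_retain, lam=1.0)
    print("retain output shift before:", np.linalg.norm(x_retain @ delta_w.T))
    print("retain output shift after: ", np.linalg.norm(x_retain @ cleaned.T))
```

In this sketch, directions with little retain energy pass through almost unchanged, which is one way to keep the update close to its source while reducing its effect on retain inputs.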

Why it matters

This research matters for AI systems that must meet data privacy, ethical AI, or regulatory requirements (e.g., the "right to be forgotten"). Because SAGE is applied post hoc, practitioners can reduce the retention damage of an existing unlearning method without rerunning the unlearning process.

How to implement this in your domain

  1. Explore SAGE's post-hoc sanitization for improving the retention-forgetting trade-off in your LLM unlearning pipelines.
  2. Investigate using "retain activation bias" as a metric to quantify and mitigate damage to retained knowledge during unlearning.
  3. Consider applying spectral activation-geometry sanitization to refine update vectors from existing unlearning methods.
  4. Implement strategies to ensure that unlearning processes do not inadvertently degrade the overall performance of your AI models.
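Steps 2 and 4 depend on having a number that tracks retention damage. A minimal sketch follows, assuming that "retain activation bias" can be read as the shift an update causes in a module's output on retain inputs. That reading, the function name `retain_activation_bias`, and the toy arrays are assumptions for illustration, not the paper's definition.

```python
import numpy as np


def retain_activation_bias(delta_w, retain_inputs):
    """Mean squared shift in a module's output on retain inputs caused by delta_w."""
    shift = retain_inputs @ delta_w.T  # (n, d_out): change in the module's output
    return float(np.mean(np.sum(shift ** 2, axis=1)))


# Toy comparison of two hypothetical updates against the same retain inputs.
retain_inputs = np.array([[2.0, 0.0], [0.0, 3.0]])
update_a = np.array([[1.0, 0.0], [0.0, 0.0]])  # acts on the first input direction
update_b = np.array([[0.0, 0.0], [0.0, 1.0]])  # acts on the second input direction
bias_a = retain_activation_bias(update_a, retain_inputs)  # 2.0
bias_b = retain_activation_bias(update_b, retain_inputs)  # 4.5
```

Logging a value like this before and after an unlearning run gives a cheap regression check alongside the usual retain-set benchmarks.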

Who benefits

Data Privacy, Cybersecurity, AI/ML Research, Healthcare, BFSI

Key takeaways

  • LLM unlearning faces a trade-off between forgetting and retaining knowledge.
  • SAGE is a post-hoc method to sanitize unlearning update vectors.
  • It uses retain activation bias to quantify and correct retention damage.
  • SAGE consistently improves the retain-forget trade-off across methods and models.

Original post by Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

"arXiv:2606.18309v1 Announce Type: new Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that…"
