AI Unlearning Methods May Not Truly Erase Data

Teresa Pui Yee Yong, Win Kent Ong, Chee Seng Chan · June 25, 2026

Summary

New research suggests that current machine unlearning (MU) methods, which are usually judged by output forgetting, may not achieve true forgetting in representation space. A model can appear to forget at the output layer while its internal representations still differ, in a structured way, from those of a model retrained without the data.

Machine unlearning (MU) aims to remove the influence of specific data from a trained model, and is typically evaluated by whether the model's outputs no longer reflect the "forgotten" data. This research challenges the sufficiency of output-level evaluation: a model may still retain traces of the unlearned data in its internal representations even when its external behavior suggests forgetting. The study introduces "retraining-consistent representation forgetting" as a more rigorous benchmark, which compares unlearned models against models trained from scratch without the forget data. Across various unlearning methods, datasets, and models, the findings indicate that standard output-level metrics often overstate the success of unlearning.

The analysis reveals that unlearned models frequently exhibit a structured mismatch in representation space relative to truly retrained models. This mismatch is characterized by an asymmetry between forgotten and retained samples, directional inconsistencies, and residual discrepancies concentrated along retraining-related directions. Together, these results suggest that current MU techniques achieve "apparent forgetting" rather than a complete, retraining-consistent erasure of information.
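To make the idea of a representation-level comparison concrete, here is a minimal sketch that scores how closely an unlearned model's features match those of a retrained model, separately for forget and retain samples. It uses linear CKA (centered kernel alignment) as the similarity measure; that choice is an assumption made for illustration, since the summary does not say which metric the paper uses. The feature lists are synthetic stand-ins for real activations, and all function and variable names are hypothetical.

```python
import math
import random

def centered_gram(feats):
    """Gram matrix of the row vectors after subtracting each dimension's mean."""
    n, d = len(feats), len(feats[0])
    means = [sum(row[j] for row in feats) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in feats]
    return [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]

def frobenius_inner(k, l):
    """Sum of elementwise products of two equally sized matrices."""
    return sum(a * b for rk, rl in zip(k, l) for a, b in zip(rk, rl))

def linear_cka(feats_a, feats_b):
    """Linear CKA between two feature sets computed on the same samples (rows)."""
    k, l = centered_gram(feats_a), centered_gram(feats_b)
    return frobenius_inner(k, l) / math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def random_feats(n, d):
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Synthetic stand-ins (assumption): features from a model retrained without the forget data.
retrained_retain = random_feats(20, 4)
retrained_forget = random_feats(20, 4)

# Hypothetical unlearned model: on retain samples its features are a column
# reversal plus rescaling of the retrained ones, which linear CKA ignores.
unlearned_retain = [[2.0 * x for x in reversed(row)] for row in retrained_retain]
# On forget samples its features deviate from the retrained ones by random noise.
unlearned_forget = [[x + rng.gauss(0, 1) for x in row] for row in retrained_forget]

cka_retain = linear_cka(retrained_retain, unlearned_retain)
cka_forget = linear_cka(retrained_forget, unlearned_forget)
asymmetry = cka_retain - cka_forget  # positive: forget samples match retraining less well

print(f"retain CKA={cka_retain:.3f}  forget CKA={cka_forget:.3f}  asymmetry={asymmetry:.3f}")
```

With real models, the feature lists would come from the same layer of both networks evaluated on identical inputs, and an array library would replace the nested loops. A gap between the retain and forget scores is one rough signal of the forget/retain asymmetry described above, not a reproduction of the paper's analysis.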

Why it matters

For professionals building and deploying AI systems, especially in privacy-sensitive domains, understanding the limitations of current machine unlearning is crucial. Relying solely on output-level metrics might lead to false assurances regarding data privacy and compliance, necessitating more robust evaluation methods.

How to implement this in your domain

  1. Re-evaluate existing machine unlearning strategies using representation-level metrics to ensure true data erasure.
  2. Develop new unlearning algorithms that specifically target and remove information from the model's internal representation space.
  3. Integrate retraining-consistent evaluation protocols into the development and auditing of privacy-preserving AI systems.
  4. Educate stakeholders on the distinction between output forgetting and true representation forgetting in AI models.
  5. Prioritize research into more robust and verifiable unlearning mechanisms for sensitive applications.
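As one way to picture how steps 1 and 3 might fit into an audit, the sketch below combines an output-level check with a representation-level check, both measured against a model retrained without the forget data. Everything here is an assumption for illustration: the `UnlearningReport` fields, the `audit` function, the verdict labels, and especially the threshold values are hypothetical and are not taken from the paper. The representation similarity scores can come from whichever metric an audit team adopts.

```python
from dataclasses import dataclass

@dataclass
class UnlearningReport:
    forget_acc_unlearned: float  # unlearned model's accuracy on the forget set
    forget_acc_retrained: float  # same metric for a model retrained without that data
    rep_sim_forget: float        # representation similarity to the retrained model, forget samples
    rep_sim_retain: float        # same similarity, retained samples

def audit(report, acc_tol=0.05, sim_floor=0.9, max_asymmetry=0.05):
    """Classify an unlearning run. All thresholds are hypothetical placeholders."""
    # Output-level check: forget-set accuracy should match the retrained baseline.
    output_ok = abs(report.forget_acc_unlearned - report.forget_acc_retrained) <= acc_tol
    # Representation-level check: forget-sample features should be close to the
    # retrained model's, and not much further off than retain-sample features.
    rep_ok = (
        report.rep_sim_forget >= sim_floor
        and report.rep_sim_retain - report.rep_sim_forget <= max_asymmetry
    )
    if output_ok and rep_ok:
        return "retraining-consistent"
    if output_ok:
        return "apparent forgetting only"
    return "output-level forgetting failed"

# Illustrative numbers only: outputs look forgotten, representations do not.
print(audit(UnlearningReport(0.12, 0.10, 0.62, 0.97)))
```

Comparing against the retrained baseline, rather than against zero accuracy, reflects the retraining-consistent framing: the target is to behave like a model that never saw the data, which may still classify some forget samples correctly.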

Who benefits

Cybersecurity, Healthcare, BFSI, LegalTech, AI Ethics

Key takeaways

  • Output-level forgetting in AI models does not guarantee true data erasure.
  • Models can retain structured traces of "unlearned" data in their internal representations.
  • Current machine unlearning methods may overestimate their effectiveness.
  • More rigorous evaluation, like retraining-consistent representation forgetting, is needed.

Original post by Teresa Pui Yee Yong, Win Kent Ong, Chee Seng Chan

"arXiv:2606.25001v1 Announce Type: new Abstract: Machine unlearning (MU) is commonly judged by output forgetting, such as low forget-set accuracy or reduced logit-level membership inference. But if output-level success can coexist with retraining-inconsistent residuals in represen…"
