ComMem Enhances Vision-Language Model Adaptation with Dual Memory

Guanglong Sun, Shuang Cui, Bo Lei, Liyuan Wang, Zihan Zhai, Hongwei Yan, Hang Su, Jun Zhu, Yi Zhong · June 30, 2026

Summary

ComMem is a test-time adaptation (TTA) approach for vision-language models (VLMs) modeled on biological complementary memory systems. It pairs a fast-adapting visual cache with a slow-integrating textual memory to achieve cross-modal consistency, and the authors report that it outperforms state-of-the-art methods across a range of distribution shifts.

This research introduces ComMem, a framework for improving the adaptability of vision-language models (VLMs) at test time in dynamic, real-world environments. Existing test-time adaptation (TTA) methods often adapt locally without accumulating knowledge over time, or fail to fully exploit the multimodal nature of VLMs. ComMem draws inspiration from the brain's complementary memory systems, specifically the hippocampus and neocortex, to address these limitations. It operates with two distinct yet cooperative memory components: a fast-adapting "detailed memory" that functions as a dynamic visual cache, learning from high-confidence test samples; and a slow-integrating "abstract memory" that continuously refines global textual prototypes. For each new test instance, ComMem jointly optimizes both memory systems to enforce cross-modal consistency. According to the authors, evaluations across 15 benchmark datasets show that ComMem outperforms existing state-of-the-art TTA methods under natural distribution shifts and in cross-dataset generalization.
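To make the two-memory idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name `DualMemory`, the entropy threshold, the per-class cache size, and the exponential-moving-average update of the text prototypes are all assumptions chosen for illustration, since the summary does not specify how either memory is updated.

```python
import math
from collections import defaultdict, deque


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class DualMemory:
    """Hypothetical dual memory: fast visual cache plus slow text prototypes."""

    def __init__(self, text_prototypes, cache_size=3, max_entropy=0.5,
                 momentum=0.99, scale=10.0):
        # Slow memory: one unit-norm text prototype per class.
        self.text = [normalize(p) for p in text_prototypes]
        # Fast memory: a small FIFO of recent confident features per class.
        self.cache = defaultdict(lambda: deque(maxlen=cache_size))
        self.max_entropy = max_entropy  # assumed confidence gate
        self.momentum = momentum        # assumed EMA rate for the slow memory
        self.scale = scale              # assumed logit temperature

    def predict(self, feature):
        """Class probabilities from cosine similarity to the text prototypes."""
        f = normalize(feature)
        logits = [self.scale * sum(a * b for a, b in zip(f, p))
                  for p in self.text]
        return softmax(logits)

    def update(self, feature):
        """Predict, then update both memories only if the sample is confident."""
        f = normalize(feature)
        probs = self.predict(f)
        label = max(range(len(probs)), key=probs.__getitem__)
        if entropy(probs) <= self.max_entropy:
            # Fast path: store the visual feature immediately.
            self.cache[label].append(f)
            # Slow path: nudge the text prototype by a small EMA step.
            m = self.momentum
            blended = [m * p + (1 - m) * x
                       for p, x in zip(self.text[label], f)]
            self.text[label] = normalize(blended)
        return label, probs


memory = DualMemory(text_prototypes=[[1.0, 0.0], [0.0, 1.0]])
label, probs = memory.update([0.9, 0.1])  # confident sample: both memories change
memory.update([1.0, 1.0])                 # ambiguous sample: neither memory changes
```

The asymmetry between the two update paths is the point of the sketch: the cache changes by a whole entry per confident sample, while each prototype moves by only a small fraction per step.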

Why it matters

Professionals developing or deploying VLMs can leverage ComMem to create more robust and adaptable AI systems that maintain performance despite real-world data shifts, reducing the need for constant retraining and improving reliability.

How to implement this in your domain

  1. Explore integrating ComMem's dual-memory architecture into your VLM deployment pipeline for enhanced test-time adaptation.
  2. Design your VLM systems to incorporate both fast-adapting local caches and slow-integrating global knowledge bases.
  3. Experiment with joint optimization strategies for multimodal memory systems to ensure cross-modal consistency.
  4. Evaluate the performance of your VLMs under various distribution shifts and consider ComMem's approach for improving robustness.
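As a starting point for steps 3 and 4, the sketch below shows one way to measure agreement between a text-based prediction and a cache-based prediction, and to fuse the two. The symmetric KL divergence, the additive fusion rule, the weight `alpha`, and all function names are illustrative assumptions; the summary does not state which consistency objective or fusion rule ComMem uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q), with eps guarding against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_consistency(text_probs, cache_probs):
    """Symmetric KL between the two predictions; 0 means they agree exactly."""
    return 0.5 * (kl(text_probs, cache_probs) + kl(cache_probs, text_probs))


def fuse(text_logits, cache_logits, alpha=0.5):
    """Add cache logits, weighted by alpha, to text logits, then normalize."""
    return softmax([t + alpha * c for t, c in zip(text_logits, cache_logits)])


def accuracy(predictions, labels):
    """Fraction of predicted class indices that match the labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy numbers for illustration only, not results from the paper.
text_probs = softmax([2.0, 1.0])
cache_probs = softmax([0.0, 4.0])
disagreement = cross_modal_consistency(text_probs, cache_probs)
fused = fuse([2.0, 1.0], [0.0, 4.0], alpha=0.5)
```

A quantity like `disagreement` could serve as a loss term during joint optimization or simply as a monitoring signal, and `accuracy` can be computed separately on in-distribution and shifted test sets to quantify the robustness gap.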

Who benefits

Computer Vision, Natural Language Processing, Robotics, E-commerce, Healthcare

Key takeaways

  • ComMem improves VLM test-time adaptation using a dual-memory system.
  • It mimics the biological hippocampus (fast visual cache) and neocortex (slow textual memory).
  • The framework jointly optimizes memories for cross-modal consistency.
  • ComMem significantly outperforms existing methods on distribution shifts.

Original post by Guanglong Sun, Shuang Cui, Bo Lei, Liyuan Wang, Zihan Zhai, Hongwei Yan, Hang Su, Jun Zhu, Yi Zhong

"arXiv:2606.28719v1 Announce Type: new Abstract: Test-time adaptation (TTA) of vision-language models (VLMs) is essential for their robust deployment in dynamic, real-world environments. However, existing TTA methods often adapt locally without accumulating knowledge over time, or…"

