MER-R1 Improves Multimodal Emotion Recognition with Slow-Fast Thinking

Zhiyuan Han, Beier Zhu, Wenwen Tong, Chengwei Qin, Xinyi Wang, Jiayu Zhang, Jiangnan Chen, Hewei Guo, Dongchuan Ran, Lewei Lu, Xun Yang · June 29, 2026

Summary

This paper introduces MER-R1, a reinforcement learning framework that improves multimodal emotion recognition by combining "slow thinking" (deliberative reasoning) with "fast thinking" (direct intuition). It optimizes recall and precision as separate signals and calibrates confidence between the two modes, reaching state-of-the-art performance and making explicit reasoning useful for emotion recognition rather than a cost to accuracy.

In multimodal emotion recognition (MER), adding explicit reasoning to multimodal large language models (MLLMs) does not necessarily improve accuracy, even though it makes predictions more interpretable. The authors observe that, for reasoning-based MLLMs, "fast thinking" (triggering a direct answer) often outperforms "slow thinking" (deliberative reasoning). Their analysis attributes this to a recall-precision split: fast thinking boosts recall by making broader, more confident predictions, while slow thinking improves precision by conservatively filtering out incorrect categories.

Building on these observations, the paper proposes MER-R1, a reinforcement learning framework designed to exploit the complementary strengths of the two modes. It has two components. Dual-objective disentanglement separates recall and precision into distinct optimization signals, so that they are jointly optimized rather than traded off against each other. Slow-fast confidence calibration aligns the final slow-thinking answer with the initial fast-thinking intuition, strengthening correct emotion predictions while suppressing incorrect ones. The authors also give a theoretical justification that this combination mitigates variance-induced interference during optimization.

Experiments on MER-UniBench and MME-Emotion show MER-R1 achieving state-of-the-art performance, which the authors take as evidence that reasoning can benefit emotion recognition when slow and fast thinking are integrated.
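
To make the recall/precision contrast concrete, here is a minimal Python sketch that scores two hypothetical answers for a single sample. The label sets and the `precision_recall` helper are illustrative assumptions, not data or code from the paper; they only show how a broad "fast" answer and a conservative "slow" answer land on opposite sides of the trade-off.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall for one sample's emotion labels."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ground truth and model answers (assumed for illustration).
gold = {"anger", "frustration", "disappointment"}
fast_answer = {"anger", "frustration", "disappointment", "sadness", "fear"}  # broad
slow_answer = {"anger"}  # conservative

fast_p, fast_r = precision_recall(fast_answer, gold)
slow_p, slow_r = precision_recall(slow_answer, gold)

print(f"fast: precision={fast_p:.2f} recall={fast_r:.2f}")  # 0.60 / 1.00
print(f"slow: precision={slow_p:.2f} recall={slow_r:.2f}")  # 1.00 / 0.33
```

In this toy case the fast answer covers every gold label but includes two wrong ones, while the slow answer is fully correct but misses two gold labels.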

Why it matters

For professionals building AI systems that interact with people or analyze human behavior, MER-R1 reports state-of-the-art multimodal emotion recognition accuracy while retaining the interpretability of explicit reasoning. Both properties matter for applications in customer service, mental health, and human-robot interaction.

How to implement this in your domain

  1. Evaluate existing multimodal AI systems for emotion recognition capabilities and identify areas for improvement.
  2. Investigate integrating "slow-fast thinking" paradigms into AI models for complex decision-making tasks.
  3. Explore dual-objective optimization techniques to balance recall and precision in AI model training.
  4. Apply confidence calibration methods to align model outputs with underlying intuitive predictions.
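
The dual-objective step in the list above can be illustrated with a small sketch. The summary does not give the paper's exact formulation, so this is one plausible reading under stated assumptions: rewards are normalized within a group of sampled responses (as in GRPO-style training), and "disentanglement" is modeled as normalizing the recall and precision rewards separately before adding them. The function names and reward values are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization within one group of samples."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def entangled_advantages(recalls, precisions):
    """Baseline: sum the two rewards first, then normalize the sum."""
    return normalize([r + p for r, p in zip(recalls, precisions)])


def disentangled_advantages(recalls, precisions):
    """Assumed variant: normalize each objective on its own, then add."""
    rec_adv = normalize(recalls)
    prec_adv = normalize(precisions)
    return [r + p for r, p in zip(rec_adv, prec_adv)]


# Hypothetical rewards for four sampled responses to the same input.
# Recall varies a lot across samples; precision varies only slightly.
recalls = [1.0, 0.0, 1.0, 0.0]
precisions = [0.5, 0.5, 0.6, 0.6]

entangled = entangled_advantages(recalls, precisions)
disentangled = disentangled_advantages(recalls, precisions)

# Samples 0 and 2 have equal recall and differ only in precision.
print("entangled gap:   ", round(entangled[2] - entangled[0], 3))
print("disentangled gap:", round(disentangled[2] - disentangled[0], 3))
```

When the rewards are summed first, the high-variance recall term dominates the normalization and the precision difference between samples 0 and 2 nearly vanishes. Normalizing each objective separately keeps both signals at a comparable scale, which is one way to picture the variance-induced interference the summary mentions.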

Who benefits

Customer Service, Healthcare, Robotics, Automotive, Media & Entertainment

Key takeaways

  • Explicit reasoning in MLLMs doesn't always improve emotion recognition accuracy.
  • "Fast thinking" boosts recall, while "slow thinking" enhances precision.
  • MER-R1 synergizes these two thinking styles for state-of-the-art performance.
  • The framework uses dual-objective optimization and confidence calibration to improve accuracy.
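
As a rough illustration of the confidence calibration idea, the sketch below scores a slow-thinking answer against fast-thinking confidences. The paper's actual calibration objective is not given in this summary; the `calibration_reward` function, its signed-confidence scoring rule, and the confidence values are all assumptions chosen only to show "strengthen correct, suppress incorrect" in code.

```python
def calibration_reward(fast_confidence, slow_answer, gold):
    """Hypothetical reward: add the fast-thinking confidence for each
    correct label in the slow answer, subtract it for each wrong label,
    and average over the labels in the slow answer."""
    if not slow_answer:
        return 0.0
    total = 0.0
    for label in slow_answer:
        conf = fast_confidence.get(label, 0.0)
        total += conf if label in gold else -conf
    return total / len(slow_answer)


# Assumed fast-thinking confidences for one sample (illustrative only).
fast_confidence = {"anger": 0.9, "sadness": 0.7, "fear": 0.2}
gold = {"anger"}

reward_correct = calibration_reward(fast_confidence, {"anger"}, gold)
reward_mixed = calibration_reward(fast_confidence, {"anger", "sadness"}, gold)

print(round(reward_correct, 2))  # 0.9
print(round(reward_mixed, 2))    # 0.1
```

Under this toy rule, a slow answer that keeps a confidently wrong fast-thinking label ("sadness") is penalized, so training pressure would push that confidence down while reinforcing the correct label.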

Original post by Zhiyuan Han, Beier Zhu, Wenwen Tong, Chengwei Qin, Xinyi Wang, Jiayu Zhang, Jiangnan Chen, Hewei Guo, Dongchuan Ran, Lewei Lu, Xun Yang

"arXiv:2606.27652v1. Abstract: We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by…"

