
COMPASS Improves Multimodal AI Composition Understanding and Generation

Ziqi Zhou, Weize Quan, Mining Tan, Zhihan Chen, Dandan Zheng, Jingdong Chen, Jun Zhou, Weiming Dong, Dong-Ming Yan · June 30, 2026

Summary

COMPASS is a new unified multimodal framework that enhances AI's ability to understand and control visual composition in image generation, using a shared "expert token" for both perception and generation. It introduces a large dataset, Comp-11, for systematic learning and evaluation of composition.

COMPASS is a multimodal framework for interpreting and generating visual composition, meaning where subjects are placed and how a scene is organized. Unlike previous systems, it handles composition perception and generation in a single system, anchored by a shared "expert token" that carries compositional intent. During perception, the token distills the compositional expertise the model infers from an image; during generation, the same token guides the denoising process so that layout is controlled explicitly. For training and evaluation, the researchers built Comp-11, a dataset with an 11-class composition taxonomy and reasoning-augmented annotations. In the reported experiments, COMPASS improves category-level composition understanding and produces images that are more compositionally consistent and prompt-faithful than strong baselines.
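The idea of one token being written during perception and read during generation can be pictured with a toy sketch. This is not the paper's architecture: the attention pooling, the linear "denoising" update, and every name here (`perceive`, `denoise_step`, `guidance`) are assumptions made purely for illustration, and a real system would use learned networks for both passes.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def perceive(expert_query, patch_features):
    """Perception pass (toy): the expert token attends over image patch
    features and returns their attention-weighted average."""
    dim = len(expert_query)
    weights = softmax([dot(expert_query, p) / math.sqrt(dim) for p in patch_features])
    return [sum(w * p[i] for w, p in zip(weights, patch_features)) for i in range(dim)]

def denoise_step(noisy, expert_state, guidance=0.1):
    """Generation pass (toy): nudge the noisy latent toward the vector the
    perception pass produced. A stand-in for a learned, conditioned denoiser."""
    return [x + guidance * (e - x) for x, e in zip(noisy, expert_state)]

random.seed(0)
dim = 4
expert_query = [random.gauss(0, 1) for _ in range(dim)]            # hypothetical token
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]

# Perception writes the shared state; generation reads the same state.
expert_state = perceive(expert_query, patches)
initial_latent = [random.gauss(0, 1) for _ in range(dim)]
latent = list(initial_latent)
for _ in range(50):
    latent = denoise_step(latent, expert_state)
```

The point of the sketch is only the data flow: the same `expert_state` vector is the output of the perception call and the conditioning input of every generation step, which is the bridging role the summary attributes to the expert token.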

Why it matters

Professionals in creative industries or those developing AI-powered design tools can leverage this advancement to achieve more precise and controllable visual outputs, reducing manual iteration and improving creative workflows.

How to implement this in your domain

1. Explore integrating COMPASS-like architectures for enhanced control in generative AI art or design platforms.
2. Utilize the principles of the Comp-11 dataset to develop more structured and annotated datasets for specific compositional needs.
3. Experiment with shared "expert tokens" or similar intent anchors in your own multimodal models to bridge perception and generation tasks.
4. Evaluate current generative AI tools for their compositional consistency and identify areas where COMPASS's approach could offer improvements.
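For the evaluation step, one simple starting point is a match rate between the composition class requested in each prompt and the class a composition classifier assigns to the generated image. The sketch below assumes you already have both label lists; the function name `composition_consistency` and the label strings are hypothetical and are not taken from the Comp-11 taxonomy or the paper's metrics.

```python
from collections import Counter

def composition_consistency(requested, predicted):
    """Return (overall, per_class) match rates between the composition class
    requested in the prompt and the class predicted for the generated image."""
    if len(requested) != len(predicted):
        raise ValueError("requested and predicted must have the same length")
    if not requested:
        raise ValueError("need at least one sample")
    totals = Counter(requested)
    hits = Counter(r for r, p in zip(requested, predicted) if r == p)
    # Counter returns 0 for classes with no hits, so every requested class appears.
    per_class = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / len(requested)
    return overall, per_class

# Hypothetical labels, for illustration only.
requested = ["centered", "diagonal", "centered", "symmetric"]
predicted = ["centered", "centered", "centered", "symmetric"]
overall, per_class = composition_consistency(requested, predicted)
print(overall)    # 0.75
print(per_class)  # {'centered': 1.0, 'diagonal': 0.0, 'symmetric': 1.0}
```

The per-class breakdown matters more than the overall number here: a tool can score well on average while failing consistently on one composition type, as the `diagonal` class does in this example.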

Who benefits

Creative Arts, Advertising, Gaming, Product Design, E-commerce

Key takeaways

COMPASS unifies composition perception and generation in multimodal AI.
It uses a shared "expert token" for consistent intent control.
The Comp-11 dataset supports systematic composition learning.
The framework significantly improves compositional understanding and generation quality.

Original post by Ziqi Zhou, Weize Quan, Mining Tan, Zhihan Chen, Dandan Zheng, Jingdong Chen, Jun Zhou, Weiming Dong, Dong-Ming Yan

"arXiv:2606.28696v1 Announce Type: new Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such…"

