Cross-Modal Representation Alignment Improves Time-to-Event Prediction in Healthcare.

Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee · June 16, 2026

Summary

This research introduces a foundation model-driven framework for aligning CT imaging and longitudinal EHR data to improve time-to-event (TTE) prediction in clinical settings. It systematically analyzes various fusion strategies, finding that task-aware multimodal alignment is crucial for robust generalization across different tasks and institutions.

Predicting time-to-event (TTE) outcomes from multimodal clinical data, such as CT imaging and electronic health records (EHR), is difficult because of modality imbalance and distribution shift. This study proposes a foundation model-driven framework that aligns cross-modal representations to improve generalization across clinical tasks and institutions. The framework encodes CT and EHR data independently with domain-specific foundation models, then aligns the two in a shared latent space using four alternative fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention.

The researchers evaluated the framework on two TTE tasks, pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, using large-scale multi-institutional cohorts. When both modalities contributed comparably, the concordance index improved by 1.5-5.4% over unimodal baselines. Contrastive multimodal fusion, particularly with CLMBR representations, gave the most consistent and statistically robust gains, especially for PE mortality. For major adverse cardiovascular events (MACE), cross-attention achieved the highest internal performance, while image-guided co-attention achieved the best external performance. These findings indicate that no single fusion strategy works best everywhere, and they establish task-aware multimodal alignment as a design principle for robust generalization and scalable clinical deployment.
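To make the idea of contrastive alignment concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired CT and EHR embeddings. This summary does not state which contrastive objective the authors used, so the loss form, the `temperature` value, the function names, and the toy vectors are all assumptions for illustration. In practice the embeddings would come from the domain-specific foundation models rather than hand-written lists.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(row, k):
    # Log-softmax of row evaluated at index k, with max-subtraction for stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[k] - lse

def contrastive_alignment_loss(ct_emb, ehr_emb, temperature=0.1):
    """Symmetric InfoNCE: row i of ct_emb and row i of ehr_emb are the same patient."""
    ct = [l2_normalize(v) for v in ct_emb]
    ehr = [l2_normalize(v) for v in ehr_emb]
    n = len(ct)
    # sim[i][j] is the temperature-scaled cosine similarity of CT i and EHR j.
    sim = [[sum(a * b for a, b in zip(ct[i], ehr[j])) / temperature
            for j in range(n)] for i in range(n)]
    # CT -> EHR direction: each CT embedding should pick out its own EHR embedding.
    ct_to_ehr = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # EHR -> CT direction: same computation on the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    ehr_to_ct = -sum(log_softmax_at(sim_t[j], j) for j in range(n)) / n
    return 0.5 * (ct_to_ehr + ehr_to_ct)

# Toy embeddings (hypothetical): two patients, two dimensions.
ct = [[1.0, 0.0], [0.0, 1.0]]
ehr_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each EHR vector points the same way as its CT vector
ehr_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # pairs are swapped

loss_aligned = contrastive_alignment_loss(ct, ehr_aligned)
loss_shuffled = contrastive_alignment_loss(ct, ehr_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is low when each patient's CT and EHR embeddings are closer to each other than to any other patient's, which is the sense in which the two modalities share a latent space.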

Why it matters

For healthcare professionals and AI developers in medicine, this framework offers a powerful approach to improve the accuracy and generalizability of prognostic models using multimodal patient data. This can lead to more precise risk stratification, better treatment planning, and ultimately, improved patient outcomes.

How to implement this in your domain

1. Explore integrating this cross-modal alignment framework into your clinical predictive modeling pipelines.
2. Evaluate different fusion strategies (e.g., contrastive, cross-attention) based on the specific time-to-event prediction task.
3. Leverage domain-specific foundation models for encoding diverse clinical data modalities like imaging and EHR.
4. Design task-aware multimodal alignment strategies to ensure robust generalization across various patient cohorts and institutions.
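The evaluation step in the list above can be sketched as a per-task comparison of fusion strategies by concordance index. The function below is a plain implementation of Harrell's C-index that ignores tied event times. Whether the authors used this exact variant is not stated in this summary, and the follow-up times, event flags, and risk scores are made-up toy values, not results from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs where the higher risk has the earlier event."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a comparable pair needs an observed event at the earlier time
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort (hypothetical): follow-up time and event flag (1 = event, 0 = censored).
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]

# Hypothetical risk scores from two fusion strategies on the same patients.
scores = {
    "late_fusion": [0.9, 0.4, 0.5, 0.3, 0.1],
    "contrastive": [0.9, 0.7, 0.5, 0.3, 0.1],
}

results = {name: concordance_index(times, events, r) for name, r in scores.items()}
best = max(results, key=results.get)
print(results, best)
```

Running the same comparison separately for each task and for internal and external cohorts is what allows a task-specific choice of strategy, rather than fixing one strategy up front.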

Who benefits

Healthcare
Pharmaceuticals
Medical Devices
Health Insurance

Key takeaways

A foundation model-driven framework aligns CT imaging and EHR data for TTE prediction.
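Multimodal fusion improves the concordance index by 1.5-5.4% over unimodal baselines when both modalities contribute comparably.
Contrastive alignment is the most consistent strategy for PE mortality, while cross-attention (internal) and image-guided co-attention (external) lead for MACE.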
Task-aware multimodal alignment is crucial for robust generalization in clinical AI.

Original post by Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

"arXiv:2606.15038v1 Announce Type: new Abstract: Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment be…"
