MOLAR Learns Multimodal Molecular Representations Despite Noisy Labels.

Yingxu Wang, Kunyu Zhang, Nan Yin, Yu Li, Eran Segal · June 18, 2026

Summary

MOLAR is a noise-aware framework for learning multimodal molecular representations from the noisy labels that are common in molecular property prediction. It separates clean-property inference from label observation, derives posterior label reliability and modality-specific evidence, and outperforms representative baselines on noisy benchmarks.

Molecular property prediction often relies on labels derived from assays, curated databases, or weak annotation pipelines, and these labels are frequently noisy. A model that treats recorded labels as perfectly reliable risks memorizing corrupted annotations and learning misleading molecular features. The problem is worse in multimodal molecular representation learning, where label errors can propagate across modalities, such as molecular graphs and text, during fusion or alignment.

To address this, the authors propose MOLAR, a framework for learning multimodal molecular representations under noisy labels. MOLAR separates two things: inferring the latent clean property, and observing the recorded, potentially noisy label. Graph and text inputs each contribute evidence toward a distribution over the clean property, and a separate categorical label-observation channel maps that distribution to the recorded labels used in training. From this formulation the model derives both posterior label reliability and modality-specific molecular evidence.

In evaluations on benchmarks with naturally noisy molecular data and on controlled label-flipping scenarios, MOLAR consistently outperforms representative baselines. Visualization analyses further show that it provides interpretable diagnostics of label reliability and of the evidence contributed by each modality.
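The split between clean-property inference and label observation can be written as a standard noise-channel model. The equations below are a sketch under stated assumptions, not the paper's own formulas, which this summary does not reproduce. The notation is assumed: x_g and x_t are the graph and text inputs, y is the latent clean property over K classes, \tilde{y} is the recorded label, and T is a K-by-K matrix describing the categorical observation channel. Two further assumptions are made for illustration: the two modalities are combined as a product of experts, and the recorded label depends on the inputs only through the clean property.

```latex
% Illustrative notation, not taken from the article:
%   x_g, x_t  graph and text inputs
%   y         latent clean property, \tilde{y} recorded label, K classes
%   T_{kj}    = p(\tilde{y} = j \mid y = k), the label-observation channel
\begin{align}
p(y \mid x_g, x_t) &\propto p_g(y \mid x_g)\, p_t(y \mid x_t)
  && \text{(assumed product-of-experts fusion)} \\
p(\tilde{y} \mid x_g, x_t) &= \sum_{k=1}^{K} T_{k\tilde{y}}\; p(y = k \mid x_g, x_t)
  && \text{(likelihood of the recorded label)} \\
p(y = k \mid \tilde{y}, x_g, x_t) &=
  \frac{T_{k\tilde{y}}\; p(y = k \mid x_g, x_t)}
       {\sum_{k'=1}^{K} T_{k'\tilde{y}}\; p(y = k' \mid x_g, x_t)}
  && \text{(Bayes' rule)} \\
r &= p(y = \tilde{y} \mid \tilde{y}, x_g, x_t)
  && \text{(posterior reliability of the recorded label)}
\end{align}
```

Under this reading, training maximizes the second line on the recorded labels, while the third and fourth lines are what a reliability diagnostic would report: a low r means the model's clean-property evidence disagrees with what was recorded.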

Why it matters

For drug discovery and materials science, where property labels come from imperfect assays and databases, MOLAR offers a way to train predictive models that are less prone to memorizing corrupted labels, making noisy real-world data more usable.

How to implement this in your domain

1. Assess the level of label noise in your molecular property prediction datasets.
2. Consider adopting noise-aware frameworks like MOLAR for multimodal molecular representation learning.
3. Implement mechanisms to separate latent clean-property inference from recorded-label observation in your models.
4. Utilize MOLAR's diagnostic capabilities to understand label reliability and modality-specific evidence.
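Steps 3 and 4 can be prototyped in a few lines before committing to a full model. The following is a minimal NumPy sketch, not MOLAR's actual implementation: the function names, the product-of-experts fusion of graph and text logits, and the fixed transition matrix are all assumptions made for illustration. In a real system the logits would come from trained graph and text encoders, and the transition matrix would be learned or estimated.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def clean_property_probs(graph_logits, text_logits):
    # Assumption: each modality contributes independent evidence, fused as a
    # product of experts (sum of log-probabilities, then renormalize).
    fused = log_softmax(graph_logits) + log_softmax(text_logits)
    return np.exp(log_softmax(fused))

def recorded_label_nll(p_clean, transition, recorded):
    # transition[i, j] = P(recorded = j | clean = i), the observation channel.
    # The training loss is computed on the recorded labels, not the clean ones.
    p_recorded = p_clean @ transition
    rows = np.arange(len(recorded))
    return float(-np.log(p_recorded[rows, recorded]).mean())

def label_reliability(p_clean, transition, recorded):
    # Bayes' rule: P(clean = k | recorded, x) is proportional to
    # P(recorded | clean = k) * P(clean = k | x).
    joint = p_clean * transition[:, recorded].T  # shape (n_samples, n_classes)
    posterior = joint / joint.sum(axis=1, keepdims=True)
    # Reliability = posterior probability that the recorded label is correct.
    return posterior[np.arange(len(recorded)), recorded]

# Toy data (hypothetical): two molecules, both modalities favor class 0,
# but the second molecule was recorded as class 1.
graph_logits = np.array([[2.0, 0.0], [2.0, 0.0]])
text_logits = np.array([[1.0, 0.0], [1.0, 0.0]])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
recorded = np.array([0, 1])

p_clean = clean_property_probs(graph_logits, text_logits)
reliability = label_reliability(p_clean, transition, recorded)
print("training loss:", recorded_label_nll(p_clean, transition, recorded))
print("label reliability:", reliability)
```

In this toy case the second molecule receives a much lower reliability score than the first, because both modalities point to class 0 while the recorded label says class 1. Inspecting the per-modality probabilities alongside the fused ones gives a rough analogue of the modality-specific evidence the article describes.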

Who benefits

Pharmaceuticals, Biotechnology, Materials Science, Chemical Engineering, Healthcare

Key takeaways

MOLAR is a framework for learning multimodal molecular representations from noisy labels.
It separates clean-property inference from recorded-label observation to mitigate noise.
The framework derives posterior label reliability and modality-specific molecular evidence.
MOLAR consistently outperforms baselines on noisy molecular benchmarks.

Original post by Yingxu Wang, Kunyu Zhang, Nan Yin, Yu Li, Eran Segal

"arXiv:2606.18390v1 Announce Type: new Abstract: Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biolo…"


