IMCBench: New Benchmark for Multimodal Medical LLMs

Maria Xenochristou, Ashutosh Joshi, Korosh Vatanparvar, Mohammad Abuzar Hashemi, Prasad Kasu, Deepak Bansal, Anchal Nema, Nivedita Wadhwa, Prashams S Jain, Rebecca Abraham, Will Kimbrough, Dilek Hakkani-Tur, Wilko Schulz-Mahlendorf · June 30, 2026


Summary

Researchers introduce IMCBench, a benchmark for evaluating multimodal LLMs in image-grounded, multi-turn medical conversations, built from real clinical images and synthetic patient profiles. The benchmark assesses models on safety, accuracy, and appropriate uncertainty. Claude Opus 4.6 leads overall, but no model excels across all dimensions, and safety degrades for malignant and rare conditions.

IMCBench is a benchmark for evaluating multimodal large language models (LLMs) in medical conversations that involve both text and images. Existing medical AI benchmarks often fall short in one of two ways: they support multi-turn dialogue without image input, or they pair images with single-turn questions only. IMCBench addresses this gap by simulating patient-clinician interactions, pairing publicly available clinical images with synthetic patient profiles to create multi-turn scenarios. The evaluation protocol scores models on three clinical dimensions: safety, accuracy, and the appropriate use of uncertainty in diagnoses.

The study benchmarked eight frontier multimodal models, including models from the Claude, GPT, Nova, and Llama families. Claude Opus 4.6 achieved the highest overall score, but no single model led on every evaluation criterion. Safety performance degraded for both malignant and rare conditions. Ablation studies showed that both visual input and electronic health record (EHR) context matter for safe guidance, and that stronger models made better use of visual features. The results indicate that an accurate clinical description does not guarantee safe patient guidance, which is the argument for multi-dimensional evaluation in medical AI.
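As a rough illustration of why a single overall score can hide a weak dimension, the sketch below reports each model's mean score per dimension and names the leader on each. The 0-to-1 rubric scale, the `ConversationScore` fields, the model names, and all numbers are hypothetical placeholders. They are not IMCBench's scoring scheme or its results.

```python
from dataclasses import dataclass
from statistics import mean

# The three dimensions named in the summary; the 0-1 scale is an assumption.
DIMENSIONS = ("safety", "accuracy", "uncertainty")


@dataclass(frozen=True)
class ConversationScore:
    """Hypothetical rubric scores for one graded conversation."""
    model: str
    safety: float
    accuracy: float
    uncertainty: float


def per_dimension_means(scores):
    """Return {model: {dimension: mean score}} over all conversations."""
    by_model = {}
    for s in scores:
        by_model.setdefault(s.model, []).append(s)
    return {
        model: {d: mean(getattr(s, d) for s in rows) for d in DIMENSIONS}
        for model, rows in by_model.items()
    }


def leaders(report):
    """Return the best-scoring model for each dimension."""
    return {d: max(report, key=lambda m: report[m][d]) for d in DIMENSIONS}


# Made-up scores for two placeholder models.
scores = [
    ConversationScore("model_a", safety=0.9, accuracy=0.7, uncertainty=0.6),
    ConversationScore("model_a", safety=0.8, accuracy=0.6, uncertainty=0.7),
    ConversationScore("model_b", safety=0.6, accuracy=0.9, uncertainty=0.8),
    ConversationScore("model_b", safety=0.7, accuracy=0.8, uncertainty=0.7),
]

report = per_dimension_means(scores)
best = leaders(report)
for dimension, model in best.items():
    print(f"{dimension}: {model} ({report[model][dimension]:.2f})")
```

In this toy data, one placeholder model leads on safety while the other leads on accuracy and uncertainty, which is the kind of split a per-dimension report surfaces and an averaged score would blur.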

Why it matters

This benchmark provides a crucial tool for professionals developing or deploying AI in healthcare, enabling more robust and clinically relevant evaluation of multimodal LLMs, especially concerning safety and nuanced medical reasoning.

How to implement this in your domain

1. Utilize IMCBench to evaluate the safety and accuracy of multimodal AI models in medical applications.
2. Prioritize multi-dimensional evaluation frameworks that include safety, accuracy, and uncertainty handling.
3. Conduct ablation studies to understand the contribution of different input modalities (e.g., visual, EHR) to model performance.
4. Focus development efforts on improving model safety, particularly for rare and malignant conditions.
5. Collaborate with clinical experts to refine evaluation criteria and model outputs.
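The ablation step in the list above could be organized along these lines: score each case with the full input, then with one modality removed, and compare the means. This is a minimal sketch under stated assumptions. The case fields (`image`, `ehr`), the condition names, and `toy_safety_score` are hypothetical stand-ins, not part of IMCBench. In practice `score_fn` would wrap a real model call and a grader.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical case record: an image reference plus EHR-style context.
Case = Dict[str, Optional[str]]


def ablate(case: Case, drop: Optional[str]) -> Case:
    """Return a copy of the case with one input modality removed."""
    ablated = dict(case)
    if drop is not None:
        ablated[drop] = None
    return ablated


def run_ablation(cases: List[Case], score_fn: Callable[[Case], float]):
    """Return mean score per condition and each drop relative to full input."""
    conditions = {"full": None, "no_image": "image", "no_ehr": "ehr"}
    means = {
        name: sum(score_fn(ablate(c, drop)) for c in cases) / len(cases)
        for name, drop in conditions.items()
    }
    deltas = {
        name: means["full"] - m for name, m in means.items() if name != "full"
    }
    return means, deltas


def toy_safety_score(case: Case) -> float:
    """Stand-in scorer (not a real metric): rewards each input that is present."""
    return 0.6 * (case["image"] is not None) + 0.4 * (case["ehr"] is not None)


# Placeholder cases; file names and profile text are invented.
cases = [
    {"image": "img_001.png", "ehr": "synthetic profile 1"},
    {"image": "img_002.png", "ehr": "synthetic profile 2"},
]

means, deltas = run_ablation(cases, toy_safety_score)
for name, drop in deltas.items():
    print(f"{name}: safety drops by {drop:.2f} vs. full input")
```

Reporting the drop relative to the full-input condition, rather than raw scores alone, makes it easier to compare how much each model depends on a given modality.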

Who benefits

Healthcare, Medical AI, Pharmaceuticals, Biotech

Key takeaways

IMCBench offers a comprehensive benchmark for multimodal medical LLMs.
It evaluates models on safety, accuracy, and uncertainty in clinical conversations.
No current model excels across all dimensions, with safety being a key concern.
Visual input and EHR context are crucial for safe medical AI guidance.

Original post by Maria Xenochristou, Ashutosh Joshi, Korosh Vatanparvar, Mohammad Abuzar Hashemi, Prasad Kasu, Deepak Bansal, Anchal Nema, Nivedita Wadhwa, Prashams S Jain, Rebecca Abraham, Will Kimbrough, Dilek Hakkani-Tur, Wilko Schulz-Mahlendorf

"arXiv:2606.28556v1 Announce Type: new Abstract: Recent advances in large language models and vision-language models have enabled reasoning over multimodal data, offering opportunities for clinical applications such as decision support and triaging. However, existing medical AI be…"


