New Framework Reduces Visual Hallucinations in MLLMs

Pratheswaran Hariharan, Haiping Xu, Donghui Yan · June 16, 2026

Summary

A proposed retrieval-augmented, reliability-aware inference framework aims to mitigate visual hallucinations and overconfident predictions in multimodal large language models (MLLMs). The system uses an external visual evidence database and multiple reliability indicators to judge whether a prediction is trustworthy, then accepts the answer, answers with caution, or abstains. The authors report higher accuracy among accepted predictions and a lower wrong-answer rate, without retraining the MLLM.

Researchers have proposed a framework for reducing visual hallucinations and overconfident errors in multimodal large language models (MLLMs). Despite their strong vision-language understanding, MLLMs often produce unreliable outputs when visual evidence is weak, ambiguous, or inconsistent. Existing solutions typically focus on improving representation alignment or retrieval-augmented generation, but they lack mechanisms to quantify prediction reliability at the instance level.

The proposed framework combines retrieval augmentation with reliability-aware inference. It builds an external visual evidence database from pre-trained visual embeddings and queries it by nearest-neighbor retrieval. The retrieved evidence is used to estimate prediction trustworthiness through several reliability indicators, including similarity strength, class-support agreement, and entropy-based uncertainty. Based on these signals, a decision gate determines whether to accept a prediction confidently, answer with caution, or abstain and fall back when evidence is insufficient. A multimodal response-generation layer then produces the final user-facing response, conditioned on this reliability decision.

In experiments on ImageNet-100, the framework improved accepted-prediction accuracy from 85.84% to 88.88% and reduced the hallucination-like accepted wrong-answer rate from 14.16% to 11.12%, without retraining the core MLLM.
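The retrieval and scoring steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`retrieve_neighbors`, `reliability_indicators`), the use of cosine similarity, the default `k`, and the exact formula for each indicator are assumptions, since the summary names the indicators without defining them.

```python
import numpy as np

def retrieve_neighbors(query, db_embeddings, db_labels, k=5):
    """Return cosine similarities and labels of the k nearest database entries."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    db = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    sims = db @ q
    top = np.argsort(-sims)[:k]  # indices of the k most similar entries
    return sims[top], db_labels[top]

def reliability_indicators(sims, labels, predicted_label):
    """Compute three illustrative reliability signals from retrieved neighbors."""
    # Similarity strength (assumed): mean cosine similarity of the neighbors.
    similarity = float(np.mean(sims))
    # Class-support agreement (assumed): share of neighbors with the predicted label.
    agreement = float(np.mean(labels == predicted_label))
    # Entropy-based uncertainty (assumed): entropy of the neighbor label
    # distribution, divided by its maximum log(k) so it lies in [0, 1].
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    entropy = float(-(p * np.log(p)).sum())
    max_entropy = float(np.log(len(labels))) if len(labels) > 1 else 1.0
    return {
        "similarity": similarity,
        "agreement": agreement,
        "entropy": entropy / max_entropy,
    }
```

In this sketch, high similarity and agreement with low entropy suggest that the retrieved evidence supports the prediction; the embeddings themselves would come from whatever pre-trained visual encoder the deployment already uses.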

Why it matters

Reducing hallucinations and improving the trustworthiness of MLLMs is critical for their deployment in sensitive applications like medical diagnosis, autonomous driving, and content moderation. Professionals building or using MLLMs need methods to ensure their outputs are reliable and grounded in visual evidence.

How to implement this in your domain

1. Implement a retrieval-augmented visual evidence database to provide external context for MLLM predictions.
2. Integrate multiple reliability indicators (e.g., similarity, entropy) to quantify the trustworthiness of MLLM outputs.
3. Develop a decision gate that lets the system accept a prediction, answer with caution, or abstain, based on reliability scores.
4. Apply this framework to existing MLLM deployments to reduce visual hallucinations and improve overall system reliability without costly retraining.
5. Design user interfaces that communicate the MLLM's confidence level or reasons for caution/abstention to end-users.
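As one way to picture steps 3 and 5, the sketch below maps reliability signals to a three-way decision and a user-facing message. The thresholds, the rule for combining the signals, the message wording, and all names (`Decision`, `decide`, `render_response`) are hypothetical; the source does not report the gate's actual rule or threshold values.

```python
from enum import Enum

class Decision(Enum):
    ACCEPT = "accept"
    CAUTION = "caution"
    ABSTAIN = "abstain"

def decide(similarity, agreement, entropy,
           accept_thresholds=(0.80, 0.80, 0.30),
           caution_thresholds=(0.60, 0.50, 0.60)):
    """Map three reliability signals to a decision.

    Each threshold tuple is (min similarity, min agreement, max entropy),
    with entropy assumed to be normalized to [0, 1]. Values are placeholders
    that would need tuning on held-out data.
    """
    def passes(thresholds):
        min_sim, min_agree, max_ent = thresholds
        return (similarity >= min_sim
                and agreement >= min_agree
                and entropy <= max_ent)

    if passes(accept_thresholds):
        return Decision.ACCEPT
    if passes(caution_thresholds):
        return Decision.CAUTION
    return Decision.ABSTAIN

def render_response(label, decision):
    """Produce a user-facing message conditioned on the gate decision."""
    if decision is Decision.ACCEPT:
        return f"This image shows: {label}."
    if decision is Decision.CAUTION:
        return f"This may be: {label}, but the supporting visual evidence is limited."
    return "There is not enough visual evidence to answer reliably."

# Example: strong evidence is accepted, weak evidence leads to abstention.
print(render_response("golden retriever", decide(0.91, 1.0, 0.0)))
print(render_response("golden retriever", decide(0.35, 0.2, 0.9)))
```

Requiring all three signals to pass is a conservative choice; a weighted score would be an equally plausible design, and which works better is an empirical question.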

Who benefits

Healthcare, Autonomous Vehicles, Content Moderation, AI Development, Security

Key takeaways

A new framework uses retrieval-augmented reliability-aware inference to reduce MLLM visual hallucinations.
External visual evidence and multiple reliability indicators quantify prediction trustworthiness.
The system can accept a prediction, answer with caution, or abstain, which raised accuracy among accepted predictions and lowered the accepted wrong-answer rate.
This approach enhances MLLM reliability without requiring model retraining.
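The reported figures appear to be complementary: accepted-prediction accuracy and accepted wrong-answer rate sum to 100% in both cases (85.84% + 14.16%, and 88.88% + 11.12%). The sketch below shows how such metrics could be computed from per-example gate decisions. The function name, the extra `coverage` metric, and the toy data are illustrative assumptions, not values or definitions taken from the paper.

```python
def accepted_metrics(correct, accepted):
    """Summarize accuracy over the predictions the gate accepted.

    correct:  list of bools, True where the prediction matched the label.
    accepted: list of bools, True where the gate accepted the prediction.
    """
    n_accepted = sum(accepted)
    if n_accepted == 0:
        # Nothing was accepted, so accuracy over accepted items is undefined.
        return {"coverage": 0.0,
                "accepted_accuracy": None,
                "accepted_wrong_rate": None}
    n_right = sum(1 for c, a in zip(correct, accepted) if a and c)
    accuracy = n_right / n_accepted
    return {
        "coverage": n_accepted / len(accepted),  # share of inputs answered
        "accepted_accuracy": accuracy,
        "accepted_wrong_rate": 1.0 - accuracy,   # complement of accuracy
    }

# Toy data: five predictions, four accepted, three of those correct.
toy = accepted_metrics(
    correct=[True, True, False, True, False],
    accepted=[True, True, True, True, False],
)
print(toy)
```

Tracking coverage alongside accuracy matters in practice, because a gate can always raise accepted accuracy by abstaining more often.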

Original post by Pratheswaran Hariharan, Haiping Xu, Donghui Yan

"arXiv:2606.15782v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and halluci…"

