Agentic RAG-VLM Enhances Robotic Grasping with Self-Reflection

Tao Chen, Lizheng Liu, Jiaxu Wang, Ziyue Jiang, Ruiqi Tian, JiGuang Huo, Zhongxue Gan · July 1, 2026

Summary

Agentic RAG-VLM is a unified framework that improves robotic grasping in cluttered environments by integrating affordance-aware retrieval, scene graph reasoning, and agentic self-reflective planning. It reports a 78.3% overall success rate, a 53.3 percentage-point gain over VLM-only baselines, by accounting for physical affordances and refining grasps in a closed loop.

Deploying robotic manipulators in unstructured human environments requires grasping that generalizes, especially in clutter. Existing Vision-Language Model (VLM)-based methods often match objects by visual similarity alone and neglect physical affordances such as handle graspability or material fragility. These systems also typically run open-loop, without spatial reasoning or failure recovery, which limits them on densely packed or physically diverse objects.

Agentic RAG-VLM is designed to bridge the gap between VLM semantic understanding and physically grounded grasp execution. It combines retrieval-augmented generation (RAG) with VLMs and agentic self-reflective planning through three interconnected components.

First, a Hierarchical Affordance-Aware RAG (HAA-RAG) encodes four-dimensional affordance descriptors (type, material, fragility, graspable region) and retrieves strategies by functional compatibility rather than visual appearance alone.

Second, a Scene Graph Constraint Reasoner builds spatial relationship graphs from VLM perception and translates proximity, occlusion, and support constraints into grasp parameter adjustments.

Finally, an Agentic Self-Reflective Pipeline, equipped with a 14-type failure taxonomy and three-level adaptive retry, enables closed-loop grasp refinement.

Evaluated on a 12-task benchmark with 360 trials per configuration, Agentic RAG-VLM achieved a 78.3% overall success rate, a 53.3 percentage-point absolute gain over VLM-only baselines. The authors present this as evidence of the importance of affordance-aware retrieval, scene graph reasoning, and agentic recovery for robust manipulation.
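
The affordance-aware retrieval idea can be illustrated with a small Python sketch. It is not the authors' implementation: only the four descriptor fields (type, material, fragility, graspable region) come from the summary. The value vocabularies, the 0-to-1 fragility scale, the weights, the strategy names, and the flat (non-hierarchical) scoring are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affordance:
    """Four descriptor fields named in the summary; value formats are assumed."""
    type: str              # e.g. "mug"
    material: str          # e.g. "ceramic"
    fragility: float       # assumed scale: 0.0 (robust) to 1.0 (very fragile)
    graspable_region: str  # e.g. "handle"


@dataclass(frozen=True)
class Strategy:
    name: str              # hypothetical strategy label
    affordance: Affordance


# Hypothetical weights; the summary does not say how the fields are combined.
WEIGHTS = {"type": 0.2, "material": 0.2, "fragility": 0.3, "graspable_region": 0.3}


def affordance_score(query: Affordance, stored: Affordance) -> float:
    """Score functional compatibility instead of visual similarity."""
    score = WEIGHTS["type"] * (query.type == stored.type)
    score += WEIGHTS["material"] * (query.material == stored.material)
    score += WEIGHTS["graspable_region"] * (
        query.graspable_region == stored.graspable_region
    )
    # Closer fragility values contribute more.
    score += WEIGHTS["fragility"] * (1.0 - abs(query.fragility - stored.fragility))
    return score


def retrieve(query: Affordance, library: list, k: int = 1) -> list:
    """Return the k stored strategies with the highest affordance score."""
    ranked = sorted(
        library,
        key=lambda s: affordance_score(query, s.affordance),
        reverse=True,
    )
    return ranked[:k]


LIBRARY = [
    Strategy("power_grasp_body", Affordance("can", "metal", 0.1, "body")),
    Strategy("pinch_grasp_handle", Affordance("mug", "ceramic", 0.7, "handle")),
    Strategy("gentle_grasp_rim", Affordance("glass", "glass", 0.9, "rim")),
]

# A teapot has no entry of its own, but it shares material, graspable region,
# and similar fragility with the mug, so the mug's strategy ranks first.
query = Affordance("teapot", "ceramic", 0.8, "handle")
best = retrieve(query, LIBRARY, k=1)[0]
print(best.name)
```

The design point is that the teapot is matched to the mug strategy because of shared functional properties, even though the object type differs.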

Why it matters

If the reported gains carry over beyond the benchmark, this framework would let robots grasp more reliably in cluttered, unstructured real-world environments. That matters for automation in logistics, manufacturing, and service robotics.

How to implement this in your domain

1. Integrate affordance-aware retrieval and scene graph reasoning into robotic manipulation systems for improved grasp planning.
2. Implement agentic self-reflective planning with failure taxonomies for robust error recovery in robotic tasks.
3. Develop training datasets that include detailed physical affordance descriptors for objects.
4. Apply this framework to automate complex assembly or pick-and-place tasks in manufacturing and logistics.
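
Step 2 above can be sketched as a closed-loop retry routine. This is a minimal illustration under stated assumptions, not the paper's pipeline: the summary mentions a 14-type failure taxonomy and three-level adaptive retry but names neither, so the three failure types, the three level names, the mapping between them, and the `revise` rules below are all hypothetical.

```python
from enum import Enum
from typing import Callable, Optional


class Failure(Enum):
    # Hypothetical examples; the paper's 14 failure types are not listed here.
    SLIP = "slip"
    COLLISION = "collision"
    UNREACHABLE = "unreachable"


class RetryLevel(Enum):
    # Hypothetical names for the three retry levels.
    ADJUST_PARAMETERS = 1
    SWITCH_STRATEGY = 2
    REPLAN_SCENE = 3


# Assumed mapping from diagnosed failure to how far the retry escalates.
ESCALATION = {
    Failure.SLIP: RetryLevel.ADJUST_PARAMETERS,
    Failure.COLLISION: RetryLevel.SWITCH_STRATEGY,
    Failure.UNREACHABLE: RetryLevel.REPLAN_SCENE,
}


def revise(plan: dict, level: RetryLevel) -> dict:
    """Return a revised copy of the plan; the rules are placeholders."""
    new_plan = dict(plan)
    if level is RetryLevel.ADJUST_PARAMETERS:
        new_plan["grip_force"] = plan["grip_force"] * 1.2
    elif level is RetryLevel.SWITCH_STRATEGY:
        new_plan["strategy"] = "alternate"
    else:
        new_plan["replanned"] = True
    return new_plan


def grasp_with_reflection(
    attempt: Callable[[dict], Optional[Failure]],
    plan: dict,
    max_retries: int = 3,
) -> tuple:
    """Try a grasp, classify any failure, revise the plan, and retry."""
    log = []
    for i in range(max_retries + 1):
        failure = attempt(plan)
        if failure is None:
            log.append(f"attempt {i}: success")
            return True, log
        level = ESCALATION[failure]
        log.append(f"attempt {i}: {failure.value} -> {level.name}")
        plan = revise(plan, level)
    return False, log


def fake_attempt(plan: dict) -> Optional[Failure]:
    """Stand-in for a real grasp: slips until the grip force is high enough."""
    return None if plan["grip_force"] >= 11.5 else Failure.SLIP


ok, log = grasp_with_reflection(
    fake_attempt, {"grip_force": 10.0, "strategy": "default"}
)
print(ok, log)
```

In a real system, `attempt` would execute the grasp and classify the outcome from sensor feedback; here it is a stub so the loop can be run without a robot.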

Who benefits

Robotics, Logistics, Manufacturing, Healthcare, Service Automation

Key takeaways

Agentic RAG-VLM improves robotic grasping in cluttered environments with self-reflection.
It uses Hierarchical Affordance-Aware RAG for functional compatibility-based strategy retrieval.
A Scene Graph Constraint Reasoner translates spatial relationships into grasp adjustments.
The framework achieves 78.3% success, a 53.3 percentage-point gain over VLM-only baselines (implying a baseline success rate of about 25.0%).
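
The scene graph takeaway can be made concrete with a short sketch of how spatial relations might be turned into grasp adjustments. The three relation kinds (proximity, occlusion, support) come from the summary. The data layout, the parameter names, and the specific rules (narrowing the gripper to 6 cm, switching to a top approach, blocking a grasp on a supporting object) are assumptions for illustration only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Relation:
    """One scene graph edge; kind is 'near', 'occludes', or 'supports'."""
    subject: str
    kind: str
    target: str


@dataclass(frozen=True)
class GraspParams:
    # Hypothetical grasp parameters; the paper's actual set is not given here.
    gripper_width_cm: float
    approach: str          # "top" or "side"
    allowed: bool = True


def adjust(target: str, relations: list, params: GraspParams) -> GraspParams:
    """Apply assumed rules for each relation that involves the target."""
    for r in relations:
        if r.kind == "near" and target in (r.subject, r.target):
            # Proximity: narrow the gripper opening to avoid the neighbor.
            params = replace(
                params, gripper_width_cm=min(params.gripper_width_cm, 6.0)
            )
        elif r.kind == "occludes" and r.target == target:
            # Occlusion: approach from above instead of from the blocked side.
            params = replace(params, approach="top")
        elif r.kind == "supports" and r.subject == target:
            # Support: do not lift an object while something rests on it.
            params = replace(params, allowed=False)
    return params


SCENE = [
    Relation("box", "near", "mug"),
    Relation("book", "occludes", "mug"),
    Relation("mug", "supports", "spoon"),
]

mug_params = adjust("mug", SCENE, GraspParams(8.0, "side"))
spoon_params = adjust("spoon", SCENE, GraspParams(8.0, "side"))
print(mug_params)
print(spoon_params)
```

With this scene, the mug grasp is narrowed, redirected to a top approach, and blocked until the spoon is removed, while the spoon's parameters are left unchanged.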

Original post by Tao Chen, Lizheng Liu, Jiaxu Wang, Ziyue Jiang, Ruiqi Tian, JiGuang Huo, Zhongxue Gan

"arXiv:2606.31200v1 Announce Type: new Abstract: Generalizable robotic grasping in cluttered environments is essential for deploying manipulators in unstructured human spaces, yet existing VLM-based methods rely on visual similarity for object matching, neglecting physical afforda…"
