Guava Harness Unlocks Embodied AI Capabilities in Smaller Models

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao · June 18, 2026

Summary

Researchers introduce Guava, a harness framework for embodied manipulation that lets language models carry out manipulation tasks by calling external tools. The work identifies three design principles (iterative loops, semantic action abstractions, and multimodal observations) and shows that a 4-billion-parameter open-source model, trained on fewer than 2,000 simulated trajectories, can perform comparably to leading proprietary models.

Language models trained on large-scale vision-language data show strong potential as embodied AI agents. Instead of building an end-to-end system, the authors use a "harness" that lets the model call external tools for perception, planning, and control, so that high-level reasoning is combined with specialized modules. This research introduces Guava, a harness framework designed for embodied tool use.

Through systematic exploration, the team identified three key ingredients for effective embodied agents: an iterative perception-reasoning-action loop, semantic action abstractions, and multimodal observations.

To test whether these design principles hold across models, the researchers developed a training pipeline that distills manipulation skills into a compact 4-billion-parameter open-source model, using fewer than 2,000 simulated trajectories. In both simulated and real-world experiments, this smaller model, when equipped with Guava, performed comparably to leading proprietary models and generalized to new objects, new instructions, and long-horizon tasks. The result suggests that a scalable, model-agnostic interface can give smaller models strong embodied manipulation capabilities.
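
To make the three ingredients concrete, here is a minimal sketch of a perceive-reason-act loop over semantic tools. It is not Guava's actual interface: the names `Observation`, `run_harness`, `make_toy_tools`, and `scripted_policy`, and the tool names `pick`, `place`, and `done`, are all hypothetical, and a scripted function stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """Multimodal observation: text feedback plus optional named image payloads."""
    text: str
    images: Dict[str, bytes] = field(default_factory=dict)  # hypothetical: camera name -> frame


# A semantic action is a tool name plus keyword arguments, e.g. ("pick", {"obj": "cup"}).
Action = Tuple[str, Dict[str, str]]
# Any callable with this signature can be plugged in, which keeps the loop model-agnostic.
Policy = Callable[[str, List[Observation]], Action]
Tool = Callable[..., Observation]


def run_harness(instruction: str, policy: Policy, tools: Dict[str, Tool],
                first_obs: Observation, max_steps: int = 10) -> List[Action]:
    """Iterate perceive -> reason -> act until the policy emits 'done' or steps run out."""
    history = [first_obs]
    executed: List[Action] = []
    for _ in range(max_steps):
        name, args = policy(instruction, history)  # reason over everything seen so far
        if name == "done":
            break
        if name not in tools:
            # Feed the error back as an observation so the policy can recover.
            history.append(Observation(text=f"error: unknown tool '{name}'"))
            continue
        executed.append((name, args))
        history.append(tools[name](**args))  # act, then perceive the result
    return executed


def make_toy_tools(world: dict) -> Dict[str, Tool]:
    """Toy stand-ins for real perception/control modules (assumption, not from the paper)."""
    def pick(obj: str) -> Observation:
        if world["held"] is not None:
            return Observation("error: gripper is full")
        world["held"] = obj
        return Observation(f"holding {obj}")

    def place(target: str) -> Observation:
        obj = world["held"]
        if obj is None:
            return Observation("error: gripper is empty")
        world["held"] = None
        world["placed"][obj] = target
        return Observation(f"placed {obj} on {target}")

    return {"pick": pick, "place": place}


def scripted_policy(instruction: str, history: List[Observation]) -> Action:
    """Scripted stand-in for a language model: decides from the latest observation."""
    last = history[-1].text
    if last.startswith("placed"):
        return ("done", {})
    if last.startswith("holding"):
        return ("place", {"target": "shelf"})
    return ("pick", {"obj": "cup"})


world = {"held": None, "placed": {}}
actions = run_harness("put the cup on the shelf", scripted_policy,
                      make_toy_tools(world), Observation("a cup is on the table"))
```

Because the loop only sees a `Policy` callable and a dictionary of tools, swapping the scripted policy for a real model call, or the toy tools for real perception and control modules, would not change `run_harness` itself.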

Why it matters

This research provides a pathway for developing highly capable embodied AI agents using smaller, more accessible models and less data. For robotics engineers and AI developers, it offers a practical framework and key design principles to build efficient and generalizable manipulation systems, potentially democratizing access to advanced robotic capabilities.

How to implement this in your domain

  1. Adopt the Guava framework's principles (iterative loops, semantic action abstractions, multimodal observations) when designing embodied AI systems.
  2. Explore distilling embodied manipulation capabilities into smaller, open-source models to reduce computational costs and increase accessibility.
  3. Leverage simulation environments for collecting training data to minimize real-world data collection efforts.
  4. Investigate tool-use architectures as an alternative to monolithic end-to-end systems for complex robotic tasks.
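
As a starting point for steps 2 and 3, the sketch below shows one way simulated trajectories might be flattened into supervised fine-tuning examples for a smaller model. The trajectory schema (`instruction`, `success`, `steps`, `observation`, `action`), the function name `trajectories_to_sft_examples`, and the choice to keep only successful rollouts are assumptions made for illustration; the brief does not describe the paper's actual data format or filtering.

```python
import json
from typing import Dict, List


def trajectories_to_sft_examples(trajectories: List[dict]) -> List[Dict[str, str]]:
    """Flatten trajectories into one (prompt, completion) pair per step.

    Each prompt holds the instruction plus everything observed and done so far;
    the completion is the next tool call, serialized as JSON.
    """
    examples: List[Dict[str, str]] = []
    for traj in trajectories:
        if not traj["success"]:
            continue  # assumption: only successful rollouts serve as training targets
        context = [f"Instruction: {traj['instruction']}"]
        for step in traj["steps"]:
            context.append(f"Observation: {step['observation']}")
            completion = json.dumps(step["action"], sort_keys=True)
            examples.append({"prompt": "\n".join(context), "completion": completion})
            context.append(f"Action: {completion}")  # later steps see earlier actions
    return examples


# Hypothetical simulated rollouts in the assumed schema.
demo_trajectories = [
    {
        "instruction": "put the cup on the shelf",
        "success": True,
        "steps": [
            {"observation": "cup on table",
             "action": {"tool": "pick", "args": {"obj": "cup"}}},
            {"observation": "holding cup",
             "action": {"tool": "place", "args": {"target": "shelf"}}},
        ],
    },
    {
        "instruction": "open the drawer",
        "success": False,
        "steps": [{"observation": "drawer closed",
                   "action": {"tool": "pull", "args": {}}}],
    },
]

sft_examples = trajectories_to_sft_examples(demo_trajectories)
```

The resulting list of prompt and completion pairs can then be passed to whichever fine-tuning toolchain you already use.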

Who benefits

Robotics, Manufacturing, Logistics, Healthcare, AI Research

Key takeaways

  • Guava is a harness framework for embodied AI that enables tool use.
  • Key ingredients for effective embodied agents include iterative loops, semantic actions, and multimodal observations.
  • Smaller open-source models can achieve high performance with this framework and minimal data.
  • The approach offers a scalable and model-agnostic interface for embodied manipulation.

Original post by Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

"arXiv:2606.18363v1. Abstract: Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tools use offers a promising alternative to end-to-end vision-language-action s…"
