GRID Learns Universal Behaviors from Diverse Agent Observations

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung · June 18, 2026

Summary

Researchers introduce General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from heterogeneous populations of demonstrators. GRID decomposes reward functions into general and specific components, enabling generalist pretraining without mode-averaging bias.

Humans often learn new skills by observing others and inferring effective actions from the behavior they see. When the observations come from a diverse group of agents with different goals, however, the behavioral signals conflict, and it becomes hard to tell which behaviors are universally beneficial. General Reward Inference and Disentanglement (GRID) addresses this problem. It decomposes each agent's reward function into two parts: a "general reward" that captures behaviors common across all agents, and "specific rewards" that reflect individual preferences and objectives.

Training an agent exclusively on the general reward provides a novel paradigm for generalist pretraining. The resulting generalist agent internalizes fundamental environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias often seen in standard learning-from-demonstration techniques. It then serves as a superior prior for fine-tuning to specific downstream tasks, including tasks with preferences not observed during initial training.

Experiments across several environments, including multi-agent Craftax and an autonomous driving simulator, show that GRID semantically disentangles reward structures, outperforms baseline methods, and enables more efficient and stable specialization.
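The contrast between mode averaging and a shared general reward can be made concrete with a toy example. The sketch below is not the GRID algorithm. It assumes an additive split (total reward = general + specific), and the function names, the obstacle setting, and all numbers are hypothetical. Two demonstrators pass an obstacle on opposite sides; averaging their positions lands inside the obstacle, while the shared reward term still identifies what both have in common.

```python
# Toy illustration only; not the GRID algorithm from the paper.
# Setting: choose a lateral position x to pass an obstacle centered at x = 0.

def general_reward(x):
    """Hypothetical shared term: every demonstrator avoids the obstacle."""
    return -10.0 if abs(x) < 1.0 else 0.0


def make_specific_reward(preferred_x):
    """Hypothetical per-agent term: a mild pull toward one side."""
    def specific_reward(x):
        return -0.1 * (x - preferred_x) ** 2
    return specific_reward


# Candidate positions from -3.0 to 3.0 in steps of 0.5.
CANDIDATES = [i * 0.5 for i in range(-6, 7)]


def best_position(reward_fn):
    """Return the candidate position with the highest reward."""
    return max(CANDIDATES, key=reward_fn)


prefers_left = make_specific_reward(-2.0)
prefers_right = make_specific_reward(2.0)

# Assumption: each demonstrator's total reward is general + specific (additive).
demo_left = best_position(lambda x: general_reward(x) + prefers_left(x))
demo_right = best_position(lambda x: general_reward(x) + prefers_right(x))

# Naive imitation that averages the two demonstrated positions.
averaged_position = (demo_left + demo_right) / 2

print("demonstrations:", demo_left, demo_right)
print("averaged position:", averaged_position,
      "general reward:", general_reward(averaged_position))
print("general reward of each demonstration:",
      general_reward(demo_left), general_reward(demo_right))
```

In this toy setting the averaged position receives the obstacle penalty even though neither demonstrator ever went there, which is the kind of failure the general/specific split is meant to avoid.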

Why it matters

GRID offers a route to more adaptable, general-purpose AI agents that learn efficiently from diverse human or AI demonstrations. This matters for agents that must operate robustly in complex, multi-agent environments and generalize to new tasks with minimal retraining.

How to implement this in your domain

  1. Apply GRID's reward decomposition technique to learn generalizable policies from diverse human or simulated agent demonstrations.
  2. Develop generalist AI agents by pretraining them on universal behaviors extracted using GRID, then fine-tune for specific tasks.
  3. Utilize GRID to mitigate mode-averaging bias in imitation learning scenarios with heterogeneous data sources.
  4. Integrate GRID into multi-agent reinforcement learning systems to foster cooperative or universally beneficial behaviors.
  5. Explore the application of GRID in robotics for learning robust foundational skills from varied human demonstrations.
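The pretrain-then-fine-tune pattern in the steps above can be sketched with a deterministic bandit standing in for a real environment. This is a minimal illustration under stated assumptions, not the authors' training procedure: `general_reward`, `downstream_reward`, `fit_values`, and `softmax_policy` are hypothetical helpers, and the general reward is assumed to have been inferred already.

```python
import math

# Toy sketch only; a deterministic bandit stands in for a real RL environment.
ACTIONS = [-2.0, -1.0, 0.0, 1.0, 2.0]  # lateral positions; obstacle at 0.0
OBSTACLE_INDEX = ACTIONS.index(0.0)


def general_reward(x):
    """Hypothetical general reward, assumed to be already inferred."""
    return -10.0 if abs(x) < 0.5 else 0.0


def downstream_reward(x):
    """Hypothetical new task: same safety term plus a preference for x = 1.0."""
    return general_reward(x) - 0.5 * (x - 1.0) ** 2


def fit_values(reward_fn, init_values, steps, lr=0.5):
    """Move each action's value estimate toward its reward for `steps` updates."""
    values = list(init_values)
    for _ in range(steps):
        values = [v + lr * (reward_fn(a) - v) for v, a in zip(values, ACTIONS)]
    return values


def softmax_policy(values):
    """Turn value estimates into action probabilities."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


zeros = [0.0] * len(ACTIONS)

# Stage 1: generalist pretraining on the general reward only.
pretrained = fit_values(general_reward, zeros, steps=20)

# Stage 2: one fine-tuning update on the downstream reward,
# starting from the pretrained values versus from scratch.
finetuned = fit_values(downstream_reward, pretrained, steps=1)
scratch = fit_values(downstream_reward, zeros, steps=1)

p_prior = softmax_policy(pretrained)
p_finetuned = softmax_policy(finetuned)
p_scratch = softmax_policy(scratch)

print("P(obstacle) under pretrained prior:", p_prior[OBSTACLE_INDEX])
print("P(obstacle) after one update, pretrained init:", p_finetuned[OBSTACLE_INDEX])
print("P(obstacle) after one update, zero init:", p_scratch[OBSTACLE_INDEX])
print("preferred action after fine-tuning:",
      ACTIONS[p_finetuned.index(max(p_finetuned))])
```

In this toy setup the pretrained prior already steers away from the obstacle before any task-specific update, so a single fine-tuning step only has to learn the new preference. A real application would replace the bandit with an actual environment and a learned reward model.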

Who benefits

Robotics, Autonomous Vehicles, Gaming AI, Multi-Agent Systems, AI Research

Key takeaways

  • GRID learns universal behaviors from diverse agents by disentangling general and specific rewards.
  • It enables generalist pretraining, yielding agents with fundamental environmental competencies.
  • This method avoids the mode-averaging bias common in standard imitation learning.
  • Generalist agents trained with GRID serve as strong priors for efficient fine-tuning to new tasks.

Original post by Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

"arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, maki…"


