New Benchmark Reveals AI Agents Fail Hidden Social Norms

Shiyun Zhao, Xinwei Song, Tianyu Guo, Xiaomeng Gao, Mingyuan Liu, Xu Han, Yuanyuan Zhang, Zhenliang Zhang, Xue Feng, Bo Dai · June 29, 2026

Summary

Researchers introduce NormAct, a benchmark for embodied social-norm interactions that evaluates multimodal large language models (MLLMs) on their ability to infer and comply with hidden social norms during planning. Experiments show a significant gap between explicit goal achievement and hidden-norm compliance in state-of-the-art MLLMs, and the authors propose NormPerceptor to address it.

As multimodal large language models (MLLMs) are increasingly used as embodied agents in virtual or physical environments, their ability to handle not just explicit goals but also implicit social norms becomes critical. This research exposes a gap in current MLLM capabilities by introducing NormAct, a new benchmark designed to test compliance with hidden social norms during embodied planning. NormAct presents ordinary tasks where success requires both achieving an instructed goal and adhering to unstated social conventions. Evaluations of leading MLLMs (such as GPT-5.4, Claude Opus 4.7, and Gemini 3 Pro) revealed a stark disparity: the models achieved explicit goals in over two-thirds of cases but complied with hidden norms in fewer than a third.

This suggests the issue is not a lack of general social knowledge but difficulty in activating and applying the relevant norms within a specific context. To address this, the paper proposes NormPerceptor, a context-conditioned cue generator that infers scene-relevant norms before planning. Integrating NormPerceptor nearly doubled task success. The work underscores that embodied agents need to proactively detect, ground, and integrate hidden social norms into their action planning in order to behave appropriately and effectively.
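To make the distinction between goal achievement and norm compliance concrete, here is a minimal scoring sketch. It assumes each episode is judged on two independent booleans and that an episode counts as a success only when both hold, which matches the summary's description of task success. The names `EpisodeResult` and `summarize` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one episode (hypothetical record, not the paper's schema)."""
    goal_achieved: bool  # the instructed goal was completed
    norm_followed: bool  # the unstated social norm was respected


def summarize(results: Sequence[EpisodeResult]) -> Dict[str, float]:
    """Return the goal rate, norm rate, and joint success rate."""
    n = len(results)
    if n == 0:
        raise ValueError("no episodes to score")
    goal = sum(r.goal_achieved for r in results) / n
    norm = sum(r.norm_followed for r in results) / n
    # Assumption: an episode succeeds only if both conditions hold.
    joint = sum(r.goal_achieved and r.norm_followed for r in results) / n
    return {"goal_rate": goal, "norm_rate": norm, "joint_success": joint}
```

Reporting the three rates separately is what makes a gap like the one described above visible: a high goal rate can coexist with a low joint success rate when norms are ignored.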

Why it matters

For professionals developing embodied AI, robotics, or virtual assistants, ensuring social appropriateness is as important as task completion. This benchmark and proposed solution highlight a critical area for development, enabling the creation of AI systems that are not only functional but also socially intelligent and acceptable in human environments.

How to implement this in your domain

1. Evaluate existing embodied AI agents against the NormAct benchmark to identify gaps in social norm compliance.
2. Develop context-aware modules that can infer and activate relevant social norms based on environmental cues.
3. Integrate norm-grounding mechanisms into MLLM planning pipelines to ensure social constraints are considered alongside explicit goals.
4. Prioritize training data and fine-tuning strategies that emphasize implicit social understanding for embodied agents.
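The second and third steps above can be sketched as a small wrapper that generates norm cues before the planner runs. This is only an illustration of the cue-then-plan idea under stated assumptions, not NormPerceptor itself: the function names, the prompt wording, and the use of a text scene description are all hypothetical, and the cue generator and planner are passed in as plain callables so that any model can sit behind them.

```python
from typing import Callable, List

# Hypothetical interfaces: each callable would wrap a model call in practice.
CueGenerator = Callable[[str], List[str]]  # scene description -> norm cues
Planner = Callable[[str], List[str]]       # prompt -> ordered action steps


def build_prompt(instruction: str, cues: List[str]) -> str:
    """Combine the explicit task with any inferred norm cues."""
    lines = [f"Task: {instruction}"]
    if cues:
        lines.append("Social norms that may apply in this scene:")
        lines.extend(f"- {cue}" for cue in cues)
    lines.append("Plan actions that achieve the task in a socially appropriate way.")
    return "\n".join(lines)


def plan_with_norm_cues(
    instruction: str,
    scene_description: str,
    cue_generator: CueGenerator,
    planner: Planner,
) -> List[str]:
    """Infer scene-relevant norms first, then plan with them in context."""
    cues = cue_generator(scene_description)
    return planner(build_prompt(instruction, cues))
```

Keeping the cue generator separate from the planner means the same planner can be run with and without cues, which makes it straightforward to measure how much the cue step changes the outcome.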

Who benefits

Robotics, Automotive, Healthcare, Customer Service, Gaming

Key takeaways

Embodied MLLMs struggle significantly with inferring and complying with hidden social norms.
The NormAct benchmark reveals a large gap between explicit goal achievement and social appropriateness.
The issue stems from difficulty in activating and grounding social knowledge in context, not a lack of general knowledge.
NormPerceptor, a cue generator, can improve norm compliance and overall task success.

Original post by Shiyun Zhao, Xinwei Song, Tianyu Guo, Xiaomeng Gao, Mingyuan Liu, Xu Han, Yuanyuan Zhang, Zhenliang Zhang, Xue Feng, Bo Dai

"arXiv:2606.27826v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) are increasingly deployed as embodied planners in egocentric environments, where task success requires not only achieving instructed goals but also acting in socially appropriate ways. While…"



