PhysDrift Improves Humanoid Co-Speech Motion Generation

Zhangzhao Liang, Xiaofen Xing, Mingyue Yang, Wenlve Zhou, Xiangmin Xu · June 19, 2026

Summary

This research introduces PhysDrift, a framework that directly generates physically executable humanoid joint trajectories from speech, bypassing human-centric motion representations. It addresses the "embodiment gap" where retargeting human motions to robots causes inconsistencies and reduces expressive diversity.

Current methods for generating co-speech motion for humanoid robots typically create motions for a human body model first and then retarget them to the robot. The paper identifies an "embodiment gap" in this approach: differences between human and robot body mechanics introduce inconsistencies and reduce motion diversity during transfer, which limits the expressiveness of humanoid behavior.

To address this, the researchers propose IK-EER, a framework for curating robot-native motion data that ensures both kinematic feasibility and precise speech-motion synchronization. Building on it, they introduce PhysDrift, an embodiment-aware generation framework that predicts executable humanoid joint movements directly from speech, with no intermediate human-body representation. By keeping the embodiment consistent across training and inference and adding physical regularization, PhysDrift improves speech-motion alignment, physical plausibility, and smoothness. Experiments and real-world robot deployments support these claims and show better efficiency and real-time interaction than human-centric pipelines.
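To make the embodiment gap concrete, here is a minimal sketch of how naive retargeting can shrink motion diversity. It is an illustration only, not the paper's method or its evaluation metric: the joint limits, the clamping-based retargeting, and the function names are all assumptions chosen for the example.

```python
from statistics import pvariance

# Hypothetical joint limits in radians for one robot joint (not from the paper).
ROBOT_LIMITS = (-1.0, 1.0)


def retarget_by_clamping(human_angles, limits=ROBOT_LIMITS):
    """Naive retargeting: clamp each human joint angle into the robot's range."""
    lo, hi = limits
    return [min(max(a, lo), hi) for a in human_angles]


def embodiment_gap_stats(human_angles, limits=ROBOT_LIMITS):
    """Return (fraction of frames altered, share of variance kept after clamping)."""
    robot_angles = retarget_by_clamping(human_angles, limits)
    changed = sum(1 for h, r in zip(human_angles, robot_angles) if h != r)
    clipped_fraction = changed / len(human_angles)
    variance_ratio = pvariance(robot_angles) / pvariance(human_angles)
    return clipped_fraction, variance_ratio


# Toy human shoulder trajectory (radians) that exceeds the robot's range.
human = [-1.6, -0.8, 0.0, 0.8, 1.6, 0.8, 0.0, -0.8]
clipped_fraction, variance_ratio = embodiment_gap_stats(human)
print(f"frames altered: {clipped_fraction:.2f}, variance kept: {variance_ratio:.2f}")
```

In this toy case a quarter of the frames are altered and roughly 40% of the motion variance is lost, which is the kind of diversity reduction that generating directly in the robot's joint space is meant to avoid.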

Why it matters

Generating motion directly for the robot's own body supports more natural, expressive, and physically realistic humanoid behavior, which in turn enables smoother human-robot interaction in areas such as entertainment, healthcare, and education.

How to implement this in your domain

  1. Evaluate existing co-speech motion generation pipelines for humanoid robots for embodiment consistency.
  2. Consider adopting robot-native motion generation approaches to improve physical plausibility and expressiveness.
  3. Integrate physical regularization techniques into robot motion planning for enhanced stability.
  4. Explore direct speech-to-robot motion mapping to reduce the "embodiment gap."
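As a starting point for the physical regularization step, the sketch below scores a single-joint trajectory for limit violations and roughness. The summary does not describe PhysDrift's actual regularizer, so the penalty terms, weights, limits, and function name here are assumptions for illustration.

```python
def physical_penalty(traj, dt, pos_limits, vel_limit,
                     w_pos=1.0, w_vel=1.0, w_jerk=0.1):
    """Penalty for a single-joint trajectory (list of angles in radians).

    Sums squared joint-limit violations, squared velocity-limit violations,
    and squared jerk (third finite difference), each with its own weight.
    All weights and limits are hypothetical placeholders.
    """
    lo, hi = pos_limits
    # Distance outside the allowed position range, zero when inside.
    pos_term = sum(max(lo - q, 0.0, q - hi) ** 2 for q in traj)
    # Finite-difference velocity and how far it exceeds the speed limit.
    vel = [(b - a) / dt for a, b in zip(traj, traj[1:])]
    vel_term = sum(max(abs(v) - vel_limit, 0.0) ** 2 for v in vel)
    # Jerk as the third finite difference; large values mean abrupt motion.
    acc = [(b - a) / dt for a, b in zip(vel, vel[1:])]
    jerk = [(b - a) / dt for a, b in zip(acc, acc[1:])]
    jerk_term = sum(j ** 2 for j in jerk)
    return w_pos * pos_term + w_vel * vel_term + w_jerk * jerk_term


# A smooth ramp inside all limits versus an abrupt jump past the joint limit.
smooth = physical_penalty([0.0, 0.25, 0.5, 0.75], dt=0.25,
                          pos_limits=(-1.0, 1.0), vel_limit=2.0)
abrupt = physical_penalty([0.0, 0.0, 2.0, 2.0], dt=1.0,
                          pos_limits=(-1.0, 1.0), vel_limit=1.0)
print(smooth, abrupt)
```

A score like this can be used to audit the output of an existing pipeline, or added as an extra loss term during training; a multi-joint version would sum the penalty over joints with per-joint limits.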

Who benefits

Robotics, Entertainment, Healthcare, Education, Human-Robot Interaction

Key takeaways

  • Existing human-centric motion generation for robots creates an "embodiment gap."
  • PhysDrift directly generates robot-native co-speech motions, bypassing human models.
  • This approach improves physical plausibility, speech-motion alignment, and smoothness.
  • The framework enhances real-time interaction capabilities for humanoid robots.

Original post by Zhangzhao Liang, Xiaofen Xing, Mingyue Yang, Wenlve Zhou, Xiangmin Xu

"arXiv:2606.19935v1. Abstract: Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions…"
