3D-DLP Learns Self-Supervised Object-Centric 3D Scene Representations

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel · June 19, 2026

Summary

This paper introduces 3D-DLP, a self-supervised model that decomposes 3D scene observations into a set of 3D latent particles, each representing a distinct object with disentangled attributes. The learned latent space is interpretable and controllable, improving robotic manipulation performance over baselines.

Understanding and representing complex 3D scenes is a fundamental challenge in AI, particularly for robotics. This research introduces 3D-DLP (3D Deep Latent Particles), a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Each particle represents a distinct entity in the scene and encodes disentangled attributes such as 3D keypoint position, bounding box dimensions, and appearance features. The model also learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective, so it needs no human-provided object labels.

Experiments on both simulated and real-world datasets indicate that the learned latent space is interpretable and controllable: moving particles to new positions and then decoding them generates novel scene configurations. Using these compact 3D latent particles for downstream robotic manipulation tasks also significantly improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure.
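To make the idea of editing a scene through its particles more concrete, here is a minimal Python sketch. It is not the authors' code or API: the `LatentParticle` class, its field names, and the `move_particle` helper are hypothetical, and the learned encoder and decoder are left out entirely. It only shows why disentangled attributes are convenient, since one particle's position can change while its size and appearance stay fixed.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class LatentParticle:
    """Hypothetical container for one particle's disentangled attributes."""
    position: Vec3                 # 3D keypoint position
    box_size: Vec3                 # bounding box dimensions
    appearance: Tuple[float, ...]  # appearance feature vector


def move_particle(scene: List[LatentParticle], index: int, offset: Vec3) -> List[LatentParticle]:
    """Return a new scene with one particle translated; all others are unchanged."""
    particle = scene[index]
    new_position = tuple(c + d for c, d in zip(particle.position, offset))
    edited = list(scene)  # copy the list so the input scene is not modified
    edited[index] = replace(particle, position=new_position)
    return edited


# Toy two-object scene with made-up attribute values.
scene = [
    LatentParticle((0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.7)),
    LatentParticle((0.5, 0.0, 0.25), (0.2, 0.1, 0.1), (0.9, 0.1)),
]

# Shift the second object; in a real model the edited particles would then
# be passed to the learned decoder to render the new scene configuration.
edited_scene = move_particle(scene, 1, (0.25, 0.5, 0.0))
```

In this sketch only the `position` field of the selected particle changes, which mirrors the summary's claim that particle positions can be manipulated independently before decoding.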

Why it matters

For professionals in robotics, computer vision, and virtual reality, 3D-DLP offers a more efficient and interpretable way to represent complex 3D environments. This can lead to more robust robotic manipulation, better scene understanding for autonomous systems, and more intuitive tools for scene generation and editing.

How to implement this in your domain

  1. Explore integrating 3D-DLP or similar object-centric 3D representation learning models into your robotic perception systems.
  2. Utilize the disentangled attributes of 3D latent particles for more interpretable scene understanding and manipulation planning.
  3. Apply the self-supervised learning approach to reduce reliance on extensive labeled 3D datasets for scene decomposition.
  4. Leverage the controllable latent space for generating novel scene configurations or for data augmentation in simulation environments.
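As a rough illustration of the data augmentation and compact-input ideas in the steps above, the following Python sketch jitters particle positions and flattens the result into a single vector that a downstream policy could consume. Everything here is an assumption made for illustration: the dictionary keys, the `jitter_positions` and `flatten_scene` helpers, the jitter range, and the attribute values are hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List

Particle = Dict[str, List[float]]


def jitter_positions(scene: List[Particle], max_shift: float, rng: random.Random) -> List[Particle]:
    """Return a copy of the scene with each particle position randomly shifted."""
    augmented = []
    for particle in scene:
        shifted = [c + rng.uniform(-max_shift, max_shift) for c in particle["position"]]
        # Copy the particle and overwrite only its position.
        augmented.append({**particle, "position": shifted})
    return augmented


def flatten_scene(scene: List[Particle]) -> List[float]:
    """Concatenate every particle's attributes into one flat feature vector."""
    flat: List[float] = []
    for particle in scene:
        flat.extend(particle["position"])
        flat.extend(particle["box_size"])
        flat.extend(particle["appearance"])
    return flat


# Toy two-object scene with made-up attribute values.
scene = [
    {"position": [0.0, 0.0, 0.0], "box_size": [0.1, 0.1, 0.1], "appearance": [0.2, 0.7]},
    {"position": [0.5, 0.0, 0.25], "box_size": [0.2, 0.1, 0.1], "appearance": [0.9, 0.1]},
]

rng = random.Random(0)  # fixed seed so the augmentation is repeatable
augmented = jitter_positions(scene, 0.05, rng)
policy_input = flatten_scene(augmented)
```

With two particles of eight values each, the flattened input has 16 numbers, which hints at how much smaller a particle set can be than a dense voxel grid. In practice the augmented particles would be decoded by the trained model, which this sketch does not attempt.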

Who benefits

Robotics, Autonomous Vehicles, Virtual Reality, Gaming, Industrial Automation

Key takeaways

  • 3D-DLP is a self-supervised model for learning object-centric 3D scene representations.
  • It decomposes scenes into 3D latent particles, each with disentangled attributes like position and size.
  • The learned latent space is interpretable and controllable, enabling novel scene generation.
  • Leveraging these compact representations improves performance in robotic manipulation tasks.

Original post by Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

"arXiv:2606.19451v1 Announce Type: new Abstract: We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, ea…"

