3D Perception and Manipulation in Embodied Intelligence

Advances in 3D Perception and Manipulation for Embodied Intelligence

Recent work in embodied intelligence has made significant progress in 3D perception and manipulation, driven by approaches that bridge the gap between 2D image understanding and 3D spatial reasoning. Integrating dedicated 3D perception models with foundation models has enabled more accurate and robust spatial understanding, which is crucial for object detection, visual grounding, and long-horizon task execution in diverse environments. These advances are particularly notable in mobile manipulation, where robots must generalize across object configurations and perform manipulation tasks well beyond simple pick-and-place.
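To make the 2D-to-3D bridge concrete, the sketch below shows one common way image-centric systems can attach 2D backbone features to 3D query points: project each point into the image with the camera parameters and bilinearly sample the feature map. This is a minimal illustration of the general idea, not the architecture of any specific paper listed below; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def lift_2d_features_to_3d(feat_2d, points_3d, intrinsics, extrinsics):
    """Sample 2D image features at the projections of 3D query points.

    feat_2d:    (C, H, W)  feature map from a pretrained 2D backbone
    points_3d:  (N, 3)     query points in world coordinates
    intrinsics: (3, 3)     camera intrinsic matrix
    extrinsics: (4, 4)     world-to-camera transform
    Returns:    (N, C)     per-point image features (zeroed if behind camera)
    """
    C, H, W = feat_2d.shape
    # World -> camera coordinates.
    homog = torch.cat([points_3d, torch.ones(len(points_3d), 1)], dim=1)  # (N, 4)
    cam = (extrinsics @ homog.T).T[:, :3]                                 # (N, 3)
    valid = cam[:, 2] > 1e-3                                              # in front of camera
    # Camera -> pixel coordinates.
    pix = (intrinsics @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-3)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * pix[:, 0] / (W - 1) - 1,
                        2 * pix[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)
    sampled = F.grid_sample(feat_2d[None], grid, align_corners=True)      # (1, C, 1, N)
    feats = sampled[0, :, 0].T                                            # (N, C)
    return feats * valid[:, None]
```

Features gathered this way can then feed a 3D detection or grounding head while the heavy lifting stays in the pretrained 2D backbone, which is what makes expressive image features usable for 3D tasks.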

One key innovation is the use of generative and foundation models to characterize and manipulate deformable objects, addressing the challenges posed by semi-fluid and fluid-like materials. These models improve the detection of the keypoints needed for manipulation while reducing dependence on pixel-level information, making the resulting solutions more robust and generalizable.
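As a toy illustration of the keypoint idea, a segmented blob of semi-fluid material can be reduced to a handful of geometric keypoints instead of dense pixel correspondences. The helper below is hypothetical and assumes a binary mask produced upstream by a foundation segmentation model; it extracts the centroid and the two ends of the object's principal axis.

```python
import numpy as np

def keypoints_from_mask(mask):
    """Derive sparse manipulation keypoints from a binary object mask.

    `mask` is an (H, W) boolean array, e.g. from a foundation segmentation
    model (assumed here).  Rather than tracking per-pixel correspondences,
    the deformable object is summarized by its centroid and the two extreme
    points along its principal axis.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) pixel coordinates
    centroid = pts.mean(axis=0)
    # Principal axis via PCA on the mask pixels.
    centered = pts - centroid
    cov = centered.T @ centered / len(pts)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axis = eigvecs[:, np.argmax(eigvals)]            # dominant direction
    proj = centered @ axis
    tip_a = pts[np.argmin(proj)]                     # one end of the object
    tip_b = pts[np.argmax(proj)]                     # other end
    return {"centroid": centroid, "tip_a": tip_a, "tip_b": tip_b}
```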

Another significant trend is the use of causal and geometric reasoning within data augmentation frameworks for imitation learning, which has been shown to improve policy performance, generalization, and sample efficiency. By incorporating invariance, equivariance, and causal structure, these methods offer a principled approach to data augmentation that bridges geometric symmetries and causal reasoning.
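A minimal sketch of the equivariance side of such augmentation, assuming a planar pick task whose demonstrations are equivariant under rotations about the workspace origin: the same random rotation is applied to the observed object position and to the demonstrated action, yielding new, geometrically consistent training pairs. This is only illustrative and not the RoCoDA procedure itself.

```python
import numpy as np

def equivariant_augment(obs_xy, action_xy, rng, max_angle=np.pi):
    """Generate an augmented (observation, action) pair for imitation learning.

    The same random planar rotation is applied to the object position in the
    observation and to the commanded end-effector target, exploiting the
    task's equivariance under SE(2) rotations about the workspace origin.
    Task-irrelevant parts of the state (not shown) would stay unchanged,
    reflecting invariance.
    """
    theta = rng.uniform(-max_angle, max_angle)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ obs_xy, rot @ action_xy

rng = np.random.default_rng(0)
obs = np.array([0.3, 0.1])      # object position from the demonstration
act = np.array([0.3, 0.1])      # gripper target from the demonstration
new_obs, new_act = equivariant_augment(obs, act, rng)
```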

In the realm of 3D-language feature fields, models that generalize to unseen environments and support real-time construction and dynamic updates have opened new possibilities for embodied tasks such as vision-and-language navigation and situated question answering. These models capture semantic and spatial relationships through multi-scale encoders, producing representations aligned with language at multiple granularities.
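The sketch below illustrates, under the assumption of CLIP-style language-aligned point features, how such a feature field might be queried with free-form text: normalize both sides and rank points by cosine similarity, then treat the best-scoring region as a navigation goal or as grounding for an answer. Names and shapes are illustrative, not taken from any of the cited systems.

```python
import torch

def query_feature_field(point_feats, text_embedding):
    """Score every 3D point of a language-aligned feature field against a query.

    point_feats:    (N, D) features stored in the field, assumed to share an
                    embedding space with a CLIP-style text encoder
    text_embedding: (D,)   embedding of a free-form language query
    Returns:        (N,)   cosine similarity per point
    """
    pf = torch.nn.functional.normalize(point_feats, dim=-1)
    te = torch.nn.functional.normalize(text_embedding, dim=-1)
    return pf @ te

# Toy usage with random tensors standing in for a real encoder's outputs.
feats = torch.randn(1000, 512)
query = torch.randn(512)
scores = query_feature_field(feats, query)
goal_idx = scores.argmax()   # index of the point best matching the query
```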

Noteworthy papers in this area include:

  • A novel image-centric 3D perception model that leverages expressive image features and surpasses prior state-of-the-art results on the EmbodiedScan benchmark.
  • A method for long-horizon loco-manipulation in diverse environments, demonstrating higher grasping success rates and practical robot applications.
  • A self-supervised framework for monocular depth and pose estimation in endoscopy, significantly outperforming existing methods in challenging conditions.

Sources

BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

WildLMa: Long Horizon Loco-Manipulation in the Wild

Leveraging Foundation Models To learn the shape of semi-fluid deformable objects

RoCoDA: Counterfactual Data Augmentation for Data-Efficient Robot Learning from Demonstrations

g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

GMFlow: Global Motion-Guided Recurrent Flow for 6D Object Pose Estimation

Spatially Visual Perception for End-to-End Robotic Learning

RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training

SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning

Geometric Point Attention Transformer for 3D Shape Reassembly

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

Towards Cross-device and Training-free Robotic Grasping in 3D Open World

PDZSeg: Adapting the Foundation Model for Dissection Zone Segmentation with Visual Prompts in Robot-assisted Endoscopic Submucosal Dissection

GAPartManip: A Large-scale Part-centric Dataset for Material-Agnostic Articulated Object Manipulation

G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

XR-MBT: Multi-modal Full Body Tracking for XR through Self-Supervision with Learned Depth Point Cloud Registration

Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation
