Generative Models for Vision and Robotics

Report on Recent Developments in the Research Area

General Direction of the Field

Recent advancements in this research area are marked by a significant shift towards more robust and versatile models, particularly in self-supervised learning, unsupervised segmentation, depth estimation, and generative image synthesis. The field is moving towards integrating advanced generative models, such as diffusion models and autoregressive transformers, with traditional discriminative approaches to address complex tasks in computer vision and robotics.

One of the key trends is the elimination of tokenization and discretization in learning objectives, which has been shown to improve performance in tasks like masked particle modeling and image generation. This shift towards more direct and continuous representations is enabling models to capture finer details and generalize better across diverse datasets.
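
To make this concrete, the sketch below shows what a tokenization-free masked-modeling objective can look like: instead of predicting discrete codebook indices for masked inputs, a transformer regresses the original continuous features directly. The architecture, module names, and sizes are illustrative assumptions in generic PyTorch, not the setup of any particular paper.

    # Minimal sketch of a tokenization-free masked-modeling objective:
    # the model regresses continuous features at masked positions
    # rather than classifying discrete token ids over a codebook.
    # All sizes and module choices are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ContinuousMaskedModel(nn.Module):
        def __init__(self, dim=256, depth=4, heads=8, num_patches=64):
            super().__init__()
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, dim)  # regress continuous features

        def forward(self, feats, mask):
            # feats: (B, N, dim) continuous inputs; mask: (B, N) bool, True = masked
            x = torch.where(mask[..., None], self.mask_token.expand_as(feats), feats)
            x = self.encoder(x + self.pos_emb)
            pred = self.head(x)
            # Regression loss on masked positions only -- no codebook,
            # no cross-entropy over discrete token indices.
            return nn.functional.mse_loss(pred[mask], feats[mask])

    model = ContinuousMaskedModel()
    feats = torch.randn(2, 64, 256)
    mask = torch.rand(2, 64) < 0.4
    loss = model(feats, mask)
    loss.backward()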

Another notable trend is the incorporation of multi-modal data, such as combining RGB and depth images, to enhance the robustness and accuracy of perception tasks. This is particularly evident in semantic segmentation and depth estimation, where models are being designed to handle noisy and incomplete data more effectively.
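
The sketch below illustrates one common fusion pattern under simple assumptions: a hypothetical two-stream model with separate RGB and depth encoders, where invalid (zero) depth pixels are masked out so that missing or noisy measurements do not corrupt the fused features. It is a minimal illustration, not the design of any of the papers listed here.

    # Minimal sketch of late-fusion RGB-D semantic segmentation with
    # masking of invalid depth. Hypothetical toy encoders; a real model
    # would use deep backbones and more sophisticated fusion.
    import torch
    import torch.nn as nn

    class RGBDSegHead(nn.Module):
        def __init__(self, num_classes=19, dim=32):
            super().__init__()
            self.rgb_enc = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
            self.depth_enc = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.ReLU())
            self.fuse = nn.Conv2d(2 * dim, num_classes, 1)

        def forward(self, rgb, depth):
            # Zero depth is treated as a missing measurement and masked out.
            valid = (depth > 0).float()
            f_rgb = self.rgb_enc(rgb)
            f_depth = self.depth_enc(depth) * valid
            return self.fuse(torch.cat([f_rgb, f_depth], dim=1))

    model = RGBDSegHead()
    logits = model(torch.randn(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
    print(logits.shape)  # (1, 19, 64, 64)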

The use of diffusion models for depth sensing and semantic segmentation is also gaining traction, as these models demonstrate superior performance in challenging scenarios with complex materials and lighting conditions. The integration of geometric constraints from traditional vision methods with learning-based approaches is further enhancing the reliability of these models in real-world applications.
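
To make the diffusion-based formulation concrete, the sketch below shows a single DDPM-style reverse step for depth: a denoiser conditioned on the RGB image predicts the noise in a noisy depth map, which is then removed and rescaled. The tiny convolutional denoiser and the scalar noise-schedule values are simplifying assumptions for illustration only.

    # Minimal sketch of one reverse-diffusion (DDPM-style) step for
    # depth estimation conditioned on RGB. The toy denoiser stands in
    # for a real U-Net; schedule values are illustrative.
    import torch
    import torch.nn as nn

    class CondDenoiser(nn.Module):
        """Predicts the noise in a depth map, conditioned on RGB."""
        def __init__(self, dim=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim, 1, 3, padding=1),
            )

        def forward(self, noisy_depth, rgb):
            return self.net(torch.cat([noisy_depth, rgb], dim=1))

    def ddpm_step(model, x_t, rgb, alpha_t, alpha_bar_t):
        # Standard DDPM posterior mean: subtract predicted noise, rescale.
        eps = model(x_t, rgb)
        mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5
        return mean  # stochastic noise term omitted for the final step

    model = CondDenoiser()
    rgb = torch.randn(1, 3, 64, 64)
    x_t = torch.randn(1, 1, 64, 64)  # start from pure noise
    x_prev = ddpm_step(model, x_t, rgb, alpha_t=0.99, alpha_bar_t=0.5)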

Noteworthy Papers

  1. Is Tokenization Needed for Masked Particle Modelling?
    This paper introduces a novel approach to masked particle modeling that eliminates tokenization, significantly improving performance across various downstream tasks in high-energy physics.

  2. D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation
    The proposed framework unifies depth estimation and restoration, achieving state-of-the-art performance in challenging scenarios with translucent or specular surfaces.

  3. DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
    This work recasts monocular depth estimation as an autoregressive refinement task, leveraging dynamic target formulations and multi-modal guidance to outperform prior methods.

  4. Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer
    The proposed diffusion-based framework for RGB-D semantic segmentation achieves state-of-the-art performance, particularly in challenging scenarios with noisy depth measurements.

  5. MaskBit: Embedding-free Image Generation via Bit Tokens
    This study demonstrates a novel embedding-free image generation method using bit tokens, achieving a new state-of-the-art FID score on the ImageNet benchmark with a compact model.

Sources

Is Tokenization Needed for Masked Particle Modelling?

CUS3D: CLIP-based Unsupervised 3D Segmentation via Object-level Denoise

D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation

DepthART: Monocular Depth Estimation as Autoregressive Refinement Task

Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer

MaskBit: Embedding-free Image Generation via Bit Tokens
