Advances in 3D Scene Understanding and Vision-Language Models

The field of 3D scene understanding is advancing rapidly, driven by vision-language models and their applications in areas such as urban analytics, embodied navigation, and autonomous driving. Recent research has focused on extending vision-language models to urban-scale environments, generating expressive 3D captions, and building benchmarks for evaluating 3D scene understanding. Noteworthy papers include OpenCity3D, which demonstrates strong zero-shot and few-shot capabilities in urban analytics, and ExCap3D, which introduces the new task of expressive 3D captioning. Other notable papers, such as IRef-VLA and EMPLACE, contribute benchmarks and methods for 3D scene understanding, while MLLM-For3D and P3Nav propose new approaches to 3D reasoning segmentation and embodied navigation, respectively. These advances have significant implications for applications such as planning, policy, and environmental monitoring, and demonstrate the potential of vision-language models to drive progress in 3D scene understanding.

Sources

OpenCity3D: What do Vision-Language Models know about Urban Environments?

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Hi-ALPS -- An Experimental Robustness Quantification of Six LiDAR-based Object Detection Systems for Autonomous Driving

IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes

EMPLACE: Self-Supervised Urban Scene Change Detection

MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding

OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion

AskSport: Web Application for Sports Question-Answering

VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Efficient Joint Prediction of Multiple Future Tokens

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving
