The field of 3D scene understanding is advancing rapidly, driven by vision-language models and their applications in areas such as urban analytics, embodied navigation, and autonomous driving. Recent research has focused on extending vision-language models to urban-scale environments, generating expressive 3D captions, and building benchmarks for evaluating 3D scene understanding. Noteworthy papers include OpenCity3D, which demonstrates strong zero-shot and few-shot performance in urban analytics, and ExCap3D, which introduces the task of expressive 3D captioning. IRef-VLA and EMPLACE contribute new benchmarks and methods for 3D scene understanding, while MLLM-For3D and P3Nav propose novel approaches to 3D reasoning segmentation and embodied navigation, respectively. These advances carry significant implications for applications such as planning, policy, and environmental monitoring, and underscore the potential of vision-language models to drive progress in 3D scene understanding.