Advancements in Vision-Language Models for Autonomous Driving

The field of autonomous driving is increasingly leveraging Vision-Language Models (VLMs) and Large Language Models (LLMs) to enhance scene understanding, decision-making, and motion planning. A notable trend is the shift towards smaller, more efficient models that can be deployed in real-world scenarios without the prohibitive computational cost of larger models. Recent work applies VLMs to multi-modality information processing for cooperative dispatching and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, and distills knowledge from multi-modal LLMs into vision-based planners for safer and more efficient driving. There is also a focus on generative planning models that integrate 3D vision-language pre-training to bridge the gap between visual perception and linguistic understanding, aiming for more robust and generalizable autonomous driving systems.
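To make the distillation idea concrete, below is a minimal sketch of how knowledge from a large multi-modal teacher could be transferred into a lightweight vision-based student planner. It assumes the teacher's planning features have already been projected to the student's feature dimension and that teacher trajectories are available as targets; the module name VisionPlanner, the loss weights, and the overall structure are illustrative assumptions, not the method of DiMA or any of the cited papers.

```python
# Illustrative sketch only: distilling a multi-modal LLM "teacher" planner into a
# lightweight vision-based "student" planner. Names and loss weights are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionPlanner(nn.Module):
    """Small camera-only planner: encodes images and predicts future waypoints."""
    def __init__(self, feat_dim=256, horizon=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(feat_dim, horizon * 2)  # (x, y) per future step

    def forward(self, images):
        feats = self.encoder(images)
        traj = self.head(feats).view(images.size(0), -1, 2)
        return feats, traj

def distillation_step(student, images, teacher_feats, teacher_traj, gt_traj,
                      w_feat=1.0, w_traj=1.0, w_gt=1.0):
    """One training step combining feature matching with the (frozen) teacher,
    imitation of the teacher's trajectory, and ground-truth supervision."""
    feats, traj = student(images)
    loss_feat = F.mse_loss(feats, teacher_feats)      # align with teacher's latent space
    loss_traj = F.smooth_l1_loss(traj, teacher_traj)  # imitate the teacher's plan
    loss_gt = F.smooth_l1_loss(traj, gt_traj)         # stay anchored to real labels
    return w_feat * loss_feat + w_traj * loss_traj + w_gt * loss_gt
```

At inference time only the small student runs onboard, which is the efficiency argument behind this line of work: the expensive multi-modal LLM is used solely during training.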

Noteworthy Papers

  • DriVLM: Explores the utility of small-scale MLLMs in autonomous driving, advancing their application in real-world scenarios.
  • CoDriveVLM: Introduces a VLM-enhanced framework for AMoD systems, improving cooperative dispatching and motion planning, validated through high-fidelity simulations.
  • Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving: Analyzes effective knowledge distillation for semantic scene representation, enhancing downstream decision-making.
  • Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving: Proposes a novel model for end-to-end autonomous driving with strong generalization and real-time potential.
  • Distilling Multi-modal Large Language Models for Autonomous Driving: Presents DiMA, a framework that distills knowledge from a multi-modal LLM into a vision-based planner, significantly improving planning efficiency and safety.

Sources

DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving

CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems

Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving

Distilling Multi-modal Large Language Models for Autonomous Driving
