Recent advances in embodied AI and vision-and-language navigation (VLN) show a marked shift toward scalable, self-refining data generation. Researchers are increasingly building autonomous pipelines that iteratively improve their own training datasets without human intervention, which raises both the quality and diversity of the data and yields more robust, generalizable models. Notably, the integration of real-world data sources, such as web-based videos and online tutorials, has been pivotal in extending VLN to open-world scenarios and complex digital environments. These innovations are enabling zero-shot navigation and, on some controlled benchmarks, pushing agents past human-level performance. The scalability and cost-efficiency of these methods also make large-scale training more feasible, paving the way for more autonomous and capable AI systems.
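To make the self-refining loop concrete, the following is a minimal Python sketch of a data flywheel of the kind described above. The `train_generator` and `train_navigator` callables, the placeholder `Trajectory`/`Instruction` types, and the fidelity threshold are illustrative assumptions, not any specific paper's implementation.

```python
from typing import Callable

# Placeholder types; real systems use panorama/path sequences, not strings.
Trajectory = str
Instruction = str
Pair = tuple[Trajectory, Instruction]


def data_flywheel(
    seed_pairs: list[Pair],
    unlabeled: list[Trajectory],
    train_generator: Callable[[list[Pair]], Callable[[Trajectory], Instruction]],
    train_navigator: Callable[[list[Pair]], Callable[[Trajectory, Instruction], float]],
    rounds: int = 3,
    threshold: float = 0.7,
) -> list[Pair]:
    """Grow a (trajectory, instruction) training pool without human labeling.

    Each round: (1) train an instruction generator ("speaker") and a
    navigator ("follower") on the current pool, (2) label unlabeled
    trajectories with generated instructions, (3) keep only pairs the
    navigator can follow faithfully, and (4) fold them back into the pool.
    """
    pool = list(seed_pairs)
    for _ in range(rounds):
        generate = train_generator(pool)  # trajectory -> instruction
        score = train_navigator(pool)     # fidelity score, e.g. SPL or nDTW

        kept = []
        for traj in unlabeled:
            instr = generate(traj)
            # Quality filter: discard pairs the navigator cannot execute.
            if score(traj, instr) >= threshold:
                kept.append((traj, instr))

        pool.extend(kept)  # better data trains better models next round
    return pool
```

The key design choice is using the navigator itself as the data filter: as both models improve, the filter tightens, so each round admits higher-quality pairs than the last.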
Noteworthy papers include one introducing a self-refining data flywheel that achieves superior VLN performance; one leveraging web-based room tour videos for geometry-aware instruction tuning, yielding significant VLN improvements; and one generating GUI agent trajectories from web tutorials, demonstrating improved grounding and planning performance.