Advances in Video-Language Alignment and Temporal Reasoning

Video-language understanding is advancing rapidly, with much recent work focused on temporal reasoning and on tighter alignment between video and language. New frameworks optimize this alignment through techniques such as curriculum learning and preference optimization, and these approaches are reported to improve performance across a range of benchmarks. Synthetic videos produced by text-to-video generation models have also been explored as a means of enhancing video-language alignment. In parallel, researchers have investigated hallucination in large multimodal models and proposed mitigation methods such as supervised reasoning fine-tuning and direct preference optimization.
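
Direct preference optimization comes up more than once above, so a minimal sketch of the standard per-pair DPO loss may help make it concrete. It is not taken from any of the papers listed here: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of a preferred and a rejected response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (illustrative sketch, hypothetical names).

    Each argument is the summed log-probability of one response, either
    under the policy being trained or under the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(logits)); beta scales the preference margin.
    logits = beta * (chosen_margin - rejected_margin)

    # Numerically stable evaluation of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss falls below log(2) once the policy favors the preferred
# response more strongly than the reference model does.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In a video setting the two responses would typically be answers to the same video and question. How the preference pairs are built and ordered during training differs from paper to paper and is not modeled in this sketch.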

Some noteworthy papers include TEMPO, which proposes a systematic framework for enhancing video-language models' temporal reasoning capabilities through direct preference optimization; EgoToM, which introduces a new video question-answering benchmark for evaluating theory of mind reasoning from egocentric videos; and VPO, which presents a principled framework for optimizing prompts based on harmlessness, accuracy, and helpfulness to improve the safety and quality of generated videos.

Sources

TEMPO: Temporal Preference Optimization of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment

Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization

What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Can Text-to-Video Generation help Video-Language Alignment?

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Video-R1: Reinforcing Video Reasoning in MLLMs

How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos
