Advances in Video Understanding

Video understanding research is advancing rapidly, with particular focus on methods for analyzing and interpreting long videos. Recent work centers on leveraging large language models (LLMs) and multi-modal approaches to improve action recognition, action grounding, and video representation learning. Notably, integrating LLMs with vision-language models has shown promising results on complex action tasks, outperforming traditional methods. In parallel, novel frame selection strategies and self-reflective sampling methods have been proposed to improve both the efficiency and the accuracy of long-video understanding.
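To give a concrete sense of what query-aware frame selection involves, below is a minimal sketch assuming CLIP-style encoders that produce L2-normalized per-frame and per-query embeddings. The function name and signature are illustrative assumptions, not an API from any of the papers surveyed here.

```python
import numpy as np

def select_frames(frame_embeddings, query_embedding, k=8):
    """Pick the k frames whose embeddings best match a text query.

    frame_embeddings: (T, D) array of per-frame visual features.
    query_embedding:  (D,) array encoding the user's question.
    Both are assumed to be L2-normalized (e.g., CLIP-style encoders),
    so cosine similarity reduces to a dot product.
    """
    scores = frame_embeddings @ query_embedding  # shape (T,)
    # Keep the top-k scoring frames, but restore temporal order so the
    # downstream vision-language model sees a coherent clip.
    top = np.argsort(scores)[-k:]
    return np.sort(top)
```

The design choice worth noting is the final sort: selection is done by relevance, but presentation to the model stays chronological, which matters for questions about temporal order.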

Some noteworthy papers in this area include:

CountLLM proposes a large language model-based framework for repetitive action counting, demonstrating superior performance and generalization capabilities.

LLaVAction evaluates and improves multi-modal large language models for action recognition, achieving state-of-the-art results on several benchmarks.

FALCONEye introduces a video agent that combines a vision-language model and a large language model to search for and localize relevant information in hour-long videos, showing superior performance on the FALCON-Bench benchmark.

VideoGEM proposes a training-free spatial action grounding method built on pre-trained image- and video-language backbones, outperforming current trained state-of-the-art approaches.

Self-ReS presents a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments, improving long-video task accuracy and inference speed (a sketch of this style of sampling appears after the list).

BOLT boosts large vision-language models without additional training through a comprehensive study of frame selection strategies, increasing accuracy on several benchmarks.

AMNAR proposes an adaptive multiple normal action representation framework for error detection in procedural tasks, achieving state-of-the-art performance.
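To make the self-reflective sampling idea concrete, here is a minimal sketch of an iterative resampling loop. This is not the published Self-ReS algorithm: the `vlm` callable, the per-frame relevance scores, and the fixed ±30-frame neighborhood are all assumptions for illustration.

```python
import numpy as np

def self_reflective_sample(num_frames, question, vlm, rounds=2, budget=16):
    """Iteratively refocus a fixed frame budget on the video regions the
    model itself judges relevant, instead of a single uniform pass.

    `vlm` is a hypothetical callable:
        vlm(frame_indices, question) -> (answer, relevance)
    where `relevance` scores each sampled frame's usefulness in [0, 1].
    Assumes num_frames >= budget.
    """
    # Round 0: uniform sampling over the whole video.
    idx = np.linspace(0, num_frames - 1, budget).astype(int)
    answer, relevance = vlm(idx, question)

    for _ in range(rounds):
        # Build a sampling density that peaks around frames the model
        # flagged as relevant, then redraw the budget from it.
        density = np.full(num_frames, 1e-6)
        for i, r in zip(idx, relevance):
            lo, hi = max(0, i - 30), min(num_frames, i + 30)
            density[lo:hi] += r
        density /= density.sum()
        idx = np.sort(np.random.choice(num_frames, size=budget,
                                       replace=False, p=density))
        answer, relevance = vlm(idx, question)
    return answer
```

The point of the loop is that frames are a scarce resource for long videos: rather than spending the whole budget uniformly, the model's own feedback reallocates it toward promising segments over successive rounds.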

Sources

CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model

LLaVAction: Evaluating and Training Multi-modal Large Language Models for Action Recognition

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

VideoGEM: Training-free Action Grounding in Videos

Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
