Research on video understanding is advancing rapidly, with particular attention to methods for analyzing and interpreting long videos. Recent work centers on leveraging large language models (LLMs) and multi-modal approaches to improve action recognition, action grounding, and video representation learning. Notably, integrating LLMs with vision-language models has shown promising results on complex action tasks, outperforming traditional methods. In addition, novel frame selection strategies and self-reflective sampling methods have been proposed to improve both the efficiency and the accuracy of long-video understanding.
Some noteworthy papers in this area include:
- CountLLM proposes a large language model-based framework for repetitive action counting, demonstrating superior performance and generalization.
- LLaVAction evaluates and improves multi-modal large language models for action recognition, achieving state-of-the-art results on several benchmarks.
- FALCONEye introduces a video agent that combines a vision-language model and a large language model to search for and localize relevant information in hour-long videos, showing superior performance on FALCON-Bench.
- VideoGEM proposes a training-free spatial action grounding method built on pre-trained image- and video-language backbones, outperforming current trained state-of-the-art approaches.
- Self-ReS presents a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments, improving accuracy on long-video tasks and speeding up inference.
- BOLT boosts large vision-language models without additional training through a comprehensive study of frame selection strategies, increasing accuracy on several benchmarks (a generic illustration of this idea is sketched after this list).
- AMNAR proposes an adaptive multiple normal action representation framework for error detection in procedural tasks, achieving state-of-the-art performance.
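The frame selection theme shared by BOLT and Self-ReS can be illustrated with a short sketch: score each frame of a long video against the text query and keep only the top-k frames, in temporal order, before handing them to a vision-language model. This is a minimal, hedged illustration of training-free query-aware frame selection in general, not a reproduction of either paper's method; the function names (`select_frames`, `embed_text`, `embed_frames`) are invented for this example, and the embedding functions are placeholders standing in for a real image/text encoder.

```python
"""Minimal sketch of training-free, query-aware frame selection for long-video QA.

Assumes a generic image/text encoder; the placeholder embeddings below are random
stand-ins and do not reflect BOLT's or Self-ReS's actual scoring schemes.
"""
import numpy as np


def embed_text(query: str, dim: int = 512) -> np.ndarray:
    # Placeholder for a real text encoder: returns a deterministic unit vector.
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


def embed_frames(frames: np.ndarray, dim: int = 512) -> np.ndarray:
    # Placeholder for a real image encoder: one (random) unit vector per frame.
    rng = np.random.default_rng(len(frames))
    v = rng.standard_normal((len(frames), dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def select_frames(frames: np.ndarray, query: str, k: int = 8) -> np.ndarray:
    """Score every frame against the query and keep the top-k, in temporal order."""
    q = embed_text(query)
    f = embed_frames(frames)
    scores = f @ q                 # cosine similarity (both sides are unit vectors)
    top = np.argsort(scores)[-k:]  # indices of the k best-matching frames
    return np.sort(top)            # restore temporal ordering for the VLM


if __name__ == "__main__":
    video = np.zeros((256, 64, 64, 3), dtype=np.uint8)  # dummy 256-frame video
    keep = select_frames(video, "When does the person pour the coffee?", k=8)
    print("Selected frame indices:", keep)
```

Sorting the retained indices preserves temporal order, which most video-language models expect; a real pipeline would swap the placeholder encoders for CLIP-style embeddings or the VLM's own visual features, and the papers above layer more sophisticated scoring and sampling on top of this basic idea.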