Efficient Video Understanding and Processing

Research in video understanding and processing is converging on methods that handle large volumes of video data more efficiently. Key trends include models built for real-time processing, such as streaming video understanding and online video interaction, and improved video quality assessment and enhancement techniques, including multimodal approaches that combine visual and audio information. Researchers are also exploring dataset distillation and compression, which reduce the computational cost of training and deploying video understanding models.

Noteworthy papers in this area include ProVideLLM, which achieves state-of-the-art results on procedural video understanding tasks while reducing memory and compute requirements, and TimeChat-Online, which introduces a novel approach for real-time video interaction that reduces video tokens by 82.8% while maintaining 98% performance.
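To make the token-reduction idea concrete, below is a minimal, hypothetical sketch of one common way to exploit temporal redundancy in streaming video: dropping frame embeddings that are nearly identical to the last kept frame. The function name, the cosine-similarity criterion, and the 0.95 threshold are illustrative assumptions, not the actual method of TimeChat-Online or any other paper listed under Sources.

```python
# Illustrative sketch only: prune near-duplicate frame tokens in a stream.
# The similarity criterion and threshold are assumptions for demonstration,
# not the mechanism used by any specific paper cited in this digest.
import numpy as np


def prune_redundant_frames(frame_embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of frames to keep, skipping near-duplicates.

    frame_embeddings: (num_frames, dim) array of per-frame features.
    threshold: cosine similarity above which a frame counts as redundant.
    """
    kept = [0]  # always keep the first frame
    last = frame_embeddings[0]
    for i in range(1, len(frame_embeddings)):
        current = frame_embeddings[i]
        cos_sim = np.dot(last, current) / (
            np.linalg.norm(last) * np.linalg.norm(current) + 1e-8
        )
        if cos_sim < threshold:  # frame differs enough from the last kept one
            kept.append(i)
            last = current
    return kept


if __name__ == "__main__":
    # Synthetic stream: 100 frames with slow drift, so many consecutive
    # frames are near-duplicates and get pruned.
    rng = np.random.default_rng(0)
    base = rng.normal(size=64)
    frames = np.stack([base + 0.01 * i * rng.normal(size=64) for i in range(100)])
    kept = prune_redundant_frames(frames)
    print(f"kept {len(kept)} of {len(frames)} frames "
          f"({100 * (1 - len(kept) / len(frames)):.1f}% reduction)")
```

The design choice here is purely local (each frame is compared only to the last kept frame), which keeps memory constant and makes the filter usable online; more sophisticated schemes can also merge tokens or compress them hierarchically.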

Sources

Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Plug-and-Play Versatile Compressed Video Enhancement

MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Latent Video Dataset Distillation

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
