Efficient Video Understanding with Large Language Models

The field of video understanding is moving toward efficient, dynamic representations of video content that let large language models interpret video data effectively. Recent work focuses on reducing computational cost and improving performance, particularly in scenarios that require extreme token compression. Notable advances include frameworks that disentangle video representations, separating visual embeddings from motion information, and that introduce novel attention mechanisms to integrate motion features without increasing token length. Another area of progress is encoder-free models that directly model nuanced video-language interactions, reducing FLOPs and inference latency. In addition, research on video anomaly detection has produced online surveillance assistants that unify prediction, detection, and analysis within a single framework and support real-time inference on streaming video. Some noteworthy papers include:

  • Token Dynamics, which reduces the token count to 0.07% of the original with only a minor performance drop.
  • VALLR, which proposes a phoneme-centric framework for Visual Automatic Speech Recognition, achieving state-of-the-art performance on two challenging datasets.
  • Mobile-VideoGPT, which presents an efficient multimodal framework with real-time throughput, outperforming existing state-of-the-art models.
  • AssistPDA, which introduces an online video anomaly surveillance assistant that enables real-time inference and interactive user engagement.
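
To make concrete the idea of integrating motion features without increasing token length, here is a minimal sketch of plain cross-attention in which visual tokens act as queries and motion features act as keys and values. It is an illustration under stated assumptions, not the method of any paper listed here: the function names, the dimensions, the random inputs, the residual connection, and the absence of learned projection matrices are all simplifications chosen for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual_tokens, motion_features):
    """Fuse motion information into visual tokens (hypothetical helper).

    visual_tokens:   list of N vectors of dimension d, used as queries
    motion_features: list of M vectors of dimension d, used as keys and values
    Returns a list of N vectors, so the token count does not grow with M.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual_tokens:
        # Scaled dot-product score between this visual token and each motion feature.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in motion_features]
        weights = softmax(scores)
        # Weighted average of the motion features (the attention context).
        context = [sum(w * value[j] for w, value in zip(weights, motion_features))
                   for j in range(dim)]
        # Residual connection (an assumption of this sketch): add context to the token.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


rng = random.Random(0)
visual = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]    # N = 4 tokens
motion = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]   # M = 32 features
fused = cross_attend(visual, motion)
print(len(visual), len(fused))  # the two counts are equal by construction
```

Because the output has exactly one row per query, the sequence handed to the language model stays at N tokens however many motion features are attended over. That is the property the sketch is meant to show; a real system would add learned projections and multiple heads.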

Sources

Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models

Anomize: Better Open Vocabulary Video Anomaly Detection

VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Breaking the Encoder Barrier for Seamless Video-Language Understanding

Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

VALLR: Visual ASR Language Model for Lip Reading

Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis
