Efficient Processing of Visual Data for Interactive Applications

The field of multi-modal large language models (MLLMs) is moving toward more efficient processing of visual data, particularly in streaming video settings. Researchers are exploring methods that cut computational overhead and improve real-time performance, such as foveated instance segmentation, tokenization of gaze data, and slow-fast architectures. These advances could enable more seamless human-computer interaction and human-augmentation applications. Notable papers in this area include:

  • GazeLLM, which optimizes first-person video analysis by integrating eye-tracking data so that processing concentrates on the regions the wearer is actually attending to (a gaze-cropping sketch follows this list), and
  • Slow-Fast Architecture for Video Multi-Modal Large Language Models, which proposes an architecture that balances temporal resolution and spatial detail under a limited compute budget (see the second sketch below).
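
To make the gaze-driven idea concrete, here is a minimal sketch of foveated cropping guided by a gaze point: full resolution is kept only in a window around where the user is looking, and the periphery is coarsely downsampled, so far fewer pixels (and hence visual tokens) need to be encoded per frame. The function name, window sizes, and downsampling scheme are illustrative assumptions, not GazeLLM's actual implementation.

```python
# Foveated cropping sketch: high-res fovea around the gaze point plus a
# low-res peripheral view. All sizes are illustrative assumptions.
import numpy as np

def foveated_crop(frame, gaze_xy, fovea=224, periphery=56):
    """Split a frame into a high-res foveal crop and a low-res periphery."""
    h, w, _ = frame.shape
    x, y = gaze_xy
    half = fovea // 2
    # Clamp the foveal window so it stays inside the frame.
    x0 = min(max(x - half, 0), w - fovea)
    y0 = min(max(y - half, 0), h - fovea)
    fovea_patch = frame[y0:y0 + fovea, x0:x0 + fovea]
    # Crude strided downsample of the whole frame as the peripheral view
    # (a real pipeline would use a proper resize).
    sh, sw = h // periphery, w // periphery
    periphery_patch = frame[::sh, ::sw][:periphery, :periphery]
    return fovea_patch, periphery_patch

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # one first-person frame
fov, per = foveated_crop(frame, gaze_xy=(900, 300))
print(fov.shape, per.shape)  # (224, 224, 3) (56, 56, 3): ~17x fewer pixels
```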

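The slow-fast idea can likewise be sketched as a token-budget split: a fast pathway covers every frame with heavily pooled tokens for temporal coverage, while a slow pathway spends the remaining budget on a handful of full-detail frames for spatial detail. The helper below is a hedged illustration; the random arrays stand in for a vision encoder's features, and all sizes and names are assumptions rather than the paper's configuration.

```python
# Slow-fast token-budget sketch: pooled tokens for every frame (fast),
# full-detail tokens for a few uniformly sampled frames (slow).
import numpy as np

def split_token_budget(num_frames, budget=2048, tokens_full=256,
                       tokens_pooled=16, hidden=768):
    fast_cost = num_frames * tokens_pooled             # every frame, pooled hard
    slow_frames = max((budget - fast_cost) // tokens_full, 1)
    slow_idx = np.linspace(0, num_frames - 1, slow_frames, dtype=int)
    fast = np.random.randn(num_frames, tokens_pooled, hidden)  # encoder stand-in
    slow = np.random.randn(slow_frames, tokens_full, hidden)   # encoder stand-in
    return slow, slow_idx, fast

slow, idx, fast = split_token_budget(num_frames=64)
print(slow.shape, fast.shape, idx)  # (4, 256, 768) (64, 16, 768) [ 0 21 42 63]
```

Under this split, a 64-frame clip fits in 2,048 visual tokens while still retaining four frames at full detail.
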
Sources

Foveated Instance Segmentation

Tokenization of Gaze Data

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

GazeLLM: Multimodal LLMs incorporating Human Visual Attention

Slow-Fast Architecture for Video Multi-Modal Large Language Models
