The field of multi-modal large language models (MLLMs) is moving toward more efficient processing of visual data, particularly in streaming video contexts. Researchers are exploring methods that reduce computational overhead and improve real-time performance, such as foveated instance segmentation, tokenization of gaze data, and slow-fast architectures (the latter two are sketched in code after the list below). These advances could enable more seamless human-computer interaction and human-augmentation applications. Notable papers in this area include:
- GazeLLM, which makes first-person video analysis more efficient by integrating eye-tracking data so that processing concentrates on gaze-attended regions, and
- Slow-Fast Architecture for Video Multi-Modal Large Language Models, which proposes an architecture that balances temporal resolution and spatial detail under a limited compute budget.
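
The gaze-tokenization idea can be illustrated with a simple quantization scheme: map each normalized (x, y) fixation onto a coarse grid and emit the cell index as a discrete token that can be interleaved with visual tokens. The sketch below is a generic illustration, not the scheme from any of the papers above; the grid size, special-token offset, and the `tokenize_gaze` name are all hypothetical choices.

```python
import numpy as np

def tokenize_gaze(points, grid=16, n_special=2):
    """Quantize normalized (x, y) gaze coordinates into discrete
    grid-cell token ids an LLM can consume. The 16x16 grid and the
    special-token offset are illustrative assumptions."""
    points = np.clip(np.asarray(points, dtype=np.float64), 0.0, 1.0 - 1e-9)
    cols = (points[:, 0] * grid).astype(int)   # x -> column index
    rows = (points[:, 1] * grid).astype(int)   # y -> row index
    return n_special + rows * grid + cols      # flatten to a single token id

gaze_trace = [(0.12, 0.40), (0.55, 0.52), (0.90, 0.91)]
print(tokenize_gaze(gaze_trace))               # [ 99 138 240]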
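
The slow-fast idea can likewise be made concrete with a minimal two-pathway sketch: a slow pathway keeps a dense spatial grid from a temporal subsample of frames, while a fast pathway keeps every frame but pools it to a coarse grid. The class name, dimensions, and strides (`SlowFastTokenizer`, `slow_stride`, `fast_grid`) are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SlowFastTokenizer(nn.Module):
    """Toy two-pathway video tokenizer: slow = few frames, full spatial
    detail; fast = all frames, coarse spatial detail. Sizes are
    illustrative, not taken from the paper."""

    def __init__(self, dim=256, slow_stride=8, fast_grid=4):
        super().__init__()
        self.slow_stride = slow_stride                      # slow path: every Nth frame
        self.fast_pool = nn.AdaptiveAvgPool2d(fast_grid)    # fast path: coarse spatial grid
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # shared patch embedding

    def forward(self, video):                  # video: (T, 3, H, W)
        feats = self.patch(video)              # (T, dim, H/16, W/16)
        # Slow pathway: temporal subsample, full spatial grid.
        slow = feats[:: self.slow_stride]
        slow_tok = slow.flatten(2).transpose(1, 2).reshape(-1, slow.shape[1])
        # Fast pathway: every frame, pooled to a coarse grid.
        fast = self.fast_pool(feats)
        fast_tok = fast.flatten(2).transpose(1, 2).reshape(-1, fast.shape[1])
        # Concatenate into one token sequence for the LLM.
        return torch.cat([slow_tok, fast_tok], dim=0)

tok = SlowFastTokenizer()
tokens = tok(torch.randn(32, 3, 224, 224))     # a 32-frame clip
print(tokens.shape)                            # (4*196 + 32*16, 256) = (1296, 256)
```

In this split, the slow pathway's token count depends on spatial resolution but barely on clip length, while the fast pathway adds only a handful of tokens per additional frame, which is one way to keep the total token budget bounded as video length grows.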