The field of video-language understanding is moving toward more efficient and effective models that capture fine-grained spatial relationships and long-range temporal dynamics. Recent work emphasizes targeted alignment and optimization techniques over massive pretraining or architectural modifications. Noteworthy papers include VideoPASTA, which achieves significant gains on standard video benchmarks through targeted preference optimization, and Eagle 2.5, a generalist framework for long-context multimodal learning that delivers substantial improvements on long-context multimodal benchmarks. In addition, the IV-Bench benchmark highlights the importance of image-grounded video perception and reasoning in multimodal large language models, identifying key factors that influence performance and offering useful guidance for future research.
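The preference-optimization idea referenced above is described only at a high level here; as a rough illustration of the general mechanism (not VideoPASTA's actual objective), the following is a minimal sketch of a DPO-style preference loss over preferred versus dispreferred responses. All function and variable names are hypothetical and not drawn from the paper.

```python
# Minimal sketch of a DPO-style preference-optimization loss, shown only to
# illustrate the general "preference optimization" mechanism mentioned above.
# This is NOT VideoPASTA's objective; all names here are hypothetical.
import torch
import torch.nn.functional as F

def dpo_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(preferred response | video, prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(dispreferred response | video, prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under a frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # temperature controlling deviation from the reference
) -> torch.Tensor:
    """Push the policy to favor the preferred response over the dispreferred
    one, regularized toward the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x); average over the batch of preference pairs
    return F.softplus(-logits).mean()

# Usage with a batch of 4 preference pairs and made-up log-probabilities.
if __name__ == "__main__":
    b = 4
    loss = dpo_preference_loss(
        policy_chosen_logps=torch.randn(b),
        policy_rejected_logps=torch.randn(b),
        ref_chosen_logps=torch.randn(b),
        ref_rejected_logps=torch.randn(b),
    )
    print(loss.item())
```

In this kind of setup, the preference pairs would contrast responses grounded in the video against responses exhibiting targeted failure modes (e.g., spatial or temporal hallucinations), but the exact pair-construction strategy is specific to each paper and is not reproduced here.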