Efficient Visual Language Models

The field of visual language models (VLMs) is moving toward greater efficiency and flexibility, with a focus on reducing computational overhead while preserving performance on fine-grained visual understanding tasks. Recent innovations include adaptive tokenization strategies, which let a model adjust its number of vision tokens to the complexity of the task, and training paradigms that keep performance stable across varying token counts. Researchers are also exploring conditional token reduction and mixtures of multi-modal experts to improve visual reasoning while minimizing computational cost. Notable papers in this area include TokenFLEX, whose flexible-token training consistently outperforms fixed-token counterparts; SmolVLM, which achieves state-of-the-art performance for its size with a minimal memory footprint; and LEO-MINI, which reduces the number of visual tokens while boosting visual reasoning capabilities.
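To make the flexible-token idea concrete, here is a minimal sketch of one common way to vary the vision token budget at inference time: average-pooling the vision encoder's patch-token grid down to an arbitrary target count. This is an illustrative assumption, not the actual TokenFLEX or LEO-MINI mechanism (those differ in detail); the function name `pool_vision_tokens` and the square-grid simplification are hypothetical.

```python
# Illustrative sketch only: pooling a ViT patch-token grid down to an
# arbitrary target token count, so one model can trade visual detail
# for compute at inference time.
import torch
import torch.nn.functional as F

def pool_vision_tokens(tokens: torch.Tensor, num_out: int) -> torch.Tensor:
    """Reduce (B, N, D) patch tokens to (B, num_out, D).

    Assumes N and num_out are perfect squares so the token sequence can
    be treated as a 2D grid (a simplification for this sketch).
    """
    b, n, d = tokens.shape
    side_in = int(n ** 0.5)
    side_out = int(num_out ** 0.5)
    # Reshape the token sequence back into its 2D spatial layout.
    grid = tokens.transpose(1, 2).reshape(b, d, side_in, side_in)  # (B, D, H, W)
    # Average-pool to the target grid size.
    pooled = F.adaptive_avg_pool2d(grid, side_out)                 # (B, D, h, w)
    # Flatten back to a token sequence.
    return pooled.flatten(2).transpose(1, 2)                       # (B, num_out, D)

# Example: 576 patch tokens (24x24) pooled to 64 (8x8) for a cheap pass,
# or to 256 (16x16) when the task needs finer visual detail.
vit_tokens = torch.randn(1, 576, 1024)
print(pool_vision_tokens(vit_tokens, 64).shape)   # torch.Size([1, 64, 1024])
print(pool_vision_tokens(vit_tokens, 256).shape)  # torch.Size([1, 256, 1024])
```

Training a single model across several such token budgets, rather than one fixed budget, is the kind of regime the flexible-token papers above investigate.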

Sources

TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference

Window Token Concatenation for Efficient Visual Large Language Models

LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts

EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively

SmolVLM: Redefining small and efficient multimodal models

Ternarization of Vision Language Models for use on edge devices
