The field of medical image understanding is advancing rapidly, driven by frameworks and models that can efficiently process and integrate multiple modalities of medical images and videos. Recent research has focused on flexible, adaptable models that learn from diverse data sources, including publicly available educational videos and heterogeneous datasets, and these efforts have produced state-of-the-art results on a range of benchmarks. Notably, unified frameworks for multimodal medical understanding and efficient vision-language models now enable seamless integration of textual data with diverse visual modalities, while novel approaches to parameter-efficient adaptation and dynamic merging of visual tokens reduce computational cost while maintaining high task performance. Noteworthy papers include Efficient Parameter Adaptation for Multi-Modal Medical Image Segmentation and Prognosis, which proposes a parameter-efficient multi-modal adaptation framework for lightweight upgrading of transformer-based segmentation models, and OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding, which presents a unified framework for multimodal medical understanding that achieves state-of-the-art performance on 7 benchmarks.
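
To give a rough sense of the parameter-efficient adaptation idea mentioned above, the sketch below wraps a frozen transformer block with a small trainable bottleneck adapter, so that only the adapter weights are updated during fine-tuning. This is a minimal illustrative example, not the method of either cited paper: the class names (`BottleneckAdapter`, `AdaptedBlock`), the bottleneck width, and the use of PyTorch's `nn.TransformerEncoderLayer` as a stand-in backbone block are all assumptions made for the sketch.

```python
# Illustrative sketch of bottleneck-adapter-style parameter-efficient adaptation.
# Names, dimensions, and the backbone block are assumptions, not taken from the
# cited papers.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small trainable residual module: down-project, nonlinearity, up-project."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project to a low-rank space
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter update


class AdaptedBlock(nn.Module):
    """Wrap a frozen backbone block with a trainable adapter."""

    def __init__(self, block: nn.Module, dim: int, bottleneck: int = 64):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False              # freeze the original block
        self.adapter = BottleneckAdapter(dim, bottleneck)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))       # adapter refines the frozen output


# Usage example: only the adapter parameters receive gradients.
block = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
adapted = AdaptedBlock(block, dim=256)
tokens = torch.randn(2, 196, 256)                # (batch, tokens, dim)
out = adapted(tokens)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable}")
```

In this style of adaptation, the frozen backbone retains its pretrained representations while the lightweight adapter (here roughly 33k parameters per block) absorbs the task- or modality-specific changes, which is what keeps the upgrade inexpensive relative to full fine-tuning.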