Multimodal research is advancing rapidly, driven by large language models and their application to vision-language understanding. One key trend is the incorporation of diverse perspectives and cultural backgrounds into multimodal models, with the aim of reducing perceptual bias and improving flexibility. Another notable direction is the study of emotional expression and understanding in multimodal interaction, which could make human-machine interaction more empathetic. Researchers are also working to improve the compositional understanding of multimodal models so that they better capture object attributes and the relationships between objects described in images and text. Noteworthy papers include EmoSEM, which introduces a framework for segmenting and explaining emotion stimuli in visual art, and AdaViP, which proposes an adaptive, vision-enhanced preference optimization method for aligning multimodal large language models with human preferences.
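
To make the alignment direction concrete, the sketch below shows the generic preference-optimization objective (a DPO-style loss) that methods in this family build on: the policy is trained to favor the preferred response over the rejected one relative to a frozen reference model. This is a minimal illustration under that assumption, not AdaViP's specific formulation; its adaptive, vision-enhanced components are paper-specific and omitted here.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta: float = 0.1):
    """DPO-style preference loss.

    Each argument is a tensor of per-example sequence log-probabilities:
    from the policy being trained (logp_*) and from a frozen reference
    model (ref_logp_*). The loss pushes the policy to increase the margin
    between chosen and rejected responses relative to the reference.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (margin difference)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with dummy log-probabilities (batch of 4 preference pairs)
logp_c = torch.tensor([-12.3, -10.1, -15.0, -11.2])
logp_r = torch.tensor([-13.0, -12.4, -14.8, -13.5])
ref_c = torch.tensor([-12.5, -10.5, -15.2, -11.0])
ref_r = torch.tensor([-12.8, -12.0, -14.9, -13.1])
print(preference_loss(logp_c, logp_r, ref_c, ref_r))
```

In a multimodal setting, the chosen/rejected responses would be scored conditioned on both the image and the text prompt; vision-aware extensions such as AdaViP additionally shape the preference signal using visual evidence.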