Multimodal research is advancing rapidly, driven by large language models and their application to vision-language understanding. One key trend is the incorporation of diverse perspectives and cultural backgrounds into multimodal models, with the aim of reducing perceptual bias and improving flexibility. Another notable direction is the study of emotional expression and understanding in multimodal interaction, which could make human-machine interaction more empathetic. Researchers are also working to improve the compositional understanding of multimodal models so that they better capture object attributes and the relationships between objects described in images and text. Noteworthy papers include EmoSEM, which introduces a framework for segmenting and explaining emotion stimuli in visual art, and AdaViP, which proposes an adaptive, vision-enhanced preference optimization method for aligning multimodal large language models with human preferences.
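
To make the alignment direction concrete, the sketch below shows the generic preference-optimization objective (a DPO-style loss) that methods in this family build on: the policy is trained to favor the preferred response over the rejected one relative to a frozen reference model. This is a minimal illustration under that assumption, not AdaViP's specific formulation; its adaptive, vision-enhanced components are paper-specific and omitted here.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta: float = 0.1):
    """DPO-style preference loss.

    Each argument is a tensor of per-example sequence log-probabilities:
    from the policy being trained (logp_*) and from a frozen reference
    model (ref_logp_*). The loss pushes the policy to increase the margin
    between chosen and rejected responses relative to the reference.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (margin difference)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with dummy log-probabilities (batch of 4 preference pairs)
logp_c = torch.tensor([-12.3, -10.1, -15.0, -11.2])
logp_r = torch.tensor([-13.0, -12.4, -14.8, -13.5])
ref_c = torch.tensor([-12.5, -10.5, -15.2, -11.0])
ref_r = torch.tensor([-12.8, -12.0, -14.9, -13.1])
print(preference_loss(logp_c, logp_r, ref_c, ref_r))
```

In a multimodal setting, the chosen/rejected responses would be scored conditioned on both the image and the text prompt; vision-aware extensions such as AdaViP additionally shape the preference signal using visual evidence.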