Multimodal Large Language Models (MLLMs)

Current Developments in Multimodal Large Language Models (MLLMs)

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly pushed the boundaries of what these models can achieve, particularly in integrating visual and textual information for complex reasoning tasks. The field is moving towards models that perform robust multimodal reasoning, interpret ambiguous instructions, and handle real-world scenarios with greater accuracy and consistency.

General Direction of the Field

  1. Enhanced Multimodal Reasoning: There is a growing emphasis on developing benchmarks and frameworks that assess and improve the abductive reasoning capabilities of MLLMs. This includes understanding cause-and-effect relationships from visual data and making plausible inferences, which is crucial for applications like accident prevention and video verification.

  2. Ambiguity Resolution: The field is increasingly focused on resolving lexical and visual ambiguities through multimodal inputs. This involves creating datasets and models that can effectively use visual cues to disambiguate textual information, a capability that is essential for tasks involving puns, jokes, and complex instructions.

  3. Structural Understanding and Instruction Following: Advances are being made in teaching MLLMs to understand and follow complex visual instructions, such as building LEGO structures or performing assembly tasks. These models are being trained to generate their own instructions and reason about the assembly process step-by-step, which is a significant leap towards more practical AI applications.

  4. Contextual Integration in VQA: Research is exploring how different modalities (text and image) interact in Visual Question Answering (VQA) tasks. The goal is to understand how complementary information from both modalities can improve reasoning quality and answer accuracy, while also identifying scenarios where contradictory information harms performance.

  5. Modality-Agnostic Vision Understanding: There is a push to develop models that can understand visual semantics embedded in text strings, such as ASCII art. This involves benchmarking models on tasks that require recognizing visual concepts from textual inputs, highlighting the need for better training techniques to enhance information fusion among modalities.

  6. Real-World Robustness and Consistency: MLLMs are being rigorously tested for robustness to real-world corruptions and for consistency, i.e., whether they produce semantically equivalent answers when a query is rephrased or its input is mildly perturbed. Benchmarks are being developed to evaluate both aspects, ensuring that MLLMs can perform reliably in practical applications; a minimal sketch of such a consistency check follows this list.
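The following is a minimal sketch of the kind of consistency check mentioned in point 6, assuming a hypothetical `query_mllm(image_path, question)` wrapper around the model under evaluation; the rephrasings and the token-overlap agreement metric are illustrative stand-ins, not those of any specific benchmark.

```python
from itertools import combinations

def query_mllm(image_path: str, question: str) -> str:
    # Hypothetical stand-in: replace with a real call to the MLLM under evaluation.
    return "The car is red."

def token_overlap(a: str, b: str) -> float:
    """Crude agreement proxy: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def consistency_score(image_path: str, rephrasings: list[str]) -> float:
    """Average pairwise agreement of answers to semantically equivalent questions."""
    answers = [query_mllm(image_path, q) for q in rephrasings]
    pairs = list(combinations(answers, 2))
    return sum(token_overlap(a, b) for a, b in pairs) / max(len(pairs), 1)

rephrasings = [
    "What color is the car in the image?",
    "Describe the color of the car shown in the picture.",
    "Which color best describes the car?",
]
print(consistency_score("example.jpg", rephrasings))  # 1.0 with the placeholder answer
```

A real benchmark would swap the token-overlap proxy for an embedding- or judge-based similarity and average the score over many images and question sets.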

Noteworthy Innovations

  • NL-Eye: Introduces a benchmark for assessing VLMs' visual abductive reasoning skills, highlighting a deficiency in current models' ability to infer outcomes and causes from visual data.
  • UNPIE: A novel benchmark for resolving lexical ambiguities using multimodal inputs, showing that models improve significantly when given visual context.
  • Visual-O1: Proposes a multi-modal, multi-turn chain-of-thought reasoning framework to address the limitations of models in understanding ambiguous instructions (an illustrative clarification loop in this spirit is sketched after this list).
  • MM-R$^3$: Focuses on the consistency of MLLMs, proposing a benchmark that evaluates consistency alongside accuracy and introducing a mitigation strategy to improve it.
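As a rough illustration of the multi-turn idea behind Visual-O1, the sketch below asks a model to enumerate interpretations of an ambiguous instruction, reason step by step, and only then commit to an answer. The `chat_with_mllm` helper, the prompt wording, and the turn count are assumptions for illustration, not the paper's actual algorithm.

```python
def chat_with_mllm(image_path: str, messages: list[dict]) -> str:
    # Assumed helper: send the image plus the running message list to an MLLM chat
    # API and return its reply. Returns a canned string here so the sketch runs.
    return "Interpretation chosen; reasoning omitted in this placeholder."

def answer_ambiguous_instruction(image_path: str, instruction: str, turns: int = 2) -> str:
    """Generic multi-turn loop: enumerate interpretations, reason, then answer."""
    messages = [{
        "role": "user",
        "content": (
            f"Instruction: {instruction}\n"
            "List the plausible interpretations of this instruction given the image."
        ),
    }]
    for _ in range(turns):
        reply = chat_with_mllm(image_path, messages)
        messages.append({"role": "assistant", "content": reply})
        messages.append({
            "role": "user",
            "content": (
                "Pick the most plausible interpretation, explain your reasoning "
                "step by step, and revise it if the image contradicts it."
            ),
        })
    messages.append({"role": "user",
                     "content": "Now give a final, unambiguous answer to the instruction."})
    return chat_with_mllm(image_path, messages)

print(answer_ambiguous_instruction("scene.jpg", "Put the cup next to the plant."))
```

In practice the number of turns and the prompts would be tuned per task, and the intermediate interpretations can be surfaced for inspection or human correction.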

These developments underscore the rapid evolution of MLLMs towards more sophisticated, reliable, and context-aware systems, capable of handling a wide range of real-world tasks with greater precision and robustness.

Sources

NL-Eye: Abductive NLI for Images

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

Learning to Build by Building Your Own Instructions

Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities

Visual Perception in Text Strings

AiBAT: Artificial Intelligence/Instructions for Build, Assembly, and Test

Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Computer Vision Intelligence Test Modeling and Generation: A Case Study on Smart OCR

LCM: Log Conformal Maps for Robust Representation Learning to Mitigate Perspective Distortion

Gamified crowd-sourcing of high-quality data for visual fine-tuning

TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions

A Framework for Reproducible Benchmarking and Performance Diagnosis of SLAM Systems

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?

Multimodal Situational Safety

HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding

Personalized Visual Instruction Tuning

Understanding the AI-powered Binary Code Similarity Detection

Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
