Advancements in Vision-Language Models: Addressing Hallucination, Negation, and Efficiency

Recent work on vision-language models (VLMs) and multimodal large language models (MLLMs) has concentrated on a few persistent challenges: hallucination, negation awareness, and efficient adaptation to new tasks. One notable trend is the use of 3D representations and multiview images to strengthen visual grounding and thereby reduce hallucinations about object attributes. Another is a growing emphasis on negation understanding, pursued through dedicated data-generation pipelines and new benchmarks.

Efficiency is the other major thread. Few-shot and zero-shot adaptation is being reframed through kernel perspectives, and embedding-driven diversity sampling is being used to improve synthetic data generation. Unified frameworks that combine visual understanding and generation within a single autoregressive model mark a further step, offering one backbone for a range of vision-centric tasks. Training-free knowledge mining and language-guided vision token pruning aim to cut computational overhead while preserving performance. Finally, applications of VLMs in specialized domains such as medical image classification and manufacturing quality control underscore the models' adaptability and potential for real-world impact.
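
As a concrete illustration of the diversity-sampling idea mentioned above, the sketch below greedily selects a mutually distant subset of embeddings that could seed synthetic data generation. This is a minimal sketch under assumed shapes; the function name and the max-min selection rule are illustrative choices, not the cited paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): greedy farthest-point
# selection over embeddings, a common way to pick a diverse few-shot seed set
# before prompting a generator for synthetic data.
import numpy as np

def diverse_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Pick k indices whose embeddings are mutually far apart (greedy max-min)."""
    # Normalize so distances behave like cosine dissimilarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Start from the point farthest from the centroid.
    selected = [int(np.argmax(np.linalg.norm(emb - emb.mean(0), axis=1)))]
    # Track each point's distance to its closest already-selected point.
    min_dist = np.full(len(emb), np.inf)
    for _ in range(k - 1):
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[selected[-1]], axis=1))
        selected.append(int(np.argmax(min_dist)))
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(500, 64))   # stand-in for text/image embeddings
    seeds = diverse_subset(fake_embeddings, k=8)   # indices of a diverse seed set
    print(seeds)
```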

Noteworthy Papers

  • Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions: Introduces MIAVLM, a method leveraging multiview images and negative instructions to reduce hallucinations in LVLMs.
  • FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization: Proposes FiLo++ for improved anomaly detection through fine-grained descriptions and deformable localization.
  • Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP: Develops NegationCLIP, enhancing CLIP's negation understanding with a new benchmark, NegRefCOCOg.
  • ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models: Offers a theoretical understanding and enhancement of few-shot adaptation methods, introducing ProKeR.
  • VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model: Presents VARGPT, a model uniting visual understanding and generation within a single autoregressive framework.
  • LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models: Introduces LVPruning, which reduces the computational burden of MLLMs by pruning vision tokens based on their interaction with language (see the sketch after this list).
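
For the token-pruning item above, the following is a minimal sketch of the general idea: rank vision tokens by how strongly they interact with the text tokens and keep only the top-k before they reach the language model. The shapes, the `prune_vision_tokens` name, and the attention-based scoring rule are assumptions for illustration, not LVPruning's actual module.

```python
# Illustrative sketch (assumed shapes and scoring, not LVPruning's code):
# keep only the vision tokens most relevant to the language input.
import torch

def prune_vision_tokens(vision_tokens: torch.Tensor,   # (num_vision, dim)
                        text_tokens: torch.Tensor,     # (num_text, dim)
                        keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep the vision tokens that interact most strongly with the text."""
    # Cross-attention-style relevance: how much each vision token attends to the text.
    scale = vision_tokens.shape[-1] ** -0.5
    attn = torch.softmax(vision_tokens @ text_tokens.T * scale, dim=-1)  # (num_vision, num_text)
    relevance = attn.max(dim=-1).values              # strongest text interaction per vision token
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    keep = relevance.topk(k).indices.sort().values   # preserve the original token order
    return vision_tokens[keep]

if __name__ == "__main__":
    vision = torch.randn(576, 4096)   # e.g. ViT patch tokens projected to the LLM width
    text = torch.randn(32, 4096)      # embedded instruction tokens
    pruned = prune_vision_tokens(vision, text, keep_ratio=0.25)
    print(pruned.shape)               # torch.Size([144, 4096])
```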

Sources

Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions

FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models

Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation

KPL: Training-Free Medical Knowledge Mining of Vision-Language Models

Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model

VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

Adapting OpenAI's CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples

TeD-Loc: Text Distillation for Weakly Supervised Object Localization

RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering

LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models

Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning
