Advances in Multimodal Reasoning and Visual Grounding

The field of multimodal learning is advancing rapidly, with particular attention to visual reasoning and grounding. Recent research emphasizes models that integrate visual and textual information to reason about complex scenes and objects. A persistent weakness is reasoning about occluded objects: current models struggle to accurately count and identify objects that are partially hidden. New benchmarks address this gap, including CAPTURe, which evaluates spatial reasoning in vision-language models through occluded object counting, and VisuLogic, which evaluates visual reasoning in multi-modal large language models.

There is also growing interest in models that reason about spatial relations and perspective. The Abstract Perspective Change (APC) framework, proposed in Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation, shows promising results by simulating a change of viewpoint before answering perspective-dependent questions.

Overall, the field is moving toward more sophisticated, human-like visual reasoning: models that combine multiple sources of information and reason reliably about complex scenes and objects.
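
To make the occlusion challenge concrete, the sketch below illustrates the kind of amodal counting CAPTURe probes: extrapolating a regular pattern into the region hidden behind an occluder. The grid layout, the axis-aligned occluder box, and all function names are illustrative assumptions, not details taken from the benchmark.

```python
import itertools

def count_with_occlusion(visible, occluder, spacing):
    """Estimate the total count of objects laid out on a regular grid
    when part of the grid is hidden behind an axis-aligned occluder.

    visible  -- list of (x, y) centers of the objects that are in view
    occluder -- (xmin, ymin, xmax, ymax) box hiding part of the scene
    spacing  -- (dx, dy) grid spacing, assumed estimated upstream
    """
    cols = sorted({round(x / spacing[0]) for x, _ in visible})
    rows = sorted({round(y / spacing[1]) for _, y in visible})
    total = 0
    for i, j in itertools.product(range(cols[0], cols[-1] + 1),
                                  range(rows[0], rows[-1] + 1)):
        cx, cy = i * spacing[0], j * spacing[1]
        occluded = (occluder[0] <= cx <= occluder[2]
                    and occluder[1] <= cy <= occluder[3])
        seen = any(abs(cx - x) < spacing[0] / 2
                   and abs(cy - y) < spacing[1] / 2 for x, y in visible)
        # Count a grid cell if an object is visible there, or if the
        # cell is occluded and the pattern implies an object behind it.
        if seen or occluded:
            total += 1
    return total

# A 3x3 grid with unit spacing whose center object is hidden:
# the pattern implies 9 objects even though only 8 are visible.
visible = [(c, r) for r in range(3) for c in range(3) if (c, r) != (1, 1)]
print(count_with_occlusion(visible, occluder=(0.6, 0.6, 1.4, 1.4),
                           spacing=(1.0, 1.0)))  # -> 9
```

A model that only counts what it sees reports 8 here; pattern extrapolation recovers the ninth object behind the occluder, which is exactly the gap such benchmarks measure.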

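The geometric core of perspective change can likewise be sketched in a few lines: re-express an object's position in another agent's egocentric frame, then answer the spatial question there. APC itself constructs a richer scene abstraction and simulates the viewpoint change over it; the version below assumes known 2D positions and headings, and its function names are hypothetical.

```python
import math

def to_egocentric(point, viewer_pos, viewer_heading):
    """Re-express a world-frame 2D point in a viewer's egocentric frame.

    viewer_heading is the facing direction in radians (0 = +x axis).
    In the returned frame, +x is the viewer's right and +y is ahead.
    """
    dx = point[0] - viewer_pos[0]
    dy = point[1] - viewer_pos[1]
    # Project the offset onto the viewer's forward and right axes.
    ahead = dx * math.cos(viewer_heading) + dy * math.sin(viewer_heading)
    right = dx * math.sin(viewer_heading) - dy * math.cos(viewer_heading)
    return right, ahead

def left_or_right(obj, viewer_pos, viewer_heading):
    """Answer 'is obj to the viewer's left or right?' by changing
    perspective first, rather than reasoning in camera coordinates."""
    right, _ = to_egocentric(obj, viewer_pos, viewer_heading)
    return "right" if right > 0 else "left"

# A viewer at the origin facing +y sees a point at (1, 1) on their right.
print(left_or_right((1, 1), viewer_pos=(0, 0),
                    viewer_heading=math.pi / 2))  # -> right
```

After the frame change, "left or right of the viewer" reduces to a sign check, which is the intuition behind simulating a perspective shift before reasoning instead of answering directly from the camera's viewpoint.
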
Sources

Visual Intention Grounding for Egocentric Assistants

The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

Visual Place Cell Encoding: A Computational Model for Spatial Representation and Cognitive Mapping

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Vision language models are unreliable at trivial spatial cognition

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
