Advances in Visual Question Answering

The field of Visual Question Answering (VQA) is moving towards more fine-grained and hierarchical approaches, with a focus on better understanding complex, visually grounded questions. Recent studies highlight the importance of modeling the relationships between images and text, as well as the need for more effective debiasing techniques. Noteworthy papers include:

  • HiCA-VQA, which proposes a hierarchical modeling approach with cross-attention fusion for medical visual question answering, achieving state-of-the-art results on the Rad-Restruct benchmark (see the fusion sketch after this list).
  • QIRL, which introduces a novel framework for optimized question-image relation learning, demonstrating effectiveness and generalization ability on VQA-CPv2 and VQA-v2 benchmarks.
  • UniRVQA, which presents a unified framework for retrieval-augmented vision question answering via self-reflective joint training, achieving competitive performance against state-of-the-art models.
  • CoDI-IQA, which proposes a robust no-reference image quality assessment (NR-IQA) approach that captures the complex interactions between image content and distortions, outperforming state-of-the-art methods in prediction accuracy and generalization ability.

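To make the fusion idea concrete, below is a minimal PyTorch sketch of cross-attention fusion between question tokens and image features, the general mechanism that approaches like HiCA-VQA build on. This is an illustrative approximation, not the paper's implementation; the module names, dimensions, and shapes are assumptions.

```python
# Minimal sketch, not the HiCA-VQA implementation: a generic cross-attention
# fusion block in which question tokens attend to image patch features.
# All names, dimensions, and shapes below are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse question embeddings with image features via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Query = question tokens, key/value = image patches.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, question_tokens, image_patches):
        # question_tokens: (batch, num_tokens, dim); image_patches: (batch, num_patches, dim)
        attended, _ = self.cross_attn(question_tokens, image_patches, image_patches)
        x = self.norm1(question_tokens + attended)  # residual + norm
        return self.norm2(x + self.ffn(x))          # position-wise feed-forward


# Usage: fuse a 20-token question with 196 image patch features.
fusion = CrossAttentionFusion()
q = torch.randn(2, 20, 768)   # question token embeddings
v = torch.randn(2, 196, 768)  # image patch features
out = fusion(q, v)            # (2, 20, 768): question representation grounded in the image
```

In hierarchical variants, a block like this is typically applied per question level (e.g., coarse to fine), so each level's answer prediction is conditioned on image regions relevant to that level.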
Sources

Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning

UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training

Content-Distortion High-Order Interaction for Blind Image Quality Assessment

LiveVQA: Live Visual Knowledge Seeking
