The field of Visual Question Answering (VQA) is moving towards more fine-grained, hierarchical approaches that improve the understanding of complex, visually grounded questions. Recent studies highlight the importance of capturing the relationships between images and text, as well as the need for more effective debiasing techniques. Noteworthy papers include:
- HiCA-VQA, which proposes a hierarchical modeling approach with cross-attention fusion for medical visual question answering, achieving state-of-the-art results on the Rad-Restruct benchmark (a minimal sketch of cross-attention fusion follows this list).
- QIRL, which introduces a novel framework for optimized question-image relation learning, demonstrating effectiveness and generalization ability on VQA-CPv2 and VQA-v2 benchmarks.
- UniRVQA, which presents a unified framework for retrieval-augmented vision question answering via self-reflective joint training, achieving competitive performance against state-of-the-art models.
- CoDI-IQA, which proposes a robust no-reference image quality assessment (NR-IQA) approach that captures the complex interactions between content and distortions, outperforming state-of-the-art methods in prediction accuracy and generalization.
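
Cross-attention fusion, as referenced for HiCA-VQA above, generally lets question tokens attend over image region features so that each word gathers the visual evidence most relevant to it. The sketch below is a minimal, generic PyTorch illustration of that idea; the feature dimension, head count, and single-layer design are assumptions for illustration, not the HiCA-VQA implementation or its hierarchical decoding.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Generic cross-attention fusion: text tokens attend to image regions.

    Illustrative sketch only; dimensions and the single-layer design are
    assumptions, not taken from any specific paper.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the question tokens; keys/values come from image
        # region features, so each token aggregates its relevant visual context.
        attended, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection plus layer norm keeps the original text signal.
        return self.norm(text_feats + attended)


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    text = torch.randn(2, 20, 768)   # 2 questions, 20 tokens each
    image = torch.randn(2, 49, 768)  # 2 images, 49 region/patch features each
    fused = fusion(text, image)
    print(fused.shape)               # torch.Size([2, 20, 768])
```

The fused text representation can then be passed to an answer classifier or decoder; hierarchical variants typically apply such fusion at multiple question levels rather than once.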