Advances in Visual Question Answering and Document Information Extraction

The field of Visual Question Answering (VQA) and document information extraction is moving toward more robust and generalizable methods. Recent work improves the accuracy and reliability of VQA systems by incorporating commonsense knowledge, jointly extracting multiple document fields, and evaluating models on the groundedness of their predictions. The integration of external knowledge sources and structured reasoning techniques is becoming increasingly important, as is the need for evaluation methodologies that account for the semantic and multimodal characteristics of model outputs. Noteworthy papers include MAGIC-VQA, which introduces a framework for integrating commonsense knowledge with Large Vision-Language Models, and FRASE, which proposes a structured-representation approach for generalizable SPARQL query generation. "Where is this coming from?" presents an evaluation methodology that accounts for the groundedness of predictions, yielding a more accurate assessment of model performance.
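The joint-extraction trend can be illustrated with a minimal sketch: rather than issuing one question per field, a single prompt requests every field at once. This is a hypothetical illustration under assumed interfaces, not the method from "Joint Extraction Matters"; the `ask_model` callable, the prompt wording, and the JSON output contract are all assumptions for the sake of the example.

```python
import json
from typing import Callable, Dict, List, Optional


def extract_fields_jointly(
    ask_model: Callable[[str], str],  # hypothetical wrapper around any VQA / vision-language model
    fields: List[str],
) -> Dict[str, Optional[str]]:
    """Request all fields in one prompt instead of asking field by field."""
    prompt = (
        "Extract the following fields from the document image. "
        "Answer with a single JSON object keyed by field name, "
        "using null for any field that is not present.\n"
        f"Fields: {', '.join(fields)}"
    )
    raw = ask_model(prompt)  # assumed to return a JSON string
    return json.loads(raw)


# Hypothetical usage: `ask_model` would wrap an actual model call.
# fields = ["invoice_number", "total_amount", "issue_date"]
# values = extract_fields_jointly(ask_model, fields)
```

Batching the fields into one query lets the model resolve them jointly and with shared document context, which is plausibly the intuition behind the "Joint Extraction Matters" title.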

Sources

Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction

MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Where is this coming from? Making groundedness count in the evaluation of Document VQA models

FRASE: Structured Representations for Generalizable SPARQL Query Generation
