Large Language Models and Multimodal Retrieval

Report on Current Developments in the Research Area

General Trends and Innovations

Recent work in this area is marked by a significant shift towards leveraging the capabilities of Large Language Models (LLMs) for a wide range of tasks, often in zero-shot or unsupervised settings. The trend is driven by the limitations of traditional supervised learning, which requires extensive annotated data and struggles to generalize to new domains or tasks. The focus is increasingly on frameworks that iteratively elicit strong model capabilities from unlabeled data, reducing the dependency on gold labels.
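
As a concrete illustration, the loop below sketches one way such iterative elicitation could work, assuming a hypothetical `llm_predict(example, demos)` interface that returns a label and a confidence score; it is a minimal sketch, not the paper's exact procedure.

```python
# A minimal sketch of the zero-to-strong idea. `llm_predict` is an assumed
# interface: it takes an input and a list of demonstrations and returns a
# (label, confidence) pair.
def zero_to_strong(llm_predict, unlabeled, rounds=3, threshold=0.9, k=16):
    demos = []  # round 0 starts with no demonstrations at all ("zero")
    for _ in range(rounds):
        confident = []
        for x in unlabeled:
            label, conf = llm_predict(x, demos)  # in-context prediction
            if conf >= threshold:
                confident.append((x, label, conf))
        # the most confident pseudo-labels become the next round's
        # demonstrations, so the model bootstraps without gold labels
        confident.sort(key=lambda t: t[2], reverse=True)
        demos = [(x, y) for x, y, _ in confident[:k]]
    return demos
```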

One of the key innovations is the exploration of multimodal approaches, particularly in open-domain question answering (QA), where spoken questions and text passages are handled within a single model. These approaches aim to bypass traditional cascaded pipelines built on automatic speech recognition (ASR), which are resource-intensive and propagate transcription errors into retrieval. Instead, end-to-end multimodal retrievers are being developed that process spoken questions directly, offering a more robust and efficient alternative.
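
The sketch below shows the basic dual-encoder pattern such a retriever rests on. Here `speech_encoder` and `text_encoder` are assumed stand-ins for pretrained models projected into a shared embedding space, not the paper's actual components.

```python
import numpy as np

# Dual-encoder retrieval sketch: embed the spoken question and the text
# passages into one space, then rank passages by cosine similarity.
def retrieve(speech_encoder, text_encoder, spoken_question, passages, k=5):
    q = speech_encoder(spoken_question)                 # (d,) audio embedding
    P = np.stack([text_encoder(p) for p in passages])   # (n, d) passage matrix
    # normalize so the dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    scores = P @ q                                      # (n,) similarities
    top = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in top]
```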

Another notable development is the optimization of retrieval systems, particularly multi-vector methods such as ColBERT, which store one vector per document token and therefore carry a large storage and memory footprint. Researchers are focusing on shrinking this footprint without compromising retrieval quality: techniques such as token pooling significantly reduce the number of vectors that must be stored, making these methods more practical for real-world applications.
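
The following sketch illustrates the idea, using Ward hierarchical clustering as one plausible instantiation; `pool_tokens` and its defaults are illustrative rather than the paper's exact algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Token-pooling sketch: cluster one document's (n, d) token embeddings and
# keep a single mean vector per cluster. A pool_factor of 4 targets the
# roughly 75% reduction reported for ColBERT indexes.
def pool_tokens(token_vectors: np.ndarray, pool_factor: int = 4) -> np.ndarray:
    n = token_vectors.shape[0]
    if n < 2:
        return token_vectors  # nothing to cluster
    n_clusters = max(1, n // pool_factor)
    labels = fcluster(linkage(token_vectors, method="ward"),
                      t=n_clusters, criterion="maxclust")
    # one mean vector per cluster replaces all of that cluster's tokens
    return np.stack([token_vectors[labels == c].mean(axis=0)
                     for c in np.unique(labels)])
```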

Zero-shot learning is also gaining traction, especially in Named Entity Recognition (NER) and Dialogue State Tracking (DST). The ability of LLMs to perform these tasks without task-specific annotated training data is being harnessed to build systems that generalize to unseen entity types or dialogue domains. This is particularly relevant for languages other than English, where annotated data is often scarce.
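
A definition-and-guidelines prompt is one common way to cast zero-shot NER as an LLM task. The template below is a hypothetical illustration of that pattern, not SLIMER-IT's actual instruction format.

```python
# Hypothetical prompt-construction sketch for zero-shot NER: the model is
# shown a definition of a possibly unseen entity type and asked to extract
# matching spans. All wording here is illustrative.
def build_ner_prompt(text: str, entity_type: str, definition: str,
                     guidelines: str) -> str:
    return (
        f"Extract all entities of type '{entity_type}' from the text.\n"
        f"Definition: {definition}\n"
        f"Guidelines: {guidelines}\n"
        f"Text: {text}\n"
        "Answer with a JSON list of matching spans, or [] if there are none."
    )

prompt = build_ner_prompt(
    text="Dante Alighieri nacque a Firenze nel 1265.",
    entity_type="LOCATION",
    definition="Names of cities, regions, countries, or other places.",
    guidelines="Do not tag person names or dates.",
)
```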

Finally, the field is moving towards more principled model selection in text ranking. Instead of relying on human intuition or brute-force fine-tuning of every candidate, researchers are developing methods that estimate transferability from a model's existing ranking capability. These methods aim to capture subtle differences between models and select the most effective one for a given dataset, improving overall performance.
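
A minimal version of this idea is to score every frozen candidate by how well its off-the-shelf embeddings already rank a small labeled sample, then pick the best. The MRR-based sketch below is an assumption about how such an estimator could look, not the paper's method.

```python
import numpy as np

# Hedged model-selection sketch: `candidates` maps a model name to a frozen
# encoder callable; `positives[i]` is the index in `corpus` of the gold
# passage for `queries[i]`. All names and interfaces are assumptions.
def select_ranker(candidates, queries, positives, corpus):
    def mean_reciprocal_rank(encode):
        docs = np.stack([encode(d) for d in corpus])
        docs /= np.linalg.norm(docs, axis=1, keepdims=True)
        rr = []
        for q, gold in zip(queries, positives):
            qv = encode(q)
            qv /= np.linalg.norm(qv)
            order = np.argsort(-(docs @ qv))  # best-first passage ranking
            rr.append(1.0 / (int(np.where(order == gold)[0][0]) + 1))
        return float(np.mean(rr))
    scores = {name: mean_reciprocal_rank(enc)
              for name, enc in candidates.items()}
    return max(scores, key=scores.get), scores  # best model + all scores
```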

Noteworthy Papers

  • Zero-to-Strong Generalization: Introduces a novel paradigm for iteratively eliciting strong capabilities of LLMs using unlabeled data, demonstrating significant potential for both in-context learning and fine-tuning.

  • Multimodal Dense Retrieval: Proposes an ASR-free, end-to-end multimodal retriever for speech-based QA, showing promising performance on shorter questions.

  • Token Pooling for Multi-Vector Retrieval: Presents a clustering-based approach to reduce the footprint of ColBERT indexes by up to 75% with minimal performance degradation.

  • Zero-Shot NER for Italian: Introduces SLIMER-IT, an instruction-tuning approach for zero-shot NER in Italian, outperforming state-of-the-art models on unseen entity tags.

  • Zero-Shot Open-Vocabulary DST: Proposes a unified pipeline for dialogue understanding that integrates domain classification and DST, achieving up to 20% better Joint Goal Accuracy with fewer LLM API requests.

  • Adaptive Ranking Transferability: Develops a method for model selection in text ranking that outperforms both previous transferability estimation methods and human-intuition-based selection.

  • Unsupervised Text Representation Learning: Introduces an instruction-tuning approach for zero-shot dense retrieval that significantly improves performance in low-resource settings with smaller model sizes.

Sources

Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels

A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

SLIMER-IT: Zero-Shot NER on Italian Language

A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding

Leveraging Estimated Transferability Over Human Intuition for Model Selection in Text Ranking

Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval
