Multimodal and Multilingual AI

Report on Current Developments in the Research Area

General Direction of the Field

Recent work in this area shows a marked shift toward multimodal and multilingual approaches to problems that single-modality or monolingual methods have traditionally dominated. The integration of multimodal large language models (MLLMs) and transformer-based architectures is improving performance on such tasks. The trend is especially visible in sentiment analysis, image retrieval, and semantic parsing, where fusing visual and textual data and adding cross-lingual capabilities has produced more robust and versatile solutions.

A key innovation is the application of MLLMs to tasks that require understanding and generating content across different languages and modalities. Examples include models that generate detailed recipes from food images, recognize sign language gestures, and perform cross-lingual sentiment analysis without any target-language training data. Beyond improving accuracy, these models extend applicability to less-resourced languages and diverse cultural contexts.
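To make the zero-shot cross-lingual setting concrete, the sketch below shows one common recipe: fine-tune a multilingual encoder such as XLM-R on labelled sentiment data in a high-resource language only, then apply it unchanged to a target language. The checkpoint name and the example sentences are illustrative placeholders, not artifacts from the papers listed here.

```python
# Minimal sketch of zero-shot cross-lingual sentiment transfer (assumptions noted below).
from transformers import pipeline

# "my-org/xlmr-sentiment-en" is a hypothetical checkpoint: any XLM-R model
# fine-tuned on English-only sentiment labels would play the same role.
classifier = pipeline("text-classification", model="my-org/xlmr-sentiment-en")

# Because the encoder shares a multilingual representation space, the same
# classifier can be applied to a target language it never saw labels for.
examples = [
    "Ta novica je zelo spodbudna.",   # Slovenian: "This news is very encouraging."
    "Ta izdelek me je razočaral.",    # Slovenian: "This product disappointed me."
]
for text in examples:
    print(text, "->", classifier(text))
```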

Another notable direction is improving information retrieval (IR)-based traceability in multilingual software projects. Researchers are addressing the challenges that multilingualism introduces, such as inconsistent terminology and semantic gaps between languages, by exploiting translation variants of artifacts and cross-reference mechanisms. These advances matter for global software development, where collaboration across different linguistic backgrounds is increasingly common.
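As a rough illustration of how translation variants can feed an IR-based traceability pipeline, the sketch below scores a requirement against a code artifact using every available language variant and keeps the best-matching pair. This is a minimal TF-IDF example under our own assumptions, not the AVIATE method; the sample texts and the max-pooling aggregation are illustrative choices.

```python
# Minimal sketch: IR-based trace-link scoring over translation variants.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def trace_score(req_variants, code_variants):
    """req_variants / code_variants: the same artifact expressed in each
    available language, e.g. [original text, machine-translated text]."""
    docs = req_variants + code_variants
    tfidf = TfidfVectorizer().fit_transform(docs)
    n = len(req_variants)
    # Similarity of every requirement variant against every code-artifact variant.
    sims = cosine_similarity(tfidf[:n], tfidf[n:])
    return sims.max()  # keep the best-matching variant pair; averaging is another option

# Illustrative bilingual artifact (not taken from any real project).
req = [
    "用户登录后应跳转到主页",
    "After login the user should be redirected to the home page",
]
code = ["class LoginRedirectHandler handles post-login navigation to home"]
print(trace_score(req, code))
```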

The field is also witnessing a push towards more robust and interpretable models, particularly in the context of large language models (LLMs). Approaches like Gaussian Concept Subspace (GCS) are being developed to provide more stable and reliable representations of concepts, which can be beneficial for tasks requiring fine-grained control over semantic content, such as emotion steering in natural language generation.
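A minimal sketch of the idea behind GCS, using synthetic stand-ins for real LLM activations: rather than committing to a single concept vector, fit a Gaussian over many per-example concept directions and sample steering vectors from it. The dimensionality, the data, and the additive steering rule below are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: each row is the hidden-state direction extracted for one
# example of a concept (e.g. "joy") minus a neutral baseline. Real usage
# would take these from LLM activations; here they are random placeholders.
dim = 256
concept_dirs = rng.normal(loc=1.0, scale=0.3, size=(500, dim))

mu = concept_dirs.mean(axis=0)               # concept mean
cov = np.cov(concept_dirs, rowvar=False)     # spread of the concept subspace
cov += 1e-6 * np.eye(dim)                    # small ridge keeps the covariance well-conditioned

# Sampling several plausible concept vectors is the point: steering can draw
# on the whole subspace instead of a single, possibly noisy, mean direction.
samples = rng.multivariate_normal(mu, cov, size=5)

def steer(hidden_state, concept_vector, alpha=0.1):
    """One common additive steering rule (an assumption, not the paper's exact scheme)."""
    return hidden_state + alpha * concept_vector

h = rng.normal(size=dim)                     # stand-in hidden state
print(steer(h, samples[0]).shape)            # (256,)
```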

Noteworthy Papers

  • FoodMLLM-JP: Demonstrates superior performance in ingredient generation for Japanese recipes using multimodal large language models, outperforming current state-of-the-art models.
  • AVIATE: Introduces a novel strategy for improving traceability recovery in bilingual software projects, significantly enhancing Average Precision and Mean Average Precision.
  • Towards Robust Multimodal Sentiment Analysis with Incomplete Data: Proposes a Language-dominated Noise-resistant Learning Network (LNLN) that consistently outperforms existing baselines in multimodal sentiment analysis.
  • Beyond Single Concept Vector: Introduces Gaussian Concept Subspace (GCS) to improve the robustness and effectiveness of concept representations in large language models.
  • Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval: Proposes LECCR, a method that leverages multimodal large language models to enhance cross-lingual cross-modal retrieval, achieving state-of-the-art performance on multiple benchmarks.

Sources

FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects

Towards Robust Multimodal Sentiment Analysis with Incomplete Data

OM4OV: Leveraging Ontology Matching for Ontology Versioning

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

Evaluating and explaining training strategies for zero-shot cross-lingual news sentiment analysis

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing

Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models

Concept Space Alignment in Multilingual LLMs

Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP

EUFCC-CIR: a Composed Image Retrieval Dataset for GLAM Collections
