Advances in Multi-Modal Perception and Aesthetic Assessment

Research on multi-modal perceptual metrics and aesthetic assessment is advancing rapidly, driven by increasingly capable models and benchmarks. A notable trend is the shift toward unified, multi-task models that aim to capture the complexity of human perception across modalities. These models, often built on large multi-modal language models (LMMs), are fine-tuned on specialized perceptual tasks to improve performance, though they still struggle to generalize to unseen tasks. There is also a growing emphasis on self-supervised learning that exploits large amounts of unlabeled data, particularly for image aesthetic assessment. Innovations in patch selection and embedding-based refinement are making image quality assessment more accurate and efficient, especially for 360-degree images. The field is likewise making progress on automating the evaluation of image transcreation, with new metrics that measure cultural relevance, semantic equivalence, and visual similarity. Together, these developments push forward what is possible in understanding and evaluating visual and textual content, although robustness and generalization remain open challenges. The papers introducing UniSim-Bench and the comprehensive aesthetic MLLM stand out for advancing unified benchmarks and nuanced aesthetic insight, respectively.

Sources

Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

A Two-Fold Patch Selection Approach for Improved 360-Degree Image Quality Assessment

Towards Automatic Evaluation for Image Transcreation

What makes a good metric? Evaluating automatic metrics for text-to-image consistency

HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
