Advancing Vision-Language Models: Adaptability and Robustness in Few-Shot and Weakly Supervised Scenarios

Recent advances in this area center on enhancing the capabilities of vision-language models (VLMs), particularly CLIP, across downstream tasks. A significant focus has been improving the adaptability and robustness of these models in few-shot and weakly supervised scenarios. Innovations include Bayesian inference-based adapters for better uncertainty estimation, semantic-aware representations for multi-label recognition with partial labels, and noise-adaptation techniques for image denoising. There is also growing interest in leveraging cross-modal interactions and prompt engineering to refine few-shot performance, while regularization methods and attention mechanisms are being explored to improve weakly supervised semantic segmentation and classification. Together, these developments push the boundaries of what VLMs can achieve, underscoring the importance of fine-grained semantic understanding and robust model deployment in real-world applications.
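To make the uncertainty-estimation thread concrete, the sketch below shows a generic uncertainty-aware adapter over frozen CLIP-style image features. It uses Monte Carlo dropout as a simple stand-in for the full Bayesian inference developed in BayesAdapter, and random tensors as placeholders for real CLIP embeddings; the class names, dimensions, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: uncertainty-aware few-shot adaptation on frozen CLIP features.
# Monte Carlo dropout stands in for the Bayesian inference used in BayesAdapter;
# the feature tensors below are random placeholders for real CLIP embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCDropoutAdapter(nn.Module):
    """Residual adapter over frozen image features; dropout stays stochastic at
    test time so repeated forward passes approximate a predictive distribution."""
    def __init__(self, dim: int, num_classes: int, p: float = 0.2):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h = feats + self.proj(self.drop(feats))   # residual adaptation of features
        return self.head(F.normalize(h, dim=-1))

def predict_with_uncertainty(adapter: MCDropoutAdapter, feats: torch.Tensor,
                             samples: int = 20):
    adapter.train()  # keep dropout active (Monte Carlo dropout)
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(adapter(feats), dim=-1) for _ in range(samples)]
        )
    # Predictive mean over samples, plus total variance as an uncertainty proxy.
    return probs.mean(0), probs.var(0).sum(-1)

# Placeholder query batch of 8 CLIP-style image embeddings (dim 512), 10 classes.
feats = F.normalize(torch.randn(8, 512), dim=-1)
adapter = MCDropoutAdapter(dim=512, num_classes=10)
mean_probs, uncertainty = predict_with_uncertainty(adapter, feats)
print(mean_probs.shape, uncertainty.shape)  # torch.Size([8, 10]) torch.Size([8])
```

High-variance predictions can then be deferred or down-weighted, which is the practical payoff of uncertainty-aware adaptation in deployment.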

Noteworthy papers include 'BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation', which introduces a Bayesian approach for improved uncertainty estimates, and 'Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP', which proposes a mutual guidance mechanism between the two modalities for better few-shot classification performance.
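For the training-free direction, the sketch below illustrates how text-side zero-shot logits and an image-side cache of labelled support examples can be fused without any gradient updates, in the spirit of Tip-Adapter-style cache models. The mutual-guidance mechanism of the cited paper is more involved; this only demonstrates the general idea of combining the two modalities, and every tensor here is a random placeholder rather than real CLIP output.

```python
# Hypothetical sketch of training-free few-shot classification with CLIP-style
# features: zero-shot text logits are blended with a cache of labelled support
# images (a Tip-Adapter-style formulation, not the cited paper's exact method).
import torch
import torch.nn.functional as F

def training_free_logits(img: torch.Tensor, text_w: torch.Tensor,
                         cache_keys: torch.Tensor, cache_vals: torch.Tensor,
                         alpha: float = 1.0, beta: float = 5.0) -> torch.Tensor:
    """img: (B, D) query features; text_w: (C, D) class text embeddings;
    cache_keys: (K, D) support image features; cache_vals: (K, C) one-hot labels."""
    zero_shot = 100.0 * img @ text_w.t()                      # text-side logits
    affinity = torch.exp(-beta * (1.0 - img @ cache_keys.t()))  # image-side similarity
    cache_logits = affinity @ cache_vals                       # vote of support labels
    return zero_shot + alpha * cache_logits

# Placeholders: 10 classes, 512-d features, 40 support examples (4 shots/class).
D, C, K = 512, 10, 40
img = F.normalize(torch.randn(8, D), dim=-1)
text_w = F.normalize(torch.randn(C, D), dim=-1)
cache_keys = F.normalize(torch.randn(K, D), dim=-1)
cache_vals = F.one_hot(torch.randint(0, C, (K,)), C).float()
preds = training_free_logits(img, text_w, cache_keys, cache_vals).argmax(-1)
print(preds.shape)  # torch.Size([8])
```

Because both signals are computed from frozen features, the few-shot pipeline needs no training, which is what makes these methods attractive for rapid adaptation.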

Sources

BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation

Label-template based Few-Shot Text Classification with Contrastive Learning

LAN: Learning to Adapt Noise for Image Denoising

Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Enhance Vision-Language Alignment with Noise

MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation

Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

LMM-Regularized CLIP Embeddings for Image Classification

CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels

Modelling Multi-modal Cross-interaction for ML-FSIC Based on Local Feature Selection

Prompt Categories Cluster for Weakly Supervised Semantic Segmentation

Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models

I0T: Embedding Standardization Method Towards Zero Modality Gap

Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning
