Enhancing Adaptability and Robustness in Vision-Language Models

Recent work on vision-language models and prompt learning has delivered notable gains in zero-shot and few-shot capabilities. Researchers are improving the robustness and adaptability of these models by integrating techniques such as diffusion models, vector quantization, and hierarchical language structures. These innovations target stronger generalization across diverse datasets and domains, addressing challenges such as domain shift, catastrophic forgetting, and adversarial vulnerability. Notably, using large language models and vision-language embeddings to guide prompt learning and domain adaptation has shown promising results in tasks such as human-object interaction detection and continual learning. In parallel, work on uncertainty estimation for machine-learning interatomic potentials underscores the importance of quantifying model error to drive active learning. Overall, the field is moving toward more sophisticated and adaptable models that handle complex, real-world scenarios without extensive labeled data.
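
Most of the prompt-learning and adaptation methods surveyed here start from a frozen CLIP-style model, where zero-shot classification amounts to comparing a normalized image embedding against normalized text embeddings of class-name prompts. The minimal sketch below illustrates that shared baseline using OpenAI's clip package; the class names, prompt templates, and example.jpg path are illustrative placeholders rather than details taken from any of the listed papers.

```python
# Minimal CLIP-style zero-shot classification sketch (illustrative baseline,
# not any specific paper's method). Assumes torch, Pillow, and the `clip`
# package (pip install git+https://github.com/openai/CLIP.git) are installed.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical class names and hand-written prompt templates; prompt-learning
# methods replace or augment these with learned (soft) prompts.
class_names = ["dog", "cat", "car"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]

with torch.no_grad():
    # Average text embeddings over templates to form one prototype per class.
    text_protos = []
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        text_protos.append(emb.mean(dim=0))
    text_protos = torch.stack(text_protos)
    text_protos = text_protos / text_protos.norm(dim=-1, keepdim=True)

    # "example.jpg" is a placeholder path for any test image.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Cosine-similarity logits; softmax gives zero-shot class probabilities.
    logits = 100.0 * img_emb @ text_protos.T
    probs = logits.softmax(dim=-1)
    print(dict(zip(class_names, probs.squeeze(0).tolist())))
```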

Noteworthy Papers:

  • Frolic: Introduces a label-free prompt distribution learning and bias correction framework that significantly boosts zero-shot performance without labeled data (a simplified bias-correction sketch follows this list).
  • ADD: Proposes an adversarial environment design algorithm using regret-guided diffusion models to enhance agent robustness in deep reinforcement learning.
  • DIFFUSIONHOI: Utilizes text-to-image diffusion models for human-object interaction detection, achieving state-of-the-art performance in both regular and zero-shot setups.
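
As a rough intuition for the label-free bias correction highlighted for Frolic, the simplified sketch below debiases zero-shot logits by estimating the model's marginal class distribution on unlabeled images and subtracting its logarithm, so that no class is systematically favored under an assumed uniform prior. This is an illustrative approximation in plain PyTorch, not the paper's actual algorithm; image_embs and text_protos are assumed to be precomputed, normalized CLIP embeddings as in the earlier sketch.

```python
# Illustrative label-free bias correction for zero-shot logits (a simplified
# stand-in for the idea behind Frolic, not its actual algorithm).
import torch

def debias_zero_shot_logits(image_embs: torch.Tensor,
                            text_protos: torch.Tensor,
                            temperature: float = 100.0) -> torch.Tensor:
    """image_embs: (N, D) normalized image embeddings from unlabeled data.
    text_protos: (C, D) normalized class prompt embeddings.
    Returns debiased logits of shape (N, C)."""
    logits = temperature * image_embs @ text_protos.T          # (N, C)
    # Estimate the model's marginal class distribution on unlabeled data.
    marginal = logits.softmax(dim=-1).mean(dim=0)              # (C,)
    # Subtract the log-marginal so no class is systematically favored
    # (equivalent to assuming a uniform class prior).
    return logits - marginal.clamp_min(1e-8).log()

# Example with random stand-in embeddings:
if __name__ == "__main__":
    torch.manual_seed(0)
    imgs = torch.nn.functional.normalize(torch.randn(16, 512), dim=-1)
    txts = torch.nn.functional.normalize(torch.randn(3, 512), dim=-1)
    preds = debias_zero_shot_logits(imgs, txts).argmax(dim=-1)
    print(preds)
```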

Sources

  • Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting
  • Prompting Continual Person Search
  • Adversarial Environment Design via Regret-Guided Diffusion Models
  • Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
  • Prompt Diffusion Robustifies Any-Modality Prompt Learning
  • Historical Test-time Prompt Tuning for Vision Foundation Models
  • Open-Vocabulary Object Detection via Language Hierarchy
  • Evaluation of uncertainty estimations for Gaussian process regression based machine learning interatomic potentials
  • Vector Quantization Prompting for Continual Learning
  • Referring Human Pose and Mask Estimation in the Wild
  • Domain Adaptation with a Single Vision-Language Embedding
  • SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
  • Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
  • Active Learning for Vision-Language Models
  • Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier
  • GRADE: Quantifying Sample Diversity in Text-to-Image Models
  • Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
  • EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
  • IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking
  • Bayesian-guided Label Mapping for Visual Reprogramming
