Vision-Language Models and Foundation Models: Remote Sensing, Digital Oncology, and Computational Pathology

Report on Current Developments in the Research Area

General Direction of the Field

Recent work in this area focuses on improving Vision-Language Models (VLMs) and foundation models in three specialized domains: remote sensing, digital oncology, and computational pathology. The common goal is models that are more efficient and robust, and that generalize across tasks and datasets even when labeled data is scarce or inputs are noisy.

  1. Integration of Vision-Language Models in Remote Sensing:

    • Work on remote sensing VLMs is shifting towards better zero-shot classification. Inductive approaches, which divide an image into patches and predict on each patch in isolation, are being replaced by transductive inference, which uses contextual information shared among the patches. The aim is more accurate predictions without extensive supervision or significant computational overhead.
  2. Foundation Models in Digital Oncology:

    • Foundation models in digital oncology are progressing towards more efficient use of computational resources and better performance on diverse clinical tasks. Models such as CanvOI explore architectural modifications, including larger tile sizes and smaller patch sizes, and achieve state-of-the-art results on cancer-related benchmarks. These models also perform better when trained on smaller datasets, which suggests they can help overcome data scarcity in the biomedical field.
  3. Supervised Foundation Models in Computational Pathology:

    • Computational pathology is moving towards supervised training of foundation models in order to reduce the high cost of training them. Multi-task learning is used to train a joint encoder that captures the properties of tissue samples more effectively. These models match or exceed the performance of self-supervised models while requiring significantly less training data, which makes them more practical for real-world applications.
  4. Transformer Models in On-board Satellite Image Classification:

    • Transformer-based architectures are gaining traction in remote sensing image classification, particularly for on-board satellite processing. Pre-trained Transformer models outperform CNN-based models in accuracy, computational efficiency, and robustness to noisy data. The focus is on identifying models that deliver high accuracy at low computational cost, which makes them suitable for real-time satellite operations.
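
The difference between inductive and transductive zero-shot inference described in item 1 can be illustrated with a small sketch. This is not the method of the paper cited below; it is a generic label-propagation-style smoothing, assuming image and text embeddings have already been produced by some VLM encoder. The function names and the parameters `alpha`, `k`, `n_iters`, and `temperature` are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def inductive_probs(image_emb, text_emb, temperature=0.01):
    """Score every sample on its own against the class text embeddings."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return softmax(sims / temperature)


def transductive_probs(image_emb, text_emb, n_iters=10, alpha=0.5, k=3,
                       temperature=0.01):
    """Smooth the inductive predictions over a k-nearest-neighbour graph.

    Assumption: generic label propagation, used here only to show how
    context from the other samples in the batch can enter the prediction.
    """
    z = l2_normalize(image_emb)
    init = inductive_probs(image_emb, text_emb, temperature)

    # Cosine affinity between samples, with self-matches excluded.
    affinity = z @ z.T
    np.fill_diagonal(affinity, -np.inf)

    # Keep the k most similar neighbours of each sample.
    idx = np.argsort(-affinity, axis=1)[:, :k]
    rows = np.arange(len(z))[:, None]
    weights = np.zeros_like(affinity)
    weights[rows, idx] = np.maximum(affinity[rows, idx], 0.0)
    weights /= np.maximum(weights.sum(axis=1, keepdims=True), 1e-12)

    # Mix each sample's own prediction with its neighbours' predictions.
    probs = init.copy()
    for _ in range(n_iters):
        probs = (1.0 - alpha) * init + alpha * (weights @ probs)
    return probs / probs.sum(axis=1, keepdims=True)
```

With `alpha=0.0` the neighbour term vanishes and `transductive_probs` returns the inductive predictions, which makes the role of the contextual term easy to isolate.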

Noteworthy Papers

  • Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification:

    • Introduces a transductive inference approach that significantly improves zero-shot classification accuracy in remote sensing by leveraging contextual information.
  • CanvOI, an Oncology Intelligence Foundation Model:

    • Demonstrates a novel approach to optimizing foundation models for digital pathology, achieving state-of-the-art performance with reduced computational resources.
  • Tissue Concepts: supervised foundation models in computational pathology:

    • Proposes a supervised training method that significantly reduces the cost of training foundation models in computational pathology, achieving comparable performance to self-supervised models with much less data.
  • On-board Satellite Image Classification for Earth Observation:

    • Identifies EfficientViT-M2 as the optimal model for on-board satellite image classification, offering high accuracy, efficiency, and robustness against noisy data.
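
The compute trade-off behind the tile and patch sizes mentioned for CanvOI can be made concrete with some arithmetic. In a standard Vision Transformer, a square tile cut into non-overlapping square patches yields (tile / patch)^2 tokens, and the cost of self-attention grows with the square of the token count. The sketch below assumes that standard setup. The 448/8 configuration is a hypothetical example, not a value taken from the paper, and the 224/16 baseline is the common ViT default.

```python
def vit_token_count(tile_size, patch_size):
    """Number of patch tokens for a square tile and square patches."""
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be divisible by patch_size")
    per_side = tile_size // patch_size
    return per_side * per_side


def relative_attention_cost(tile_size, patch_size, base_tile=224, base_patch=16):
    """Self-attention cost relative to a baseline configuration.

    Assumes cost scales with the square of the token count and ignores
    every other layer of the network.
    """
    base_tokens = vit_token_count(base_tile, base_patch)
    tokens = vit_token_count(tile_size, patch_size)
    return (tokens / base_tokens) ** 2


if __name__ == "__main__":
    # Hypothetical configurations, chosen only for illustration.
    for tile, patch in [(224, 16), (448, 16), (448, 8)]:
        print(tile, patch, vit_token_count(tile, patch),
              relative_attention_cost(tile, patch))
```

Enlarging the tile and shrinking the patch both multiply the token count, so a model that does both has to spend its compute budget differently elsewhere to stay efficient.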

Sources

Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently

Tissue Concepts: supervised foundation models in computational pathology

On-board Satellite Image Classification for Earth Observation: A Comparative Study of Pre-Trained Vision Transformer Models