Large Pre-Trained and Multi-Modal Models: Efficiency, Scalability, and Performance

Report on Current Developments in the Research Area

General Direction of the Field

Recent work in this area focuses on improving the efficiency, scalability, and performance of large pre-trained models (LPMs) and multi-modal models, particularly for natural language processing (NLP) and computer vision. The field is moving towards parameter-efficient fine-tuning methods, dynamic and scalable model architectures, and techniques that mitigate common problems such as hallucination in visual-language models and the high computational cost of large models.

  1. Parameter-Efficient Fine-Tuning (PEFT): There is a significant shift towards PEFT methods that reduce the memory footprint of fine-tuning while improving its generalizability and efficiency. These methods use mathematical tools such as singular value decomposition (SVD) to initialize low-rank matrices, giving gradient descent better starting points and making pre-trained models easier to adapt to new tasks (see the first sketch after this list).

  2. Multi-Modal Model Alignment and Fusion: Integrating knowledge from uni-modal models into cohesive multi-modal representations is a growing area of interest. Recent approaches build unified multi-directional connectors that fuse the specialized expertise of various uni-modal models, enabling efficient scaling to new tasks and modalities without altering the model architecture (a connector sketch follows the list).

  3. Scalable and Efficient Model Pruning: Pruning methods are being refined to address the computational challenges posed by large models, particularly Mixture-of-Experts (MoE) models. Structured pruning applied before unstructured pruning can deliver superior performance at reduced computational cost, and these methods exploit latent structure among experts to keep the pruning itself scalable and efficient (a two-stage sketch follows the list).

  4. Mitigating Hallucination in Visual-Language Models: There is a concerted effort to reduce hallucination in visual-language models by re-balancing the attention distribution between textual and visual tokens. New decoding methods reduce textual bias and strengthen the model's reliance on visual information, improving the accuracy and reliability of multimodal outputs (see the decoding sketch after the list).

  5. Efficient Diffusion Model Fine-Tuning: Fine-tuning of diffusion models is being optimized to make better use of temporarily ineffective parameters, adapting pre-trained models to new tasks without overfitting. These methods combine sparse low-rank adaptation with progressive parameter adjustment to enhance generative capability while preserving the pre-trained model's generalization ability (sketched after the list).

  6. Dynamic Expert Allocation in MoE Models: Expert allocation in MoE models is becoming more dynamic, with new router mechanisms that determine the number of experts to activate based on token importance. Activating the most relevant experts for each input token improves performance on benchmark datasets (see the router sketch after the list).

  7. Contrastive Multimodal Learning: The alignment of multimodal representations is being re-examined to capture not only shared information but also synergistic and unique interactions between modalities. New learning strategies maximize mutual information between augmented versions of multimodal features, yielding a more complete picture of multimodal interactions (a contrastive-loss sketch follows the list).

  8. Efficient Quantization of Diffusion Transformers: Deploying diffusion transformers on resource-constrained devices is being made practical by quantization methods that reduce the impact of channel-wise outliers in input activations. These methods combine temporal-aggregated smoothing with a layer-wise grid search to enable low-bitwidth quantization without compromising model performance (see the smoothing sketch after the list).
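
Illustrating item 1, the sketch below seeds a trainable low-rank update from the top singular triplets of a frozen pretrained weight, so the layer reproduces the original weights exactly at initialization and only the two small factors are trained. It is a minimal PyTorch illustration of SVD-based initialization, not the exact SVFit parameterization; the class name, rank, and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SVDInitLowRankLinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update whose
    factors are initialized from the top-r singular triplets of the weight."""

    def __init__(self, pretrained: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze the pretrained layer
        U, S, Vh = torch.linalg.svd(self.base.weight.data, full_matrices=False)
        # Seed the low-rank factors with the top-r singular directions ...
        self.A = nn.Parameter(Vh[:rank, :].clone())                 # (rank, in_features)
        self.B = nn.Parameter((U[:, :rank] * S[:rank]).clone())     # (out_features, rank)
        # ... and remove their contribution from the frozen base so the layer
        # reproduces the original weight exactly at initialization.
        with torch.no_grad():
            self.base.weight -= self.B.data @ self.A.data

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T   # frozen path + trainable low-rank path

layer = SVDInitLowRankLinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 12288: only the two rank-8 factors are updated during fine-tuning
```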
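
For item 2, a connector between frozen uni-modal encoders can be sketched as a small mixture-of-experts projection into a shared embedding space. The expert count, dimensions, and module names are assumptions, and Alt-MoE's alternating optimization schedule and multi-directional structure are omitted; this only shows where such a connector sits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEConnector(nn.Module):
    """Lightweight mixture-of-experts projection from a frozen uni-modal
    embedding space into a shared multi-modal space."""

    def __init__(self, d_in: int, d_shared: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_in, d_shared) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (batch, d_in)
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, d_shared)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # weighted fusion

# Hypothetical frozen encoders would produce these embeddings; only the
# connectors are trained to align the two modalities in the shared space.
text_to_shared = MoEConnector(d_in=768, d_shared=512)
image_to_shared = MoEConnector(d_in=1024, d_shared=512)
z_text = text_to_shared(torch.randn(8, 768))
z_image = image_to_shared(torch.randn(8, 1024))
print(F.cosine_similarity(z_text, z_image).shape)   # per-pair alignment scores: torch.Size([8])
```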
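
A toy version of the structured-then-unstructured order in item 3: whole experts are dropped first, then the surviving experts are magnitude-pruned element-wise. The routing-frequency importance score and the keep/sparsity settings are illustrative stand-ins; STUN's actual criterion exploits latent structure among experts.

```python
import torch
import torch.nn as nn

def structured_then_unstructured(experts: nn.ModuleList, expert_importance: torch.Tensor,
                                 keep_experts: int, sparsity: float) -> nn.ModuleList:
    # Stage 1 (structured): drop whole experts, keeping only the most important ones.
    keep_idx = torch.topk(expert_importance, keep_experts).indices.tolist()
    kept = nn.ModuleList([experts[i] for i in keep_idx])
    # Stage 2 (unstructured): magnitude-prune individual weights inside the survivors.
    with torch.no_grad():
        for expert in kept:
            for p in expert.parameters():
                threshold = p.abs().flatten().quantile(sparsity)
                p.mul_((p.abs() >= threshold).to(p.dtype))
    return kept

experts = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
tokens_routed = torch.tensor([120., 3., 80., 950., 10., 400., 5., 60.])  # toy importance proxy
pruned = structured_then_unstructured(experts, tokens_routed, keep_experts=4, sparsity=0.5)
print(len(pruned), "experts kept, roughly half of their weights zeroed")
```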
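
Item 4's re-balancing can be caricatured as a contrastive decoding step: next-token logits conditioned on the image are contrasted against logits from a text-only (or visually degraded) pass, so tokens supported mainly by the language prior are down-weighted. The weighting rule and the `alpha` knob are assumptions; RBD re-balances attention rather than applying this exact formula.

```python
import torch

def rebalanced_next_token_logits(logits_with_image: torch.Tensor,
                                 logits_text_only: torch.Tensor,
                                 alpha: float = 0.5) -> torch.Tensor:
    # Amplify evidence that requires the visual input and subtract the
    # text-only prior, discouraging tokens driven purely by the language model.
    return (1 + alpha) * logits_with_image - alpha * logits_text_only

vocab_size = 32000
logits_full = torch.randn(vocab_size)    # conditioned on image + prompt (toy values)
logits_blind = torch.randn(vocab_size)   # conditioned on the prompt alone
next_token = rebalanced_next_token_logits(logits_full, logits_blind).argmax()
print(int(next_token))
```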
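
For item 5, the sketch below fine-tunes only the smallest-magnitude ("temporarily ineffective") weights of a frozen layer through a masked additive delta. SaRA further imposes low-rank structure on this sparse update and adjusts the trainable set progressively; those pieces, and the class and argument names used here, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseDeltaLinear(nn.Module):
    """Frozen pretrained linear layer whose smallest-magnitude weights receive
    a trainable additive update through a fixed sparse mask."""

    def __init__(self, pretrained: nn.Linear, train_fraction: float = 0.05):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad_(False)
        w = self.base.weight.data
        # Select the smallest-magnitude entries: they contribute least to the
        # pretrained behaviour, so updating them risks little forgetting.
        threshold = w.abs().flatten().quantile(train_fraction)
        self.register_buffer("mask", (w.abs() <= threshold).to(w.dtype))
        self.delta = nn.Parameter(torch.zeros_like(w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.base.weight + self.delta * self.mask, self.base.bias)

layer = SparseDeltaLinear(nn.Linear(512, 512), train_fraction=0.05)
print(int(layer.mask.sum()), "of", layer.mask.numel(), "weights are effectively trainable")
```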
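
A minimal version of the dynamic routing in item 6: each token activates as many experts as are needed for the routing distribution to reach a probability-mass threshold, so harder or more important tokens receive more experts. The coverage threshold is an illustrative proxy for DA-MoE's token-importance mechanism, and the dispatch to the experts themselves is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTopKRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int, coverage: float = 0.6, max_k: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.coverage, self.max_k = coverage, max_k

    def forward(self, x: torch.Tensor):                      # x: (n_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)              # (n_tokens, n_experts)
        sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
        cumulative = sorted_p.cumsum(dim=-1)
        # A token uses as many experts as needed to cover `coverage` of the routing mass.
        k_per_token = ((cumulative < self.coverage).sum(dim=-1) + 1).clamp(max=self.max_k)
        return sorted_idx, sorted_p, k_per_token

router = DynamicTopKRouter(d_model=64, n_experts=8)
_, _, k = router(torch.randn(5, 64))
print(k.tolist())    # number of experts activated for each of the 5 tokens
```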
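
Item 7's objective can be sketched as a symmetric InfoNCE loss between two augmented views of a fused multimodal representation; minimizing it maximizes a lower bound on the mutual information between the views. The temperature, feature dimensions, and fusion step are placeholders, and CoMM's exact formulation differs.

```python
import torch
import torch.nn.functional as F

def multimodal_infonce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z1, z2: (batch, dim) fused multimodal features from two augmented views of the same inputs.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature             # (batch, batch) cosine-similarity matrix
    targets = torch.arange(z1.size(0))           # matching view pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = multimodal_infonce(torch.randn(32, 256), torch.randn(32, 256))
print(float(loss))   # contrastive bound: lower loss means higher mutual information between views
```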
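
Finally, item 8's smoothing step can be sketched as per-channel activation smoothing: outlier energy is migrated from activations into the weights before low-bit quantization, with activation statistics aggregated over diffusion timesteps. The migration exponent `alpha` and the calibration setup are assumptions, and DiTAS's layer-wise grid search over quantization parameters is not shown.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smooth_linear(layer: nn.Linear, calib_acts_per_timestep: list, alpha: float = 0.5) -> torch.Tensor:
    # Aggregate per-channel activation maxima across calibration samples from all timesteps.
    act_max = torch.stack([a.abs().amax(dim=0) for a in calib_acts_per_timestep]).amax(dim=0)
    w_max = layer.weight.abs().amax(dim=0)                       # per-input-channel weight maxima
    scale = act_max.clamp(min=1e-5) ** alpha / w_max.clamp(min=1e-5) ** (1 - alpha)
    layer.weight.mul_(scale)      # fold the scale into the weights ...
    return scale                  # ... and divide the activations by it at inference time

layer = nn.Linear(128, 128)
calib = [torch.randn(64, 128) * (1 + 5 * torch.rand(128)) for _ in range(10)]  # 10 diffusion timesteps
scale = smooth_linear(layer, calib)
smoothed = calib[0] / scale       # this tamer tensor is what the low-bit quantizer now sees
print(scale.shape)                # one smoothing factor per input channel
```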

Noteworthy Papers

  • SVFit: Introduces a novel PEFT approach using SVD to initialize low-rank matrices, outperforming LoRA with significantly fewer trainable parameters.
  • Alt-MoE: Proposes a unified multi-directional connector for multi-modal alignment, efficiently scaling to new tasks and modalities.
  • STUN: Demonstrates that structured expert pruning can precede unstructured pruning, achieving superior performance with scalable complexity.
  • RBD: Mitigates hallucination in VLMs by re-balancing attention distribution, outperforming existing methods on hallucination metrics.
  • SaRA: Enhances diffusion model fine-tuning with sparse low-rank adaptation, maintaining generalization ability while improving generative capabilities.
  • DA-MoE: Introduces a dynamic router mechanism for MoE models, consistently outperforming state-of-the-art models on benchmark datasets.
  • CoMM: Captures redundant, unique, and synergistic information in multimodal interactions, achieving state-of-the-art results on multimodal benchmarks.
  • DiTAS: Enables efficient quantization of diffusion transformers, maintaining performance with low-bitwidth quantization on resource-constrained devices.

Sources

SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values

Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models

STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models

What to align in multimodal contrastive learning?

DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing