Efficient and Multimodal LLMs: Recent Trends

Advances in Efficient and Multimodal Large Language Models

Recent developments in Large Language Models (LLMs) have focused on improving efficiency, reducing computational demands, and expanding capabilities through multimodal integration. Innovations in quantization, such as asymmetric microscaling data formats and bit-serial mixture-of-datatype acceleration, have made 4-bit LLM inference more accurate and more robust to activation outliers, in some cases without requiring any calibration. These advances are crucial for deploying LLMs on resource-constrained devices such as mobile phones, where models like SlimLM and BlueLM-V-3B have demonstrated efficient on-device processing.
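For intuition about why group-wise, asymmetric scaling matters, the sketch below is a minimal NumPy illustration of 4-bit quantization with a per-group scale and zero point. It is a toy example, not the AMXFP4 or BitMoD scheme (those use microscaling floating-point and mixture-of-datatype encodings, respectively); it only shows how per-group asymmetric ranges limit the damage a single activation outlier can do.

```python
import numpy as np

def quantize_group_asym_int4(x, group_size=32):
    """Quantize a 1-D float tensor to unsigned 4-bit values, with one
    (scale, zero_point) pair per group of `group_size` elements.

    Per-group, asymmetric ranges mean a single activation outlier only
    stretches the quantization grid of its own group, not the whole tensor.
    """
    assert x.size % group_size == 0, "pad the tensor to a multiple of group_size"
    groups = x.reshape(-1, group_size)
    qmin, qmax = 0, 15  # unsigned 4-bit integer range

    g_min = groups.min(axis=1, keepdims=True)
    g_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(g_max - g_min, 1e-8) / (qmax - qmin)
    zero_point = np.round(-g_min / scale)

    q = np.clip(np.round(groups / scale + zero_point), qmin, qmax)
    x_hat = (q - zero_point) * scale  # dequantized reconstruction
    return q.astype(np.uint8), scale, zero_point, x_hat.reshape(x.shape)

# Toy activation vector with one outlier: group-wise asymmetric scaling
# keeps the reconstruction error of the unaffected groups small.
x = np.random.randn(128).astype(np.float32)
x[7] = 20.0  # simulated activation outlier
_, _, _, x_hat = quantize_group_asym_int4(x)
print("mean abs error:", float(np.abs(x - x_hat).mean()))
```

The papers cited here go well beyond this toy: AMXFP4 replaces the uniform integer grid with an asymmetric microscaling floating-point format, and BitMoD selects among datatypes in hardware; the sketch is only meant to convey the group-wise intuition.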

Multimodal LLMs, which integrate language models with visual and other sensory data, are extending these capabilities to everyday tasks. BlueLM-V-3B, in particular, shows how co-designing algorithms and systems can optimize model inference on mobile platforms, achieving high performance with minimal hardware requirements.

Attention has also turned to how quantization affects the quality of generated code. Studies such as 'Precision or Peril' find that quantization's effects on code quality are inconsistent, underscoring the need for careful scrutiny and continuous validation of LLM-generated code as models evolve.

Noteworthy papers include:

  • AMXFP4: Introduces an asymmetric microscaling floating-point data format that tames activation outliers, enabling robust, calibration-free 4-bit inference.
  • BlueLM-V-3B: Demonstrates efficient deployment of multimodal LLMs on mobile devices, achieving high performance with minimal hardware requirements.
  • Precision or Peril: Provides a comprehensive evaluation of the impact of quantization on code quality, emphasizing the need for careful scrutiny of LLM-generated code.

Sources

AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Does Prompt Formatting Have Any Impact on LLM Performance?

Generating Energy-efficient code with LLMs

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Precision or Peril: Evaluating Code Quality from Quantized Large Language Models

BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Bi-Mamba: Towards Accurate 1-Bit State Space Models

Green My LLM: Studying the key factors affecting the energy consumption of code assistants

Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Hymba: A Hybrid-head Architecture for Small Language Models
