Advancements in Mixture of Experts for Efficient Large Language Models

The field of large language models is advancing rapidly, with a strong focus on improving efficiency alongside performance. Recent developments have centered on the Mixture of Experts (MoE) paradigm, in which a router activates only a small subset of parameters (the experts) for each input token. This selective activation reduces computational cost while largely preserving model accuracy. Researchers are exploring a range of innovations, including novel routing mechanisms, sparse expert allocation, and decentralized learning strategies, to further improve the efficiency and scalability of MoE models. The integration of MoE with complementary techniques, such as quantization and metasurface-enabled wireless communication, is also being investigated. Together, these advances could significantly ease the deployment of large language models in real-world applications.

Noteworthy papers include USMoE, which proposes a unified competitive learning framework to improve the performance of existing sparse MoEs (SMoEs); S2MoE, which introduces a robust sparse mixture of experts trained via stochastic learning to mitigate representation collapse; and MiLo, which augments highly quantized MoEs with a mixture of low-rank compensators to recover the accuracy lost to extreme quantization.
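
To make the core mechanism concrete, the following is a minimal sketch of top-k token routing in a sparse MoE layer, written in PyTorch purely for illustration. The expert count, layer sizes, and the simple softmax-over-top-k gate are assumptions of this sketch, not the routing designs proposed in any of the papers listed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Minimal sparse Mixture-of-Experts layer with top-k token routing.

    Illustrative only: the gating scheme and dimensions are generic
    assumptions, not the method of any specific paper in this digest.
    """

    def __init__(self, d_model: int = 64, d_hidden: int = 128,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: maps each token to a score per expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to a list of tokens.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.gate(tokens)                          # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep k experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the chosen k

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Which (token, slot) pairs were routed to expert e.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens for this batch
            expert_out = expert(tokens[token_idx])
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert_out
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = TopKMoE()
    y = layer(torch.randn(2, 16, 64))  # only 2 of 8 experts run per token
    print(y.shape)                     # torch.Size([2, 16, 64])
```

Because each token passes through only top_k of the num_experts expert networks, the per-token compute stays roughly constant as experts are added; the routing, expert-parallelism, and quantization papers below all build on this property.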

Sources

Sparse Mixture of Experts as Unified Competitive Learning

S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning

Mixture of Routers

Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models

Over-the-Air Edge Inference via End-to-End Metasurfaces-Integrated Artificial Neural Networks

CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures

DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism

Mixture-of-Experts for Distributed Edge Computing with Channel-Aware Gating Function

IRS Assisted Decentralized Learning for Wideband Spectrum Sensing

Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
