Optimizing Computational Efficiency in LLMs and MoE Architectures

Recent work on large language models (LLMs) and mixture-of-experts (MoE) architectures shows steady progress in improving computational efficiency without sacrificing model performance. Researchers are increasingly turning to pruning techniques that shrink model size while preserving accuracy, drawing on signals from routing policies and input activations. These methods are typically one-shot and require no retraining, yet they maintain performance even at high sparsity. In parallel, new scheduling and parallelism strategies aim to raise MoE inference throughput by easing the bottlenecks created by large parameter counts and communication overhead. Adaptive switching between small and large agents is also being explored to balance cloud-based and locally deployed LLMs, improving both performance and efficiency. In the realm of graph neural networks (GNNs), there is growing emphasis on robust, balanced data pruning methods that can handle large-scale, imbalanced, and noisy datasets, keeping on-device deployment both efficient and reliable.
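
Below is a minimal sketch of the general idea behind one-shot, retraining-free MoE pruning that uses both input activations and router statistics: a Wanda-style importance score (weight magnitude times activation norm) scaled by how often the router selects each expert, with the lowest-scoring weights zeroed out under a single global threshold. The scoring rule, function name, and numpy setup are illustrative assumptions, not the exact criterion used by MoE-Pruner.

```python
# Illustrative one-shot pruning for the experts of a single MoE layer.
# Assumption (not from the papers): importance = |weight| * activation norm,
# scaled by the router's selection frequency for each expert.
import numpy as np

def prune_expert_weights(expert_weights, input_activations, router_freqs, sparsity=0.5):
    """Zero out the lowest-importance weights across all experts in one shot.

    expert_weights:    list of (out_dim, in_dim) arrays, one per expert
    input_activations: (num_tokens, in_dim) calibration activations for the layer
    router_freqs:      (num_experts,) fraction of tokens routed to each expert
    sparsity:          fraction of weights to remove, with no retraining step
    """
    act_norm = np.linalg.norm(input_activations, axis=0)  # (in_dim,)
    # Score every weight, scaled by how heavily its expert is used.
    scores = [np.abs(w) * act_norm * f
              for w, f in zip(expert_weights, router_freqs)]
    # One global threshold, so rarely used experts are pruned more aggressively.
    threshold = np.quantile(np.concatenate([s.ravel() for s in scores]), sparsity)
    return [np.where(s >= threshold, w, 0.0)
            for w, s in zip(expert_weights, scores)]

# Toy usage: 4 experts, hidden size 8, 16 calibration tokens.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((8, 8)) for _ in range(4)]
acts = rng.standard_normal((16, 8))
freqs = np.array([0.4, 0.3, 0.2, 0.1])
sparse_experts = prune_expert_weights(experts, acts, freqs, sparsity=0.5)
```

Because the threshold is global rather than per-expert, weights in rarely selected experts are removed more aggressively than weights in heavily used ones, which is one way router hints can steer where sparsity lands.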

Noteworthy papers include MoE-Pruner, which introduces a one-shot pruning method that significantly outperforms state-of-the-art LLM pruning methods, and EPS-MoE, which demonstrates an average 21% improvement in prefill throughput over existing parallel inference methods.
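
To see why pipelining expert computation against communication can raise prefill throughput, here is an idealized timing model under stated assumptions (fixed per-micro-batch compute and all-to-all times, perfect overlap). It is a back-of-the-envelope illustration of pipelined scheduling in general, not EPS-MoE's actual scheduler.

```python
# Idealized timing model: splitting a prefill batch into micro-batches lets the
# expert computation of one chunk overlap with the communication of the previous one.

def sequential_time(num_chunks, compute_ms, comm_ms):
    """Run compute and communication back to back for every chunk."""
    return num_chunks * (compute_ms + comm_ms)

def pipelined_time(num_chunks, compute_ms, comm_ms):
    """Overlap compute and communication; the slower stage sets the steady-state pace."""
    return compute_ms + comm_ms + (num_chunks - 1) * max(compute_ms, comm_ms)

# Example: 8 micro-batches, 6 ms of expert compute and 4 ms of all-to-all each.
print(sequential_time(8, 6, 4))  # 80 ms
print(pipelined_time(8, 6, 4))   # 52 ms -> higher prefill throughput
```

Once the two stages overlap, total time is dominated by whichever stage is slower, so balancing compute and communication per micro-batch matters as much as shrinking either one.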

Sources

MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

GDeR: Safeguarding Efficiency, Balancing, and Robustness via Prototypical Graph Pruning
