Recent work on large language models (LLMs) and mixture-of-experts (MoE) architectures has made significant progress in improving computational efficiency without sacrificing model quality. Researchers are increasingly focusing on pruning techniques that shrink model size while preserving accuracy, leveraging insights from routing policies and input activations. These methods are often one-shot, require no retraining, and remain effective at maintaining model performance even at high sparsity levels. In parallel, novel scheduling and parallelism strategies are being developed to raise the inference throughput of MoE models, addressing the bottlenecks introduced by large parameter counts and communication overheads. Adaptive switching mechanisms between small and large agents are also being explored to balance cloud-based and locally deployed LLMs, improving both performance and efficiency. In the realm of graph neural networks (GNNs), there is growing emphasis on robust and balanced data pruning methods that can handle large-scale, imbalanced, and noisy datasets, ensuring both efficiency and reliability for on-device deployment.
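To make the routing- and activation-aware criterion behind these one-shot methods concrete, the sketch below scores each weight of a single expert by its magnitude, the norm of its calibration-time input activations, and the router probability assigned to that expert, then zeroes the lowest-scoring weights per output neuron. This is a minimal sketch assuming a Wanda-style importance score extended with router weighting; the function names, shapes, and exact weighting scheme are illustrative and not taken verbatim from MoE-Pruner.

```python
import torch

def moe_pruning_scores(weight, inputs, gate_probs):
    """Score each weight of one expert's linear layer for one-shot pruning.

    weight:     (out_features, in_features) expert weight matrix
    inputs:     (num_tokens, in_features) calibration activations routed to this expert
    gate_probs: (num_tokens,) router probabilities assigned to this expert
    """
    # Per-input-channel activation norm, weighted by how strongly the router
    # selected this expert for each calibration token (hypothetical weighting).
    act_norm = torch.sqrt((gate_probs.unsqueeze(1) * inputs.pow(2)).sum(dim=0))
    # Wanda-style importance: |weight| times activation norm, broadcast over rows.
    return weight.abs() * act_norm.unsqueeze(0)

def prune_expert(weight, inputs, gate_probs, sparsity=0.5):
    """Zero out the lowest-scoring weights per output neuron (one-shot, no retraining)."""
    scores = moe_pruning_scores(weight, inputs, gate_probs)
    k = int(weight.shape[1] * sparsity)                 # weights to drop per row
    drop = scores.topk(k, dim=1, largest=False).indices # k smallest scores per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, drop, False)
    return weight * mask

# Toy usage: a 64x128 expert with 256 calibration tokens at 50% sparsity.
w = torch.randn(64, 128)
x = torch.randn(256, 128)
g = torch.rand(256)
pruned = prune_expert(w, x, g, sparsity=0.5)
print((pruned == 0).float().mean())  # ~0.5
```

Because the score depends only on stored weights, a small calibration set, and the router's gate probabilities, the mask can be computed in a single pass per expert, which is what makes this family of methods one-shot and retraining-free.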
Noteworthy papers include MoE-Pruner, which introduces a one-shot pruning method that significantly outperforms state-of-the-art LLM pruning approaches, and EPS-MoE, which demonstrates an average 21% improvement in prefill throughput over existing parallel inference methods.