Abstract
Router-weighted Expert Activation Merging (REAM) is proposed as a novel method for reducing memory requirements in Mixture-of-Experts large language models by grouping and merging expert weights instead of pruning them, achieving performance comparable to uncompressed models while maintaining efficiency.
Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reducing memory requirements include weight pruning and quantization. Motivated by Router-weighted Expert Activation Pruning (REAP), which prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math, and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
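The core idea in the abstract, merging grouped experts with router-activation weighting rather than deleting them, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `ream_merge`, the per-expert scalar `router_scores`, and the precomputed `groups` are all assumptions made for the sketch.

```python
import numpy as np

def ream_merge(expert_weights, router_scores, groups):
    """Merge each group of experts into a single expert whose weights are
    the router-activation-weighted average of the group members' weights.

    expert_weights : list of np.ndarray, one weight tensor per expert
    router_scores  : per-expert router activation mass (e.g. summed gate
                     probabilities over a calibration set) -- an assumption
    groups         : list of index lists, one list per merged expert
    """
    merged = []
    for group in groups:
        scores = np.array([router_scores[i] for i in group], dtype=float)
        coeffs = scores / scores.sum()  # normalize scores within the group
        stacked = np.stack([expert_weights[i] for i in group])
        # Weighted average over the expert axis (axis 0 of `stacked`)
        merged.append(np.tensordot(coeffs, stacked, axes=1))
    return merged
```

With equal router scores this reduces to a plain average of the grouped experts; unequal scores bias the merged expert toward the more frequently activated members, which is the "router-weighted" part of the name.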
Community
Excited to share the REAM paper! The full code and GLM-4.5-Air-REAM are coming in a couple of days.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AIMER: Calibration-Free Task-Agnostic MoE Pruning (2026)
- EvoESAP: Non-Uniform Expert Pruning for Sparse MoE (2026)
- Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression (2026)
- LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing (2026)
- MoE-Spec: Expert Budgeting for Efficient Speculative Decoding (2026)
- CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging (2026)
- Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score (2026)
I'm glad this finally released!
It'd be cool to see the method on more recent models.
Get this paper in your agent:
hf papers read 2604.04356
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 3
Datasets citing this paper: 0
Spaces citing this paper: 0