FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach
Abstract
As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds its optimal threshold, limiting further gains from single-dimension fine-grained designs. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method that builds FineRMoE cost-effectively. Extensive experiments demonstrate that FineRMoE achieves superior performance across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
Community
The scaling law of MoE reveals a performance ceiling for fine-grained designs confined solely to the intermediate dimension. To break this ceiling, we introduce the FineRMoE (FineR-Grained MoE) architecture, which extends fine-grained expert design from the intermediate dimension alone to the output dimension as well, aiming to enhance expert specialization beyond the single-dimension limit. The core contributions of this work include the following (illustrative sketches follow the list):
- Finer-grained expert design across intermediate and output dimensions;
- Bi-level sparse forward computation paradigm for multi-expert fusion;
- Unified routing mechanism with one router governing two sparse layers;
- Generalized upcycling compatible with FineRMoE and conventional MoEs.
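Since the paper's implementation is not reproduced here, the PyTorch sketch below is only one plausible reading of the first three bullets: a single router scores (expert, output-shard) pairs, the top-k fine-grained experts are activated along the intermediate dimension, and the top-m output-dimension shards are activated inside each selected expert. All module names, default sizes, and the max-pooling rule for expert scores are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLevelSparseFFN(nn.Module):
    """Illustrative FineRMoE-style layer (a sketch, not the authors' code).

    One router scores (expert, output-shard) pairs; its logits drive two
    sparse selections: top-k fine-grained experts along the intermediate
    dimension, then top-m output-dimension shards inside each chosen expert.
    """

    def __init__(self, d_model=256, n_experts=16, d_ff_shard=64,
                 n_out_shards=4, top_k=4, top_m=2):
        super().__init__()
        assert d_model % n_out_shards == 0
        self.top_k, self.top_m = top_k, top_m
        self.n_out_shards = n_out_shards
        self.d_out_shard = d_model // n_out_shards
        # A single router governs both sparse levels.
        self.router = nn.Linear(d_model, n_experts * n_out_shards, bias=False)
        # Level 1: per-expert up-projections over a thin intermediate slice.
        self.w_in = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_ff_shard))
        # Level 2: per-expert down-projections split along the output dimension.
        self.w_out = nn.Parameter(
            0.02 * torch.randn(n_experts, n_out_shards, d_ff_shard, self.d_out_shard))

    def forward(self, x):                                      # x: (tokens, d_model)
        t, d = x.shape
        idx = torch.arange(t, device=x.device)
        logits = self.router(x).view(t, -1, self.n_out_shards)         # (t, E, S)
        # Level 1: pick top-k experts per token from pooled shard scores.
        gate_e, idx_e = logits.max(dim=-1).values.topk(self.top_k, dim=-1)
        gate_e = F.softmax(gate_e, dim=-1)                              # (t, k)
        y = x.new_zeros(t, d)
        for k in range(self.top_k):
            e = idx_e[:, k]                                             # (t,)
            h = F.gelu(torch.einsum('td,tdf->tf', x, self.w_in[e]))    # (t, f)
            # Level 2: pick top-m output shards inside the chosen expert.
            gate_s, idx_s = logits[idx, e].topk(self.top_m, dim=-1)    # (t, m)
            gate_s = F.softmax(gate_s, dim=-1)
            for m in range(self.top_m):
                s = idx_s[:, m]                                         # (t,)
                out = torch.einsum('tf,tfo->to', h, self.w_out[e, s])  # (t, d/S)
                # Scatter each shard's output back into its slice of d_model.
                cols = s[:, None] * self.d_out_shard + torch.arange(
                    self.d_out_shard, device=x.device)[None, :]
                y.scatter_add_(1, cols,
                               gate_e[:, k:k + 1] * gate_s[:, m:m + 1] * out)
        return y


if __name__ == "__main__":
    layer = BiLevelSparseFFN()
    tokens = torch.randn(8, 256)
    print(layer(tokens).shape)  # torch.Size([8, 256])
```

The generalized upcycling in the last bullet can be pictured in the same spirit, though the recipe below is an assumed illustration rather than the authors' procedure: a dense FFN checkpoint is sliced along its intermediate dimension into fine-grained experts and along its output dimension into output shards, so that training starts from pretrained dense weights instead of from scratch.

```python
def upcycle_dense_ffn(w_up, w_down, n_experts, n_out_shards):
    """Hypothetical upcycling sketch: slice a dense FFN checkpoint (PyTorch
    tensors) into FineRMoE-shaped shards so the sparse model starts from
    trained weights.

    w_up:   (d_model, d_ff)  dense up-projection weight
    w_down: (d_ff, d_model)  dense down-projection weight
    Returns tensors matching BiLevelSparseFFN.w_in / .w_out above.
    """
    d_model, d_ff = w_up.shape
    d_ff_shard = d_ff // n_experts
    d_out_shard = d_model // n_out_shards
    # Split the intermediate dimension into fine-grained experts.
    w_in = w_up.reshape(d_model, n_experts, d_ff_shard).permute(1, 0, 2).contiguous()
    # Split each expert's down-projection along the output dimension as well.
    w_out = (w_down.reshape(n_experts, d_ff_shard, n_out_shards, d_out_shard)
                   .permute(0, 2, 1, 3).contiguous())
    return w_in, w_out
```

Under this assumed reading, the router is the only newly initialized component; everything else is a reshaped view of the dense checkpoint, which is what makes upcycling cheap relative to training from scratch.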