Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
Abstract
AdaMoE, a Mixture-of-Experts architecture, enhances VLA models by leveraging pretrained weights and improving computational efficiency, achieving superior performance in robotic manipulation tasks.
Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets; given the current scarcity of robot data, it is particularly valuable to fully leverage well-pretrained VLA model weights during scaling. (2) Real-time control requires carefully balancing model capacity against computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing its feedforward layers with sparsely activated MoE layers. AdaMoE decouples expert selection from expert weighting through an independent scale adapter that works alongside the traditional router. Experts are thus selected based on task relevance while contributing with independently controlled weights, enabling collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize: through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
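To make the decoupling concrete, here is a minimal PyTorch sketch of an MoE feedforward layer in which the router still selects the top-k experts by task relevance, while a separate scale adapter supplies their mixing weights. Everything here (the `ScaleAdapterMoE` class, the sigmoid scaling, `num_experts`, `top_k`) is an illustrative assumption rather than the paper's exact formulation, which may also include a load-balancing objective.

```python
# Minimal sketch of decoupled routing: the router picks WHICH experts run,
# the scale adapter decides HOW MUCH each selected expert contributes.
# Names and the sigmoid scaling are assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn


class ScaleAdapterMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)         # expert selection (task relevance)
        self.scale_adapter = nn.Linear(d_model, num_experts)  # expert weighting (independent scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        router_logits = self.router(x)                          # (batch, tokens, num_experts)
        _, topk_idx = router_logits.topk(self.top_k, dim=-1)    # indices of selected experts
        scales = torch.sigmoid(self.scale_adapter(x))           # weights decoupled from selection

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            selected = topk_idx[..., slot]                      # (batch, tokens)
            for e, expert in enumerate(self.experts):
                mask = selected == e
                if mask.any():
                    out[mask] += scales[..., e][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Compared with a standard top-k MoE, only the source of the mixing weights changes: selection still follows the router logits, but each selected expert's contribution is scaled independently, which is what permits collaboration rather than winner-takes-all behavior.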
Community
(1) We present an efficient approach to scaling up VLA models. By inheriting weights from well-pretrained VLA foundation models, we extend them into MoE architectures at low cost with well-balanced experts (see the sketch after this list).
(2) We introduce a novel MoE architecture specifically designed for VLA models. By decoupling expert selection from expert weighting, this architecture enables both effective load balancing and improved performance.
(3) We demonstrate substantial performance improvements on established benchmarks, achieving a 1.8% improvement over the $\pi_0$ baseline on LIBERO tasks and a 9.3% success-rate gain on 19 RoboTwin hard-setting tasks. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
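As a rough illustration of point (1), the sketch below "upcycles" a pretrained dense feedforward block into a set of MoE experts by copying its weights into every expert, so training starts from the dense model's behavior rather than from random initialization. The names (`DenseFFN`, `upcycle_ffn_to_moe`) and the copy-based initialization are assumptions; the exact procedure is not spelled out here.

```python
# Hypothetical sketch: initialize every MoE expert from a pretrained dense FFN
# so the sparse model inherits the dense VLA model's weights at low cost.
import copy

import torch.nn as nn


class DenseFFN(nn.Module):
    """Stand-in for a pretrained feedforward block inside the action expert."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


def upcycle_ffn_to_moe(pretrained_ffn: nn.Module, num_experts: int = 8) -> nn.ModuleList:
    """Return `num_experts` copies of the pretrained FFN to serve as MoE experts."""
    return nn.ModuleList([copy.deepcopy(pretrained_ffn) for _ in range(num_experts)])


# Example: replace the dense FFN in an action-expert block with copied experts.
experts = upcycle_ffn_to_moe(DenseFFN(d_model=1024, d_ff=4096), num_experts=8)
```

Copy-based upcycling is one common way to turn a dense layer into experts; whether AdaMoE further perturbs or specializes the copies is not stated in the abstract.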
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (2025)
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions (2025)
- UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning (2025)
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning (2025)
- Verifier-free Test-Time Sampling for Vision Language Action Models (2025)
- Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert (2025)
- The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning (2025)