Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Abstract
Vlaser, a Vision-Language-Action Model, integrates high-level reasoning with low-level control for embodied agents, achieving state-of-the-art performance on embodied reasoning benchmarks and strong results on robot control benchmarks.
While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action Model with synergistic embodied reasoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
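To make the "VLM initialization for supervised VLA fine-tuning" setup concrete, here is a minimal, hypothetical PyTorch sketch. The class names (TinyVLM, VLAPolicy), dimensions, and dummy data are illustrative placeholders and do not reflect Vlaser's actual architecture, dataset, or training recipe; the point is only the pattern of carrying pretrained VLM weights into a VLA policy and fine-tuning on demonstration tuples.

```python
# Hypothetical sketch (not the paper's implementation): initialize a VLA policy
# from a pretrained VLM backbone, then run supervised fine-tuning on
# (image, instruction, action) tuples via behavior cloning.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Stand-in for a pretrained vision-language backbone (normally loaded from a checkpoint)."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
        self.text = nn.Sequential(nn.Embedding(1000, dim), nn.Flatten(start_dim=1),
                                  nn.Linear(dim * 8, dim), nn.ReLU())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, image, tokens):
        return self.fuse(torch.cat([self.vision(image), self.text(tokens)], dim=-1))

class VLAPolicy(nn.Module):
    """VLA = VLM backbone (the initialization under study) + lightweight action head."""
    def __init__(self, backbone, action_dim=7):
        super().__init__()
        self.backbone = backbone                        # weights carried over from VLM pre-training
        self.action_head = nn.Linear(256, action_dim)   # trained from scratch on robot data

    def forward(self, image, tokens):
        return self.action_head(self.backbone(image, tokens))

# --- supervised VLA fine-tuning loop on synthetic demonstration data ---
backbone = TinyVLM()                        # in practice: load embodied-reasoning VLM weights here
policy = VLAPolicy(backbone)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)

images = torch.randn(64, 3, 32, 32)         # robot camera observations (dummy)
tokens = torch.randint(0, 1000, (64, 8))    # tokenized language instructions (dummy)
actions = torch.randn(64, 7)                # expert end-effector actions (dummy)

for step in range(100):
    pred = policy(images, tokens)
    loss = nn.functional.mse_loss(pred, actions)   # behavior-cloning objective
    opt.zero_grad(); loss.backward(); opt.step()
```

In this framing, the paper's question of "which VLM initialization works best" corresponds to swapping different pretrained weights into the backbone before fine-tuning, while the action head is always learned on the embodied policy data.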
Community
Hi everyone, please see our latest paper, Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning, which achieves top-tier results on embodied reasoning benchmarks and discusses transfer learning from VLMs to VLAs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control (2025)
- Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation (2025)
- UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning (2025)
- Igniting VLMs toward the Embodied Space (2025)
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions (2025)
- VLA-R1: Enhancing Reasoning in Vision-Language-Action Models (2025)
- VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (2025)