
RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


Model Description

This model was trained from gen-robot/openvla-7b-rlvla-warmup using Group Relative Policy Optimization (GRPO) in the ManiSkill simulator. The checkpoint is released in safetensors format (8B parameters, BF16).

Full Out-of-Distribution (OOD) Evaluation Results

Overall Eval Results

Note: rl4vla refers to the paper "VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study."

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.7915 | 0.6064 | 0.7705 | 0.8193 | 0.7515 |

Training Setting Eval

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.9375 | 0.9414 | 0.9766 | 0.9609 | 0.8438 |

OOD Eval on Vision

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| vision avg | 0.8047 | 0.8469 | 0.9211 | 0.8203 | 0.7469 |
| unseen table | 0.9063 | 0.9141 | 0.9648 | 0.9570 | 0.8984 |
| dynamic texture (weak) | 0.8516 | 0.9102 | 0.9492 | 0.8555 | 0.7891 |
| dynamic texture (strong) | 0.7500 | 0.7734 | 0.8633 | 0.7227 | 0.6563 |
| dynamic noise (weak) | 0.8281 | 0.8945 | 0.9805 | 0.8711 | 0.7969 |
| dynamic noise (strong) | 0.6875 | 0.7422 | 0.8477 | 0.6953 | 0.5938 |

OOD Eval on Semantics

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| object avg | 0.7500 | 0.4553 | 0.6484 | 0.7835 | 0.7299 |
| unseen objects | 0.8281 | 0.8047 | 0.8594 | 0.8164 | 0.7656 |
| unseen receptacles | 0.6875 | 0.7422 | 0.8750 | 0.8125 | 0.7344 |
| unseen instructions | 0.8203 | 0.6797 | 0.7109 | 0.9453 | 0.8906 |
| multi-object (both seen) | 0.7891 | 0.3516 | 0.6055 | 0.8438 | 0.7578 |
| multi-object (both unseen) | 0.5703 | 0.3047 | 0.5508 | 0.6289 | 0.5781 |
| distractive receptacle | 0.8047 | 0.1875 | 0.6133 | 0.8281 | 0.7813 |
| multi-receptacle (both unseen) | 0.7500 | 0.3242 | 0.2383 | 0.6094 | 0.6016 |

OOD Eval on Position

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| position avg | 0.8177 | 0.4466 | 0.7357 | 0.8542 | 0.7786 |
| unseen position (object & receptacle) | 0.7344 | 0.4023 | 0.6992 | 0.8633 | 0.7500 |
| unseen robot init pose | 0.8359 | 0.4805 | 0.7188 | 0.7773 | 0.7031 |
| mid-episode object reposition | 0.8828 | 0.4570 | 0.7891 | 0.9212 | 0.8828 |

How to Use

To use this model, integrate it with the RLinf codebase. In the configuration file `examples/embodiment/config/maniskill_grpo_openvla.yaml`, modify the following parameters:

  • Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`. A minimal sketch of these overrides follows.
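
For reference, here is a minimal sketch of the relevant YAML overrides. Only the listed keys come from this card; the surrounding structure of the config file is assumed, not reproduced, and the checkpoint path is a placeholder for wherever you downloaded the model.

```yaml
# Sketch of the overrides in examples/embodiment/config/maniskill_grpo_openvla.yaml.
# /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood is a placeholder for the
# downloaded checkpoint directory; other keys in the file are left unchanged.
actor:
  checkpoint_load_path: /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood
  tokenizer:
    tokenizer_model: /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood
  model:
    is_lora: false  # set to false when evaluating this checkpoint directly
rollout:
  model_dir: /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood
```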

License

This code repository and the model weights are licensed under the MIT License.


Model tree for RLinf/RLinf-OpenVLA-GRPO-ManiSkill3-25ood

  • Base model: openvla/openvla-7b
  • Fine-tuned from: gen-robot/openvla-7b-rlvla-warmup