RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

Model Description
This model is trained from the `gen-robot/openvla-7b-rlvla-warmup` checkpoint with Group Relative Policy Optimization (GRPO) in the ManiSkill simulator.
Full OOD Evaluation and Results
Overall Eval Results
Note: rl4vla refers to the paper *VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study*.
Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
Avg results | 0.7915 | 0.6064 | 0.7705 | 0.8193 | 0.7515 |
Training Setting Eval
Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
Avg results | 0.9375 | 0.9414 | 0.9766 | 0.9609 | 0.8438 |
OOD Eval on Vision
Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
vision avg | 0.8047 | 0.8469 | 0.9211 | 0.8203 | 0.7469 |
unseen table | 0.9063 | 0.9141 | 0.9648 | 0.9570 | 0.8984 |
dynamic texture (weak) | 0.8516 | 0.9102 | 0.9492 | 0.8555 | 0.7891 |
dynamic texture (strong) | 0.7500 | 0.7734 | 0.8633 | 0.7227 | 0.6563 |
dynamic noise (weak) | 0.8281 | 0.8945 | 0.9805 | 0.8711 | 0.7969 |
dynamic noise (strong) | 0.6875 | 0.7422 | 0.8477 | 0.6953 | 0.5938 |
OOD Eval on Semantic
Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
object avg | 0.7500 | 0.4553 | 0.6484 | 0.7835 | 0.7299 |
unseen objects | 0.8281 | 0.8047 | 0.8594 | 0.8164 | 0.7656 |
unseen receptacles | 0.6875 | 0.7422 | 0.8750 | 0.8125 | 0.7344 |
unseen instructions | 0.8203 | 0.6797 | 0.7109 | 0.9453 | 0.8906 |
multi-object (both seen) | 0.7891 | 0.3516 | 0.6055 | 0.8438 | 0.7578 |
multi-object (both unseen) | 0.5703 | 0.3047 | 0.5508 | 0.6289 | 0.5781 |
distractive receptacle | 0.8047 | 0.1875 | 0.6133 | 0.8281 | 0.7813 |
multi-receptacle (both unseen) | 0.7500 | 0.3242 | 0.2383 | 0.6094 | 0.6016 |
OOD Eval on Position
Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
position avg | 0.8177 | 0.4466 | 0.7357 | 0.8542 | 0.7786 |
unseen position (object & receptacle) | 0.7344 | 0.4023 | 0.6992 | 0.8633 | 0.7500 |
unseen robot init pose | 0.8359 | 0.4805 | 0.7188 | 0.7773 | 0.7031 |
mid-episode object reposition | 0.8828 | 0.4570 | 0.7891 | 0.9212 | 0.8828 |
How to Use
Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_grpo_openvla.yaml`:
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
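For reference, a minimal sketch of these overrides is shown below. The checkpoint path is a placeholder for wherever you stored the downloaded weights, and the nesting of the keys is assumed from the parameter names above rather than copied from the actual config file, so adapt it to the layout of your RLinf version.

```yaml
# Sketch of the relevant overrides in
# examples/embodiment/config/maniskill_grpo_openvla.yaml.
# /path/to/checkpoint is a placeholder; all other keys in the file stay unchanged.
actor:
  checkpoint_load_path: /path/to/checkpoint
  tokenizer:
    tokenizer_model: /path/to/checkpoint
  model:
    is_lora: false   # set to false when evaluating this checkpoint directly

rollout:
  model_dir: /path/to/checkpoint
```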
License
This code repository and the model weights are licensed under the MIT License.
Evaluation results
- accuracy on maniskill-train (self-reported): 84.38
- accuracy on maniskill-vision (self-reported): 74.69
- accuracy on maniskill-semantic (self-reported): 72.99
- accuracy on maniskill-position (self-reported): 77.86