
RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


Model Description

This model was trained from gen-robot/openvla-7b-rlvla-warmup using Group Relative Policy Optimization (GRPO) in the ManiSkill simulator. The checkpoint is released in safetensors format (8B parameters, BF16).

Full Out-of-Distribution (OOD) Evaluation Results

Overall Eval Results

Note: rl4vla refers to the paper "VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study."

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.7915 | 0.6064 | 0.7705 | 0.8193 | 0.7515 |

Training Setting Eval

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.9375 | 0.9414 | 0.9766 | 0.9609 | 0.8438 |

OOD Eval on Vision

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| vision avg | 0.8047 | 0.8469 | 0.9211 | 0.8203 | 0.7469 |
| unseen table | 0.9063 | 0.9141 | 0.9648 | 0.9570 | 0.8984 |
| dynamic texture (weak) | 0.8516 | 0.9102 | 0.9492 | 0.8555 | 0.7891 |
| dynamic texture (strong) | 0.7500 | 0.7734 | 0.8633 | 0.7227 | 0.6563 |
| dynamic noise (weak) | 0.8281 | 0.8945 | 0.9805 | 0.8711 | 0.7969 |
| dynamic noise (strong) | 0.6875 | 0.7422 | 0.8477 | 0.6953 | 0.5938 |

OOD Eval on Semantics

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| object avg | 0.7500 | 0.4553 | 0.6484 | 0.7835 | 0.7299 |
| unseen objects | 0.8281 | 0.8047 | 0.8594 | 0.8164 | 0.7656 |
| unseen receptacles | 0.6875 | 0.7422 | 0.8750 | 0.8125 | 0.7344 |
| unseen instructions | 0.8203 | 0.6797 | 0.7109 | 0.9453 | 0.8906 |
| multi-object (both seen) | 0.7891 | 0.3516 | 0.6055 | 0.8438 | 0.7578 |
| multi-object (both unseen) | 0.5703 | 0.3047 | 0.5508 | 0.6289 | 0.5781 |
| distractive receptacle | 0.8047 | 0.1875 | 0.6133 | 0.8281 | 0.7813 |
| multi-receptacle (both unseen) | 0.7500 | 0.3242 | 0.2383 | 0.6094 | 0.6016 |

OOD Eval on Position

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| position avg | 0.8177 | 0.4466 | 0.7357 | 0.8542 | 0.7786 |
| unseen position (object & receptacle) | 0.7344 | 0.4023 | 0.6992 | 0.8633 | 0.7500 |
| unseen robot init pose | 0.8359 | 0.4805 | 0.7188 | 0.7773 | 0.7031 |
| mid-episode object reposition | 0.8828 | 0.4570 | 0.7891 | 0.9212 | 0.8828 |

How to Use

To use this model, integrate it with the RLinf codebase. In the configuration file `examples/embodiment/config/maniskill_grpo_openvla.yaml`, modify the following parameters:

  • Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`. A minimal sketch of these overrides follows.
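
For reference, here is a minimal sketch of the relevant YAML overrides. Only the listed keys come from this card; the surrounding structure of the config file is assumed, not reproduced, and the checkpoint path is a placeholder for wherever you downloaded the model.

```yaml
# Sketch of the overrides in examples/embodiment/config/maniskill_grpo_openvla.yaml.
# /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood is a placeholder for the
# downloaded checkpoint directory; other keys in the file are left unchanged.
actor:
  checkpoint_load_path: /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood
  tokenizer:
    tokenizer_model: /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood
  model:
    is_lora: false  # set to false when evaluating this checkpoint directly
rollout:
  model_dir: /path/to/RLinf-OpenVLA-GRPO-ManiSkill3-25ood
```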

License

This code repository and the model weights are licensed under the MIT License.


Model tree for RLinf/RLinf-OpenVLA-GRPO-ManiSkill3-25ood

  • Base model: openvla/openvla-7b
  • Fine-tuned from: gen-robot/openvla-7b-rlvla-warmup