RLDX-1-VLM

Paper · Project page · Code · Models

RLDX-1-VLM is the vision-language backbone used by the RLDX-1 robot policy family. It is a Qwen3-VL-8B-Instruct checkpoint distributed separately from the action policy so that researchers can inspect, finetune, or replace the perceptual stack independently.

Note. This checkpoint exposes a standard Qwen3-VL VLM interface only — it does not ship the Multi-Stream Action Transformer head, the cognition tokens, the memory / motion / physics modules, or the RLDX inference server. For action prediction, use one of the RLDX-1-PT, RLDX-1-FT-*, or RLDX-1-MT-* checkpoints.

Intended use

  • As the --backbone-path for finetuning a fresh RLDX-1 policy (recipe).
  • For VLM-only ablations, dense-captioning experiments, or perceptual probing within the RLDX research stack.

Quick start

from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the backbone in bfloat16 on a single GPU.
model = AutoModelForImageTextToText.from_pretrained(
    "RLWRLD/RLDX-1-VLM",
    torch_dtype="bfloat16",
    device_map="cuda:0",
)

# The processor handles image preprocessing and chat-template tokenization.
processor = AutoProcessor.from_pretrained("RLWRLD/RLDX-1-VLM")
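A minimal inference sketch on top of the quick start. The image URL and prompt below are placeholders, and the commented generation path follows the standard transformers chat-template usage for Qwen-VL-style models; it is not an RLDX-specific API. Remember that this backbone emits text only, not robot commands.

```python
# Qwen-VL-style multimodal chat message: one image plus a text instruction.
# The URL and prompt are illustrative placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/tabletop_scene.jpg"},
            {"type": "text", "text": "List the objects on the table."},
        ],
    }
]

# With `model` and `processor` loaded as in the quick start, the usual
# transformers chat-template path produces a caption:
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt",
#   ).to(model.device)
#   output_ids = model.generate(**inputs, max_new_tokens=128)
#   print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```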

Model details

  • Type: Vision-language model (multimodal text + image / video).
  • Backbone: Qwen/Qwen3-VL-8B-Instruct.
  • Params: 8B.
  • Role in RLDX-1: perceptual encoder for the MSAT action policy. Cognition tokens are injected into this backbone and routed through Qwen3-VL hidden states to produce a compact perceptual summary consumed by the action model.

For a full architectural walkthrough including how cognition tokens are wired into this backbone, see docs/architecture.md.

Limitations

RLDX-1-VLM is a research backbone snapshot. It is not safety-tuned beyond its Qwen3-VL upstream, and it is not intended as a general-purpose chat assistant. For action prediction it must be paired with the RLDX-1 policy head; the standalone VLM does not produce robot commands.

Citation

@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}

License

Released under the RLWRLD Model License v1.0 — a non-commercial license with attribution and share-alike requirements. See LICENSE.md for the full text. By using this model you agree to those terms, including the use restrictions in §3.5.
