VLAdaptorBench

This repository contains the benchmark setup, metric code, debug history, and validation artifacts for the proposed VLA + adaptor label study on bimanual_take_tray_out_of_oven.

This is still a label-validation repository, not a policy repository. No pi0.5 integration is included here.

Current Status

The latest work behind this upload produced:

  • metric_iter30_full100_single_pass_full_logging_fixed_templates_merged
    • merged 100-episode dense/fuller-logging result tree from the single-pass fixed-template run

The current Hub upload includes:

  • artifacts/results/metric_iter31_sample10_all_metrics_verify/
    • compact 10-episode verification subset with all_metrics GIFs only
  • the fast all_metrics-only render path in:
    • code/scripts/render_oven_metric_frame.py
    • code/scripts/render_oven_metric_gifs.py

The new sample verification bundle is meant to be the quickest remote sanity-check entry point. It includes the sampled dense/keyframe tables, per-episode metrics, fuller debug sidecars, fixed templates, selection metadata, and one compact full-metrics GIF per sampled episode.

The earlier metric_iter29_ep0_single_pass_full_logging_fixed_templates validation pass for episode 0 remains the detailed single-episode reference for the fuller debug logging and the debug-aware GIF renderer.

That run keeps the trusted iter24 template bundle fixed, adds the fuller dense/debug logging in a single pass, and regenerates the episode-0 visualization suite from the richer artifact. It is the current reference for:

  • the episode0.debug.jsonl sidecar with per-frame p_pre and p_ext internals
  • the single-pass dense CSV with fuller logged sub-metrics
  • the updated path_quality_focus GIF that now exposes the p_ext milestone search, milestone scores, and planner outcomes directly in the visualization

The earlier metric_iter24_*_door_contact_geom reruns for episodes 0 and 1 remain the trusted baseline for the repaired oven metrics.

That rerun fixes the main simulator-state bugs that were still contaminating the oven metrics:

  1. The reveal-to-retrieve transition used to occur too late, effectively at grasp time.
  2. The visibility metric used to drop to zero around frame 232 even when the tray grasp region was clearly visible in wrist_left.
  3. p_pre stayed near zero until grasp.
  4. Extraction labels could flicker or drift because oracle rollouts were not restoring the simulator state exactly.
  5. The old dense runner's restore-heavy path could still bias later frames after an oracle call.

The current code addresses those issues by:

  • decoding RLBench mask PNGs correctly before converting them back to simulator handles
  • scoring visibility directly from mask-handle agreement instead of the old depth/z heuristic
  • inferring tray mask handles from grasp-region projections
  • deriving a late-window pregrasp approach template instead of accidentally including frame-8 arm poses
  • adding explicit pregrasp_progress, pregrasp_distance, pregrasp_speed, and phase_score
  • making the repair path batch frames sequentially per worker so late-frame rows do not drift
  • snapshotting and restoring exact arm joints, gripper joints, and the full grasped-object subtree
  • supporting and now preferring --independent-replay for the authoritative dense study
  • tightening y_pre so it stays on once the retriever is clearly inside the pregrasp corridor
  • retuning phase_score so it tracks the reveal-to-retrieve handoff instead of generic early motion
  • recomputing intervention validity from isolated per-frame env replays instead of the old live-cache path
  • sampling intervention states earlier in the reveal phase so pre-ready extract checks are not contaminated by borderline near-ready states
  • confirming extraction feasibility with repeated planner checks inside the extract oracle so one lucky planner sample is less likely to flip a label

The old iter4_*, iter6_*, iter19_*, and iter22_* outputs are still useful historical checkpoints, but the current authoritative outputs are:

  • artifacts/results/metric_iter24_ep0_door_contact_geom/
  • artifacts/results/metric_iter24_ep1_door_contact_geom/
  • artifacts/results/metric_iter29_ep0_single_pass_full_logging_fixed_templates/

The main new fix in iter24 is the assisted-door contact scoring inside p_pre:

  • the old ignore_collisions=True branch treated oven-door contact as name-whitelisted and only checked the final door angle change
  • the new scorer traces door contacts step-by-step, estimates the local door-surface normal from simulator geometry, scores whether the retriever is sliding along the door or pushing it open, and penalizes direct head-on contact or door-closing motion
  • this specifically removes the false closed-door p_pre spike in episode 0 around frames 43-56 without collapsing the later pregrasp rise once the door is actually opening

The current repo state should therefore be treated as the repaired benchmark snapshot with geometry-aware door assistance, not the final metric design.

Brief caveat: the current y_ready label still gates on low oven-door angular speed after extraction feasibility persists. In this task, the retriever arm can legitimately nudge the door while already committing to retrieval, so y_ready can still switch later than the true reveal-to-retrieve boundary. For the current oven benchmark, y_ready should therefore not be treated as a decisive validation metric or a trusted phase-switch target.

The oven task also has a highly structured reveal-to-retrieve handoff in the expert demos: both arms reposition, the revealer opens and clears the door, then the retriever commits. Because that phase pattern is so standardized, good results on this task are most useful as a task-specific smoke test or a "does the adaptor beat a base finetune here?" check, not as strong evidence of general reveal-and-retrieve reasoning.

What Is In This Upload

  • code/rr_label_study/
    • Core metric code.
    • Dense replay, visibility scoring, pregrasp/extraction oracles, keyframe extraction, intervention checks, and summary metric computation.
  • code/scripts/
    • Study runners and helpers.
    • run_oven_label_study.py: dense/keyframe study runner.
    • launch_parallel_oven_label_study.py: multi-display worker launcher.
    • recompute_oven_pregrasp_parallel.py: targeted dense rerun for repaired p_pre labels.
    • run_oven_pregrasp_batch.py: sequential per-worker pregrasp recomputation helper.
    • refresh_saved_oven_study.py: recompute keyframes, per-episode metrics, intervention stats, and summary JSONs from saved dense CSVs after metric-code changes.
    • run_oven_single_frame.py: single-frame recomputation helper.
    • run_oven_frame_batch.py: new sequential batch recomputation helper used to avoid late-frame drift.
    • repair_oven_episode_dense.py: batched repair pass for suspicious dense rows.
    • render_oven_metric_frame.py: per-frame visualization renderer.
    • render_oven_metric_gifs.py: GIF renderer.
    • The visualization renderer now accepts either legacy templates.pkl files or the newer authoritative templates.json bundles.
  • artifacts/results/
    • Full debug history, including stale runs and current validation outputs.
  • runtime_assets/
    • Archived runtime assets needed to recreate this setup on another machine.
    • Includes the local oven-task dataset snapshot and the local coppelia_sim extraction used on this machine.
  • snapshots/
    • Compressed snapshot archives of large local payloads that were uploaded as bundles instead of expanded repo trees.
    • In particular, snapshots/VLAdaptorBench_upload_repo_payload_20260408.tar.gz contains the broader local upload tree used on this machine.
  • environment/
    • Machine snapshot, env export, pip freeze, setup helpers, and dataset notes.
  • external/
    • Local source snapshots of RLBench, PyRep, PerAct bimanual, and YARR used for this work.
  • MANIFEST.txt
    • Flat file listing of the upload contents.

Latest Metric Fixes

The latest code changes are in:

  • code/rr_label_study/oven_study.py
  • code/scripts/recompute_oven_pregrasp_parallel.py
  • code/scripts/run_oven_pregrasp_batch.py
  • code/scripts/repair_oven_episode_dense.py
  • code/scripts/run_oven_frame_batch.py
  • code/scripts/render_oven_metric_frame.py

The important changes are:

1. Visibility metric repair

  • _load_mask() now rescales stored mask PNGs back to [0, 1] before calling rgb_handles_to_mask.
  • Visibility is now computed by projecting grasp-region or whole-tray points into each camera and checking whether the decoded mask handle at the projected pixel matches the inferred tray handles.
  • Template derivation now infers mask_handle_ids from reference frames near the actual pregrasp/grasp window.

This fixes the old failure where visibility dropped to zero even when the tray lip was visibly present in the wrist camera.
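The decode-then-match path described above can be sketched as follows. This is a minimal illustration, not the repository's code: decode_mask_png and point_visibility are hypothetical stand-ins for the actual _load_mask / projection helpers, and the handle packing follows the usual RLBench rgb_handles_to_mask convention.

```python
import numpy as np

def decode_mask_png(mask_u8):
    # Stored mask PNGs hold RGB-encoded handles in [0, 255]; rescale back
    # to [0, 1] first, then recover the integer channels and pack them into
    # a single simulator handle per pixel.
    rgb = mask_u8.astype(np.float64) / 255.0
    rgb = np.round(rgb * 255.0).astype(np.int64)
    return rgb[..., 0] + rgb[..., 1] * 256 + rgb[..., 2] * 256 ** 2

def point_visibility(handle_map, pixels, tray_handles):
    # Fraction of projected tray points whose decoded mask handle matches
    # one of the inferred tray handles (in-bounds points only).
    h, w = handle_map.shape
    hits, total = 0, 0
    for u, v in pixels:
        if 0 <= v < h and 0 <= u < w:
            total += 1
            hits += int(handle_map[v, u] in tray_handles)
    return hits / total if total else 0.0
```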

2. Pregrasp/path metric repair

  • Template extraction now detects the pregrasp approach onset in a bounded late window before grasp instead of taking the first small negative slope in the entire episode.
  • The current template approach frames for episode 0 are now:
    • 177, 187, 197, 208, 218, 229, 232
  • p_pre now uses the last few approach templates plus explicit geometric progress toward the pregrasp pose instead of only brittle planner success.
  • y_pre now treats "already inside the pregrasp corridor" as success, which is appropriate for this oracle study.
  • The assisted pregrasp branch no longer treats oven-door collisions as a binary whitelist:
    • it traces per-step door contacts under ignore_collisions=True
    • estimates a local door-surface normal from the contacted simulator shape
    • rewards tangential or door-opening contact
    • penalizes head-on or door-closing contact
    • requires a minimum geometry-aware door-contact quality before assisted p_pre credit is given
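The tangential-versus-head-on distinction above reduces to a pair of dot products against the estimated door normal and the door-opening direction. The sketch below is illustrative only: door_contact_quality, the 0.5 closing cutoff, and the max-based reward are hypothetical simplifications of the repository's per-step contact tracing.

```python
import numpy as np

def door_contact_quality(contact_normal, retriever_velocity, door_opening_dir):
    # Score one contact step: high for sliding along the door surface or
    # pushing it open, low for head-on or door-closing contact.
    v = np.asarray(retriever_velocity, dtype=float)
    speed = np.linalg.norm(v)
    if speed < 1e-9:
        return 0.0
    v_hat = v / speed
    n = np.asarray(contact_normal, dtype=float)
    n = n / np.linalg.norm(n)
    head_on = abs(float(v_hat @ np.asarray(door_opening_dir) * 0 + v_hat @ n))  # 1.0 = straight into door
    opening = float(v_hat @ np.asarray(door_opening_dir))  # > 0 = helping the door open
    tangential = 1.0 - head_on                             # sliding along the surface
    score = max(tangential, opening)
    if opening < -0.5:                                     # clearly closing the door
        score = 0.0
    return float(np.clip(score, 0.0, 1.0))
```

A real scorer would aggregate this per-step quality over the contact trace and require a minimum before granting assisted p_pre credit.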

3. Replay/repair correctness

  • The old isolated repair path replayed every suspicious frame from a fresh reset, which could corrupt late rows.
  • The new helper run_oven_frame_batch.py computes frame rows sequentially inside a single env per worker.
  • repair_oven_episode_dense.py now distributes frame batches, not individual frames, across displays.
  • SimulatorSnapshot now restores:
    • arm joint trees and explicit joint positions
    • gripper joint trees and explicit joint positions
    • the full subtree under any grasped object
    • grasp attachments with the original release parent
  • ReplayCache now keeps retrying stable grasp attachment while the demo gripper remains closed.

This fixed the major replay bug where post-oracle restores could leave the arm, gripper, or grasped tray in a subtly different state than the true demo frame.
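The snapshot/restore contract can be summarized as: capture everything an oracle rollout can perturb, then put it back exactly. The sketch below uses a plain dict as a stand-in for the simulator state; the field names are illustrative, not the actual SimulatorSnapshot API.

```python
from dataclasses import dataclass, field

@dataclass
class SimulatorSnapshot:
    # Everything an oracle rollout can perturb must be captured here.
    arm_joints: list = field(default_factory=list)
    gripper_joints: list = field(default_factory=list)
    grasped_subtree_poses: dict = field(default_factory=dict)  # object name -> pose
    grasp_parent: object = None  # original release parent of the attachment

def snapshot(env):
    # Deep-copy the mutable pieces so later simulation cannot alias them.
    return SimulatorSnapshot(
        arm_joints=list(env["arm_joints"]),
        gripper_joints=list(env["gripper_joints"]),
        grasped_subtree_poses=dict(env["grasped_subtree_poses"]),
        grasp_parent=env.get("grasp_parent"),
    )

def restore(env, snap):
    # Restore exact joint positions and the full grasped-object subtree.
    env["arm_joints"] = list(snap.arm_joints)
    env["gripper_joints"] = list(snap.gripper_joints)
    env["grasped_subtree_poses"] = dict(snap.grasped_subtree_poses)
    env["grasp_parent"] = snap.grasp_parent
```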

4. Earlier phase signal

  • The code now records:
    • pregrasp_progress
    • pregrasp_distance
    • pregrasp_speed
    • phase_score
  • phase_score is now dominated by actual approach progress and p_pre, with a stricter threshold (0.5) so it no longer flips during the early reveal phase.
  • y_retrieve is still oracle-like and monotone, but the metric side now has a cleaner approach-sensitive signal for early switching.
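A minimal sketch of the thresholded blend described above. The 0.7/0.3 weighting is illustrative only; the repository's exact mix may differ, but the structure (progress-dominated score, stricter 0.5 threshold) matches the description.

```python
def phase_score(pregrasp_progress, p_pre, threshold=0.5):
    # Blend actual approach progress with p_pre, dominated by progress,
    # then apply the stricter threshold so early reveal-phase motion
    # does not flip the phase signal.
    score = 0.7 * pregrasp_progress + 0.3 * p_pre
    return score, score >= threshold
```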

5. Independent replay

  • run_oven_label_study.py already exposed --independent-replay.
  • launch_parallel_oven_label_study.py now passes that flag through to worker runs.
  • For the current oven study, independent replay is the trustworthy dense mode because it avoids cross-frame contamination from oracle rollouts.

6. Intervention validity repair

  • The old intervention summary reused the dense-study replay cache, which could still corrupt post-ready extract checks.
  • _interventional_validity() now evaluates each sampled intervention state from a fresh env/replay instance.
  • refresh_saved_oven_study.py now supports --dataset-root so intervention metrics can be recomputed instead of copied forward from stale JSON.
  • The refined intervention protocol now samples pre-ready states at ready_onset-20 and ready_onset-10 instead of ready_onset-10 and ready_onset-5, which avoids counting borderline almost-ready states as generic reveal-phase interventions.
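The refined sampling offsets amount to a one-liner; the helper name below is hypothetical, but the offsets match the protocol described above.

```python
def pre_ready_sample_frames(ready_onset, offsets=(20, 10)):
    # Sample pre-ready intervention states at ready_onset-20 and
    # ready_onset-10 (clamped to frame 0), keeping samples away from
    # borderline almost-ready states near the boundary.
    return [max(0, ready_onset - off) for off in offsets]
```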

7. Extraction-oracle hardening

  • _extract_score_and_success() now uses repeated planner checks before marking a milestone as feasible.
  • The current configuration is intentionally modest:
    • DEFAULT_PLAN_ATTEMPTS = 2
    • DEFAULT_PLAN_MIN_SUCCESSES = 2
  • This only hardens the extraction oracle, not the pregrasp score, so the dense study remains tractable while the noisy pre-ready extract successes are suppressed.
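The repeated-check logic is simple majority-style gating over a stochastic planner. This is a sketch, not the actual _extract_score_and_success() body; plan_fn stands in for a single planner attempt.

```python
DEFAULT_PLAN_ATTEMPTS = 2
DEFAULT_PLAN_MIN_SUCCESSES = 2

def milestone_feasible(plan_fn,
                       attempts=DEFAULT_PLAN_ATTEMPTS,
                       min_successes=DEFAULT_PLAN_MIN_SUCCESSES):
    # Call the (stochastic) planner several times and require
    # min_successes successes before declaring the milestone feasible,
    # so one lucky planner sample cannot flip the extraction label.
    successes = sum(1 for _ in range(attempts) if plan_fn())
    return successes >= min_successes
```

With attempts = min_successes = 2, every attempt must succeed, which is the modest configuration noted above.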

Latest Validated Artifacts

The current trustworthy artifacts are:

  • artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.dense.csv
  • artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.keyframes.csv
  • artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.metrics.json
  • artifacts/results/metric_iter24_ep0_door_contact_geom/summary.json
  • artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_all_metrics.gif
  • artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_visibility_focus.gif
  • artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_path_quality_focus.gif
  • artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.dense.csv
  • artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.keyframes.csv
  • artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.metrics.json
  • artifacts/results/metric_iter24_ep1_door_contact_geom/summary.json
  • artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_all_metrics.gif
  • artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_visibility_focus.gif
  • artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_path_quality_focus.gif
  • artifacts/results/oven_episode0_iter4_templates/templates.json
  • artifacts/results/oven_episode0_iter4_templates/templates.pkl
  • artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv
  • artifacts/results/oven_episode0_iter4_batch/frames/
  • artifacts/results/oven_episode0_iter4_clean/iter4_targeted_comparison.csv
  • artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv
  • artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png
  • artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv
  • artifacts/results/oven_episode0_iter6_independent_full/episode0.keyframes.csv
  • artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json
  • artifacts/results/oven_episode0_iter6_independent_full/summary.json
  • artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png
  • artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv
  • artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json
  • artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif
  • artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif
  • artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif
  • artifacts/results/manual_metric_checks/episode0_frame210_visibility.png
  • artifacts/results/manual_metric_checks/episode0_frame232_visibility.png
  • artifacts/results/manual_metric_checks/episode0_frame210_path.png
  • artifacts/results/manual_metric_checks/episode6_frame230_path.png
  • artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json
  • artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_workers.json

The iter6_independent_full CSVs and JSON summaries have been refreshed with the latest phase_score logic via code/scripts/refresh_saved_oven_study.py.

Key Verified Findings

From the current independent-replay validation on episode 0:

  • Visibility over the dense 170-234 window is clean:
    • min three_view_visibility = 1.0
    • min full_view_visibility = 1.0
  • Pregrasp progress now rises well before grasp and stays predictive:
    • frame 210: pregrasp_progress ≈ 0.451, p_pre ≈ 0.185, y_pre = 0
    • frame 215: pregrasp_progress ≈ 0.568, p_pre ≈ 0.375, y_pre = 1
    • frame 220: pregrasp_progress ≈ 0.702, p_pre ≈ 0.496, y_pre = 1
    • frame 225: pregrasp_progress ≈ 0.847, p_pre ≈ 0.559, y_pre = 1
    • frame 230: pregrasp_progress ≈ 0.950, p_pre ≈ 0.654, y_pre = 1
  • Extraction feasibility is now separated from pregrasp:
    • frame 230: p_ext ≈ 0.0007, y_ext = 0
    • frame 232: p_ext = 1.0, y_ext = 1
    • frame 234: p_ext = 1.0, y_ext = 1
  • In the refreshed full independent episode-0 run:
    • ppre_cross_frame = 216
    • pext_cross_frame = 232
    • phase_cross_frame = 214
    • retrieve_cross_frame = 215
    • ready_cross_frame = 234
    • single_switch_rate = 1.0
    • reversion_rate = 0.0
    • auroc_ppre_ypre ≈ 0.761
    • auprc_ppre_ypre ≈ 0.903
    • auroc_pext_yext = 1.0
    • auprc_pext_yext = 1.0
    • auroc_phase_yretrieve = 1.0
    • auprc_phase_yretrieve = 1.0
    • f1_phase_yretrieve ≈ 0.996
    • auroc_phase_yready ≈ 0.998
    • f1_phase_yready ≈ 0.905
  • In the refreshed isolated intervention check on episode 0:
    • pre-ready open_more increases p_ext on 2/2 sampled states
    • pre-ready extract succeeds on 0/2
    • post-ready extract succeeds on 2/2
    • post-ready open_more and hold_open both have low marginal gain on 2/2
  • The refreshed phase columns now place:
    • first phase_switch at frame 214
    • first y_retrieve at frame 215
    • first y_ready at frame 234
  • The refined 8-episode independent-replay smoke in artifacts/results/iter12_parallel_smoke_8ep_refined/ shows:
    • single_switch_rate = 1.0
    • reversion_rate = 0.0
    • mean auroc_ppre_ypre ≈ 0.809
    • mean auprc_ppre_ypre ≈ 0.924
    • mean auroc_pext_yext = 1.0
    • mean auprc_pext_yext = 1.0
    • mean f1_phase_yretrieve ≈ 0.996
    • mean f1_phase_yready ≈ 0.906
    • mean dense boundary error to y_retrieve ≈ 0.88 frames
    • mean pre-ready extract success = 0.0/2.0
    • mean pre-ready wait extract success = 0.0/2.0
    • mean post-ready extract success ≈ 1.625/2.0
  • The main remaining limitation on this oven task is not a broken metric but task structure:
    • the grasp-region visibility metric is visually faithful but only weakly predictive because the tray lip is already visible early in many demos
    • time remains a very strong trivial baseline for y_ext on expert demos
    • open_more improves p_ext mainly near the reveal/retrieve boundary, not uniformly throughout the whole pre-ready window

See:

  • artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv
  • artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv
  • artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png
  • artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv
  • artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json
  • artifacts/results/manual_metric_checks/episode0_frame210_visibility.png
  • artifacts/results/manual_metric_checks/episode0_frame232_visibility.png
  • artifacts/results/manual_metric_checks/episode0_frame210_path.png
  • artifacts/results/manual_metric_checks/episode6_frame230_path.png
  • artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json

Artifact Guide

Current artifacts

  • oven_episode0_iter3_templates/
    • First regenerated template bundle after the mask/approach fixes.
  • oven_episode0_iter4_templates/
    • Current template bundle with the corrected late-window approach onset.
  • oven_episode0_iter4_clean/
    • Isolated targeted frame checks used while diagnosing the old per-frame repair drift.
  • oven_episode0_iter4_batch/
    • Current batched sequential repair validation.
  • oven_episode0_iter4_dense_geom_170_234.csv
    • Dense sequential geometry and visibility sweep across the reveal-to-retrieve boundary.

Historical artifacts

  • oven_episode0_repaired_v1/
    • Useful historical reference, but not the current authoritative artifact.
    • It still contains the old late transition and old visibility/path issues.
  • oven_episode0_full*/, oven_to240_*/, oven_episode0_independent_v*/
    • Debugging history from earlier iterations.
  • parallel_smoke_2x10/
    • Xvfb/worker parallelization smoke test.
  • oven_smoke_*
    • Early smoke runs.

Repository Map

Relevant entry points:

  • code/rr_label_study/oven_study.py
  • code/scripts/run_oven_label_study.py
  • code/scripts/launch_parallel_oven_label_study.py
  • code/scripts/run_oven_single_frame.py
  • code/scripts/run_oven_frame_batch.py
  • code/scripts/repair_oven_episode_dense.py
  • code/scripts/render_oven_metric_frame.py
  • code/scripts/render_oven_metric_gifs.py

Relevant current artifacts:

  • artifacts/results/oven_episode0_iter4_templates/templates.json
  • artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv
  • artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv
  • artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv
  • artifacts/results/oven_episode0_iter6_independent_full/summary.json
  • artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png
  • artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png
  • artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv
  • artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json
  • artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif
  • artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif
  • artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif

Environment

This was run on:

  • Ubuntu 22.04.5
  • Kernel 6.8.0-65-generic
  • 96 CPU cores visible
  • 503 GiB RAM visible
  • NVIDIA A40

See:

  • environment/system_info.txt
  • environment/repo_revisions.txt
  • environment/conda_env_rlbench.yml
  • environment/pip_freeze_rlbench.txt

Upstream Repos Used

Exact revisions are recorded in environment/repo_revisions.txt.

The local run used:

  • markusgrotz/RLBench
  • markusgrotz/PyRep
  • markusgrotz/peract_bimanual
  • markusgrotz/YARR

Those source snapshots are included under external/.

Reproducing On The Same Hardware Class

  1. Read environment/dataset_notes.txt.
  2. Run environment/setup_same_hardware.sh /workspace.
  3. Source environment/activate_rlbench_runtime.sh /workspace.
  4. Run the dense study:
python /workspace/VLAdaptorBench_upload/code/scripts/run_oven_label_study.py \
  --dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
  --result-dir /workspace/tmp_run \
  --max-episodes 1 \
  --checkpoint-stride 16 \
  --template-episode-index 0 \
  --independent-replay
  5. If you want to repair suspicious frames in parallel with the new batched path:
python /workspace/VLAdaptorBench_upload/code/scripts/repair_oven_episode_dense.py \
  --dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
  --episode-dir /workspace/data/bimanual_take_tray_out_of_oven_train_128/all_variations/episodes/episode0 \
  --input-dense-csv /workspace/tmp_run/episode0.dense.csv \
  --output-dir /workspace/tmp_run_repaired \
  --checkpoint-stride 16 \
  --num-workers 4 \
  --base-display 170

Important Note

The full 100-episode independent-replay run is not yet the authoritative artifact in this upload. The current repository state documents the repaired metric code, the exact snapshot/restore fixes, and the episode-0 independent validation that is required before scaling to the full study.

Dataset Note

The RLBench demonstration dataset itself is not re-uploaded here. This repository contains the study code and generated artifacts only. The expected dataset path is documented in environment/dataset_notes.txt.

CoppeliaSim binaries are not included. The setup helpers expect a local extraction at /workspace/coppelia_sim.
