VLAdaptorBench
This repository contains the benchmark setup, metric code, debug history, and validation artifacts for the proposed VLA + adaptor label study on bimanual_take_tray_out_of_oven.
This is still a label-validation repository, not a policy repository. No pi0.5 integration is included here.
Current Status
The latest work behind this upload produced:
- metric_iter30_full100_single_pass_full_logging_fixed_templates_merged: merged 100-episode dense/fuller-logging result tree from the single-pass fixed-template run
The current Hub upload includes:
- artifacts/results/metric_iter31_sample10_all_metrics_verify/: compact 10-episode verification subset with all_metrics GIFs only
- the fast all_metrics-only render path in:
  - code/scripts/render_oven_metric_frame.py
  - code/scripts/render_oven_metric_gifs.py
The new sample verification bundle is meant to be the quickest remote sanity-check entry point. It includes the sampled dense/keyframe tables, per-episode metrics, fuller debug sidecars, fixed templates, selection metadata, and one compact full-metrics GIF per sampled episode.
The earlier metric_iter29_ep0_single_pass_full_logging_fixed_templates validation pass for episode 0 remains the detailed single-episode reference for the fuller debug logging and the debug-aware GIF renderer.
That run keeps the trusted iter24 template bundle fixed, adds the fuller dense/debug logging in a single pass, and regenerates the episode-0 visualization suite from the richer artifact. It is the current reference for:
- the episode0.debug.jsonl sidecar with per-frame p_pre and p_ext internals
- the single-pass dense CSV with fuller logged sub-metrics
- the updated path_quality_focus GIF that now exposes the p_ext milestone search, milestone scores, and planner outcomes directly in the visualization
The earlier metric_iter24_*_door_contact_geom reruns for episodes 0 and 1 remain the trusted baseline for the repaired oven metrics.
That rerun fixes the main simulator-state bugs that were still contaminating the oven metrics:
- The reveal-to-retrieve transition used to occur too late, effectively at grasp time.
- The visibility metric used to drop to zero around frame 232 even when the tray grasp region was clearly visible in wrist_left.
- p_pre stayed near zero until grasp.
- Extraction labels could flicker or drift because oracle rollouts were not restoring the simulator state exactly.
- The old dense runner's restore-heavy path could still bias later frames after an oracle call.
The current code addresses those issues by:
- decoding RLBench mask PNGs correctly before converting them back to simulator handles
- scoring visibility directly from mask-handle agreement instead of the old depth/z heuristic
- inferring tray mask handles from grasp-region projections
- deriving a late-window pregrasp approach template instead of accidentally including frame-8 arm poses
- adding explicit pregrasp_progress, pregrasp_distance, pregrasp_speed, and phase_score
- making the repair path batch frames sequentially per worker so late-frame rows do not drift
- snapshotting and restoring exact arm joints, gripper joints, and the full grasped-object subtree
- supporting and now preferring --independent-replay for the authoritative dense study
- tightening y_pre so it stays on once the retriever is clearly inside the pregrasp corridor
- retuning phase_score so it tracks the reveal-to-retrieve handoff instead of generic early motion
- recomputing intervention validity from isolated per-frame env replays instead of the old live-cache path
- sampling intervention states earlier in the reveal phase so pre-ready extract checks are not contaminated by borderline near-ready states
- confirming extraction feasibility with repeated planner checks inside the extract oracle so one lucky planner sample is less likely to flip a label
The old iter4_*, iter6_*, iter19_*, and iter22_* outputs are still useful historical checkpoints, but the current authoritative outputs are:
- artifacts/results/metric_iter24_ep0_door_contact_geom/
- artifacts/results/metric_iter24_ep1_door_contact_geom/
- artifacts/results/metric_iter29_ep0_single_pass_full_logging_fixed_templates/
The main new fix in iter24 is the assisted-door contact scoring inside p_pre:
- the old ignore_collisions=True branch treated oven-door contact as name-whitelisted and only checked the final door angle change
- the new scorer traces door contacts step-by-step, estimates the local door-surface normal from simulator geometry, scores whether the retriever is sliding along the door or pushing it open, and penalizes direct head-on contact or door-closing motion
- this specifically removes the false closed-door p_pre spike in episode 0 around frames 43-56 without collapsing the later pregrasp rise once the door is actually opening
The current repo state should therefore be treated as the repaired benchmark snapshot with geometry-aware door assistance, not the final metric design.
Brief caveat: the current y_ready label still gates on low oven-door angular speed after extraction feasibility persists. In this task, the retriever arm can legitimately nudge the door while already committing to retrieval, so y_ready can still switch later than the true reveal-to-retrieve boundary. For the current oven benchmark, y_ready should therefore not be treated as a decisive validation metric or a trusted phase-switch target.
The oven task also has a highly structured reveal-to-retrieve handoff in the expert demos: both arms reposition, the revealer opens and clears the door, then the retriever commits. Because that phase pattern is so standardized, good results on this task are most useful as a task-specific smoke test or a "does the adaptor beat a base finetune here?" check, not as strong evidence of general reveal-and-retrieve reasoning.
What Is In This Upload
- code/rr_label_study/
  - Core metric code.
  - Dense replay, visibility scoring, pregrasp/extraction oracles, keyframe extraction, intervention checks, and summary metric computation.
- code/scripts/
  - Study runners and helpers.
  - run_oven_label_study.py: dense/keyframe study runner.
  - launch_parallel_oven_label_study.py: multi-display worker launcher.
  - recompute_oven_pregrasp_parallel.py: targeted dense rerun for repaired p_pre labels.
  - run_oven_pregrasp_batch.py: sequential per-worker pregrasp recomputation helper.
  - refresh_saved_oven_study.py: recomputes keyframes, per-episode metrics, intervention stats, and summary JSONs from saved dense CSVs after metric-code changes.
  - run_oven_single_frame.py: single-frame recomputation helper.
  - run_oven_frame_batch.py: new sequential batch recomputation helper used to avoid late-frame drift.
  - repair_oven_episode_dense.py: batched repair pass for suspicious dense rows.
  - render_oven_metric_frame.py: per-frame visualization renderer.
  - render_oven_metric_gifs.py: GIF renderer.
  - The visualization renderer now accepts either legacy templates.pkl files or the newer authoritative templates.json bundles.
- artifacts/results/
  - Full debug history, including stale runs and current validation outputs.
- runtime_assets/
  - Archived runtime assets needed to recreate this setup on another machine.
  - Includes the local oven-task dataset snapshot and the local coppelia_sim extraction used on this machine.
- snapshots/
  - Compressed snapshot archives of large local payloads that were uploaded as bundles instead of expanded repo trees.
  - In particular, snapshots/VLAdaptorBench_upload_repo_payload_20260408.tar.gz contains the broader local upload tree used on this machine.
- environment/
  - Machine snapshot, env export, pip freeze, setup helpers, and dataset notes.
- external/
  - Local source snapshots of RLBench, PyRep, PerAct bimanual, and YARR used for this work.
- MANIFEST.txt
  - Flat file listing of the upload contents.
Latest Metric Fixes
The latest code changes are in:
- code/rr_label_study/oven_study.py
- code/scripts/recompute_oven_pregrasp_parallel.py
- code/scripts/run_oven_pregrasp_batch.py
- code/scripts/repair_oven_episode_dense.py
- code/scripts/run_oven_frame_batch.py
- code/scripts/render_oven_metric_frame.py
The important changes are:
1. Visibility metric repair
- _load_mask() now rescales stored mask PNGs back to [0, 1] before calling rgb_handles_to_mask.
- Visibility is now computed by projecting grasp-region or whole-tray points into each camera and checking whether the decoded mask handle at the projected pixel matches the inferred tray handles.
- Template derivation now infers mask_handle_ids from reference frames near the actual pregrasp/grasp window.
This fixes the old failure where visibility dropped to zero even when the tray lip was visibly present in the wrist camera.
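The mask-handle agreement check described above can be illustrated with a minimal pure-Python sketch. It is not the code in oven_study.py: the function names, the pinhole intrinsics (fx, fy, cx, cy), and the nested-list mask layout are all hypothetical, and the handle encoding R + 256*G + 65536*B over 8-bit channels is assumed to match what rgb_handles_to_mask produces after rescaling.

```python
from typing import List, Optional, Sequence, Set, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B) as stored in a mask PNG


def decode_handle(pixel: Pixel) -> int:
    """Decode one mask pixel into an integer handle (assumed encoding)."""
    r, g, b = pixel
    return r + 256 * g + 65536 * b


def project_point(point_cam: Sequence[float], fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a camera-frame point to (u, v) pixel indices.

    Returns None for points at or behind the image plane.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))


def visibility_fraction(points_cam: Sequence[Sequence[float]],
                        mask_rgb: List[List[Pixel]],
                        tray_handles: Set[int],
                        fx: float, fy: float, cx: float, cy: float) -> float:
    """Fraction of sampled points whose projected pixel carries a tray handle."""
    if not points_cam:
        return 0.0
    height, width = len(mask_rgb), len(mask_rgb[0])
    hits = 0
    for point in points_cam:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            continue  # behind the camera: counts as not visible
        u, v = uv
        if 0 <= u < width and 0 <= v < height:
            if decode_handle(mask_rgb[v][u]) in tray_handles:
                hits += 1
    return hits / len(points_cam)
```

Because the score depends only on handle agreement at the projected pixel, a point hidden behind another object reads as that object's handle and counts as not visible, with no depth comparison involved.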
2. Pregrasp/path metric repair
- Template extraction now detects the pregrasp approach onset in a bounded late window before grasp instead of taking the first small negative slope in the entire episode.
- The current template approach frames for episode 0 are now: 177, 187, 197, 208, 218, 229, 232
- p_pre now uses the last few approach templates plus explicit geometric progress toward the pregrasp pose instead of only brittle planner success.
- y_pre now treats "already inside the pregrasp corridor" as success, which is appropriate for this oracle study.
- The assisted pregrasp branch no longer treats oven-door collisions as a binary whitelist:
  - it traces per-step door contacts under ignore_collisions=True
  - estimates a local door-surface normal from the contacted simulator shape
  - rewards tangential or door-opening contact
  - penalizes head-on or door-closing contact
  - requires a minimum geometry-aware door-contact quality before assisted p_pre credit is given
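One way to picture the geometry-aware door-contact rule is the sketch below. It is an illustrative scoring rule, not the scorer in oven_study.py: the function names, the sign convention (positive door_angle_delta means the door is opening), the eps dead band, and the 0.5 minimum quality are all assumptions made for the example.

```python
import math
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def step_contact_quality(velocity: Sequence[float],
                         normal: Sequence[float],
                         door_angle_delta: float,
                         eps: float = 1e-4) -> float:
    """Score one contact step in [0, 1].

    Door opening -> 1.0, door closing -> 0.0. With a static door the score
    is the tangential fraction of the motion: 1.0 for pure sliding along the
    surface, 0.0 for motion parallel to the surface normal (head-on).
    """
    if door_angle_delta > eps:
        return 1.0
    if door_angle_delta < -eps:
        return 0.0
    speed = math.sqrt(_dot(velocity, velocity))
    n_len = math.sqrt(_dot(normal, normal))
    if speed < 1e-9 or n_len < 1e-9:
        return 0.0  # degenerate input: give no credit
    cos_angle = _dot(velocity, normal) / (speed * n_len)
    return math.sqrt(max(0.0, 1.0 - cos_angle * cos_angle))


def allow_assisted_credit(step_qualities: Sequence[float],
                          min_quality: float = 0.5) -> bool:
    """Gate assisted credit on mean contact quality (hypothetical threshold).

    A rollout with no door contact has nothing to penalize and passes.
    """
    if not step_qualities:
        return True
    return sum(step_qualities) / len(step_qualities) >= min_quality
```

Scoring each step separately, rather than only the final door angle, is what lets a rule like this reject a rollout that rams a closed door even if the door ends up in an acceptable position.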
3. Replay/repair correctness
- The old isolated repair path replayed every suspicious frame from a fresh reset, which could corrupt late rows.
- The new helper run_oven_frame_batch.py computes frame rows sequentially inside a single env per worker.
- repair_oven_episode_dense.py now distributes frame batches, not individual frames, across displays.
- SimulatorSnapshot now restores:
  - arm joint trees and explicit joint positions
  - gripper joint trees and explicit joint positions
  - the full subtree under any grasped object
  - grasp attachments with the original release parent
- ReplayCache now keeps retrying stable grasp attachment while the demo gripper remains closed.
This fixed the major replay bug where post-oracle restores could leave the arm, gripper, or grasped tray in a subtly different state than the true demo frame.
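The batch-per-worker distribution can be sketched as follows. This is a hypothetical illustration rather than the logic in repair_oven_episode_dense.py: it assumes batches are contiguous frame ranges and that worker i uses display base_display + i, neither of which is confirmed by the script itself.

```python
from typing import Dict, Iterable, List


def contiguous_batches(frames: Iterable[int], num_workers: int) -> List[List[int]]:
    """Split frames into at most num_workers contiguous, ordered batches."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    ordered = sorted(frames)
    base, extra = divmod(len(ordered), num_workers)
    batches: List[List[int]] = []
    start = 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        if size == 0:
            continue  # fewer frames than workers: leave this worker idle
        batches.append(ordered[start:start + size])
        start += size
    return batches


def assign_displays(batches: List[List[int]], base_display: int) -> Dict[int, List[int]]:
    """Map each batch to an X display number (assumed base_display + index)."""
    return {base_display + index: batch for index, batch in enumerate(batches)}
```

With ten suspicious frames and four workers this yields batches of sizes 3, 3, 2, and 2, so each worker replays forward through its own frames in order inside one environment.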
4. Earlier phase signal
- The code now records:
  - pregrasp_progress
  - pregrasp_distance
  - pregrasp_speed
  - phase_score
- phase_score is now dominated by actual approach progress and p_pre, with a stricter threshold (0.5) so it no longer flips during the early reveal phase.
- y_retrieve is still oracle-like and monotone, but the metric side now has a cleaner approach-sensitive signal for early switching.
5. Independent replay
- run_oven_label_study.py already exposed --independent-replay.
- launch_parallel_oven_label_study.py now passes that flag through to worker runs.
- For the current oven study, independent replay is the trustworthy dense mode because it avoids cross-frame contamination from oracle rollouts.
6. Intervention validity repair
- The old intervention summary reused the dense-study replay cache, which could still corrupt post-ready extract checks.
- _interventional_validity() now evaluates each sampled intervention state from a fresh env/replay instance.
- refresh_saved_oven_study.py now supports --dataset-root so intervention metrics can be recomputed instead of copied forward from stale JSON.
- The refined intervention protocol now samples pre-ready states at ready_onset-20 and ready_onset-10 instead of ready_onset-10 and ready_onset-5, which avoids counting borderline almost-ready states as generic reveal-phase interventions.
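The change in pre-ready sampling offsets can be written down as a small helper. This is a sketch only: the function name and the handling of offsets that fall before the first frame (they are dropped) are assumptions, not behavior confirmed in _interventional_validity().

```python
from typing import List, Sequence


def pre_ready_sample_frames(ready_onset: int,
                            offsets: Sequence[int] = (20, 10),
                            first_frame: int = 0) -> List[int]:
    """Frames sampled before ready_onset, earliest first.

    Offsets that would land before first_frame are skipped (assumed policy).
    """
    frames: List[int] = []
    for offset in offsets:
        frame = ready_onset - offset
        if frame >= first_frame and frame not in frames:
            frames.append(frame)
    return sorted(frames)
```

For ready_onset = 234, the refined offsets give frames 214 and 224, whereas the old offsets (10, 5) gave 224 and 229, which sit much closer to the ready boundary.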
7. Extraction-oracle hardening
- _extract_score_and_success() now uses repeated planner checks before marking a milestone as feasible.
- The current configuration is intentionally modest:
  - DEFAULT_PLAN_ATTEMPTS = 2
  - DEFAULT_PLAN_MIN_SUCCESSES = 2
- This only hardens the extraction oracle, not the pregrasp score, so the dense study remains tractable while the noisy pre-ready extract successes are suppressed.
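The repeated-check idea amounts to a k-of-n confirmation around a stochastic planner call. The sketch below reuses the two constant names from the list above, but the function confirmed_feasible and the plan_once callable are hypothetical stand-ins; the early-exit behavior is an assumption about how such a check could be kept cheap, not a description of _extract_score_and_success().

```python
from typing import Callable

DEFAULT_PLAN_ATTEMPTS = 2
DEFAULT_PLAN_MIN_SUCCESSES = 2


def confirmed_feasible(plan_once: Callable[[], bool],
                       attempts: int = DEFAULT_PLAN_ATTEMPTS,
                       min_successes: int = DEFAULT_PLAN_MIN_SUCCESSES) -> bool:
    """Return True only if plan_once succeeds at least min_successes times.

    Stops early once the outcome is decided either way, so a failed first
    attempt with attempts == min_successes costs a single planner call.
    """
    if not 1 <= min_successes <= attempts:
        raise ValueError("require 1 <= min_successes <= attempts")
    successes = 0
    for index in range(attempts):
        if plan_once():
            successes += 1
        if successes >= min_successes:
            return True
        remaining = attempts - index - 1
        if successes + remaining < min_successes:
            return False
    return False
```

With both constants at 2, a milestone is marked feasible only when two independent planner samples agree, which is the cheapest setting that stops a single lucky sample from flipping a label.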
Latest Validated Artifacts
The current trustworthy artifacts are:
- artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.dense.csv
- artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.keyframes.csv
- artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.metrics.json
- artifacts/results/metric_iter24_ep0_door_contact_geom/summary.json
- artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_all_metrics.gif
- artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_visibility_focus.gif
- artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_path_quality_focus.gif
- artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.dense.csv
- artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.keyframes.csv
- artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.metrics.json
- artifacts/results/metric_iter24_ep1_door_contact_geom/summary.json
- artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_all_metrics.gif
- artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_visibility_focus.gif
- artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_path_quality_focus.gif
- artifacts/results/oven_episode0_iter4_templates/templates.json
- artifacts/results/oven_episode0_iter4_templates/templates.pkl
- artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv
- artifacts/results/oven_episode0_iter4_batch/frames/
- artifacts/results/oven_episode0_iter4_clean/iter4_targeted_comparison.csv
- artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv
- artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png
- artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv
- artifacts/results/oven_episode0_iter6_independent_full/episode0.keyframes.csv
- artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json
- artifacts/results/oven_episode0_iter6_independent_full/summary.json
- artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png
- artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv
- artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json
- artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif
- artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif
- artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif
- artifacts/results/manual_metric_checks/episode0_frame210_visibility.png
- artifacts/results/manual_metric_checks/episode0_frame232_visibility.png
- artifacts/results/manual_metric_checks/episode0_frame210_path.png
- artifacts/results/manual_metric_checks/episode6_frame230_path.png
- artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json
- artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_workers.json
The iter6_independent_full CSVs and JSON summaries have been refreshed with the latest phase_score logic via code/scripts/refresh_saved_oven_study.py.
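A refreshed dense CSV can be spot-checked without the simulator. The sketch below uses only the standard library and an in-memory sample with made-up values; the frame column name is an assumption about the CSV header, and three_view_visibility is assumed to appear as a column under the same name used for the metric.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO


def read_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Read a dense CSV (any file-like object) into a list of row dicts."""
    return list(csv.DictReader(handle))


def min_over_window(rows: List[Dict[str, str]], column: str,
                    start: int, end: int,
                    frame_column: str = "frame") -> Optional[float]:
    """Minimum of one column over frames in [start, end], or None if empty."""
    values = [
        float(row[column])
        for row in rows
        if start <= int(float(row[frame_column])) <= end
    ]
    return min(values) if values else None


# Tiny in-memory stand-in for a dense CSV; the values are illustrative only.
SAMPLE = "frame,three_view_visibility\n169,0.2\n170,1.0\n234,0.9\n235,0.1\n"
rows = read_rows(io.StringIO(SAMPLE))
print(min_over_window(rows, "three_view_visibility", 170, 234))  # prints 0.9
```

To run the same check on a real file, pass an open file handle (opened with newline="") to read_rows in place of the io.StringIO object.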
Key Verified Findings
From the current independent-replay validation on episode 0:
- Visibility over the dense 170-234 window is clean:
  - min three_view_visibility = 1.0
  - min full_view_visibility = 1.0
- Pregrasp progress now rises well before grasp and stays predictive:
  - frame 210: pregrasp_progress ≈ 0.451, p_pre ≈ 0.185, y_pre = 0
  - frame 215: pregrasp_progress ≈ 0.568, p_pre ≈ 0.375, y_pre = 1
  - frame 220: pregrasp_progress ≈ 0.702, p_pre ≈ 0.496, y_pre = 1
  - frame 225: pregrasp_progress ≈ 0.847, p_pre ≈ 0.559, y_pre = 1
  - frame 230: pregrasp_progress ≈ 0.950, p_pre ≈ 0.654, y_pre = 1
- Extraction feasibility is now separated from pregrasp:
  - frame 230: p_ext ≈ 0.0007, y_ext = 0
  - frame 232: p_ext = 1.0, y_ext = 1
  - frame 234: p_ext = 1.0, y_ext = 1
- In the refreshed full independent episode-0 run:
  - ppre_cross_frame = 216
  - pext_cross_frame = 232
  - phase_cross_frame = 214
  - retrieve_cross_frame = 215
  - ready_cross_frame = 234
  - single_switch_rate = 1.0
  - reversion_rate = 0.0
  - auroc_ppre_ypre ≈ 0.761
  - auprc_ppre_ypre ≈ 0.903
  - auroc_pext_yext = 1.0
  - auprc_pext_yext = 1.0
  - auroc_phase_yretrieve = 1.0
  - auprc_phase_yretrieve = 1.0
  - f1_phase_yretrieve ≈ 0.996
  - auroc_phase_yready ≈ 0.998
  - f1_phase_yready ≈ 0.905
- In the refreshed isolated intervention check on episode 0:
  - pre-ready open_more increases p_ext on 2/2 sampled states
  - pre-ready extract succeeds on 0/2
  - post-ready extract succeeds on 2/2
  - post-ready open_more and hold_open both have low marginal gain on 2/2
- The refreshed phase columns now place:
  - first phase_switch at frame 214
  - first y_retrieve at frame 215
  - first y_ready at frame 234
- The refined 8-episode independent-replay smoke in artifacts/results/iter12_parallel_smoke_8ep_refined/ shows:
  - single_switch_rate = 1.0
  - reversion_rate = 0.0
  - mean auroc_ppre_ypre ≈ 0.809
  - mean auprc_ppre_ypre ≈ 0.924
  - mean auroc_pext_yext = 1.0
  - mean auprc_pext_yext = 1.0
  - mean f1_phase_yretrieve ≈ 0.996
  - mean f1_phase_yready ≈ 0.906
  - mean dense boundary error to y_retrieve ≈ 0.88 frames
  - mean pre-ready extract success = 0.0/2.0
  - mean pre-ready wait extract success = 0.0/2.0
  - mean post-ready extract success ≈ 1.625/2.0
- The main remaining limitation on this oven task is not a broken metric but task structure:
  - the grasp-region visibility metric is visually faithful but only weakly predictive because the tray lip is already visible early in many demos
  - time remains a very strong trivial baseline for y_ext on expert demos
  - open_more improves p_ext mainly near the reveal/retrieve boundary, not uniformly throughout the whole pre-ready window
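For readers unfamiliar with the switch-related summary metrics listed above, the sketch below shows one plausible reading of them. These definitions are assumptions and may differ from the repository's metric code: a cross frame is taken as the first frame where a score reaches a threshold, a single switch is exactly one 0-to-1 transition with no 1-to-0 transition, and a reversion is any 1-to-0 transition. The 0.5 default matches the phase_score threshold stated earlier; thresholds for other scores are not given in this document.

```python
from typing import List, Optional, Sequence, Tuple


def first_cross_frame(frames: Sequence[int], scores: Sequence[float],
                      threshold: float = 0.5) -> Optional[int]:
    """First frame whose score reaches the threshold, or None if it never does."""
    for frame, score in zip(frames, scores):
        if score >= threshold:
            return frame
    return None


def count_transitions(labels: Sequence[int]) -> Tuple[int, int]:
    """Return (number of 0->1 rises, number of 1->0 falls) in a binary series."""
    rises = sum(1 for a, b in zip(labels, labels[1:]) if a == 0 and b == 1)
    falls = sum(1 for a, b in zip(labels, labels[1:]) if a == 1 and b == 0)
    return rises, falls


def is_single_switch(labels: Sequence[int]) -> bool:
    rises, falls = count_transitions(labels)
    return rises == 1 and falls == 0


def has_reversion(labels: Sequence[int]) -> bool:
    return count_transitions(labels)[1] > 0


def rate(flags: List[bool]) -> float:
    """Fraction of episodes for which a per-episode flag is True."""
    return sum(flags) / len(flags) if flags else 0.0
```

Under these definitions, a single_switch_rate of 1.0 together with a reversion_rate of 0.0 means every evaluated episode switched on once and never switched back.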
See:
- artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv
- artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv
- artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png
- artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv
- artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json
- artifacts/results/manual_metric_checks/episode0_frame210_visibility.png
- artifacts/results/manual_metric_checks/episode0_frame232_visibility.png
- artifacts/results/manual_metric_checks/episode0_frame210_path.png
- artifacts/results/manual_metric_checks/episode6_frame230_path.png
- artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json
Artifact Guide
Current artifacts
- oven_episode0_iter3_templates/
  - First regenerated template bundle after the mask/approach fixes.
- oven_episode0_iter4_templates/
  - Current template bundle with the corrected late-window approach onset.
- oven_episode0_iter4_clean/
  - Isolated targeted frame checks used while diagnosing the old per-frame repair drift.
- oven_episode0_iter4_batch/
  - Current batched sequential repair validation.
- oven_episode0_iter4_dense_geom_170_234.csv
  - Dense sequential geometry and visibility sweep across the reveal-to-retrieve boundary.
Historical artifacts
- oven_episode0_repaired_v1/
  - Useful historical reference, but not the current authoritative artifact.
  - It still contains the old late transition and old visibility/path issues.
- oven_episode0_full*/, oven_to240_*/, oven_episode0_independent_v*/
  - Debugging history from earlier iterations.
- parallel_smoke_2x10/
  - Xvfb/worker parallelization smoke test.
- oven_smoke_*
  - Early smoke runs.
Repository Map
Relevant entry points:
- code/rr_label_study/oven_study.py
- code/scripts/run_oven_label_study.py
- code/scripts/launch_parallel_oven_label_study.py
- code/scripts/run_oven_single_frame.py
- code/scripts/run_oven_frame_batch.py
- code/scripts/repair_oven_episode_dense.py
- code/scripts/render_oven_metric_frame.py
- code/scripts/render_oven_metric_gifs.py
Relevant current artifacts:
- artifacts/results/oven_episode0_iter4_templates/templates.json
- artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv
- artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv
- artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv
- artifacts/results/oven_episode0_iter6_independent_full/summary.json
- artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png
- artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png
- artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv
- artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json
- artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif
- artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif
- artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif
Environment
This was run on:
- Ubuntu 22.04.5
- Kernel 6.8.0-65-generic
- 96 CPU cores visible
- 503 GiB RAM visible
- NVIDIA A40
See:
- environment/system_info.txt
- environment/repo_revisions.txt
- environment/conda_env_rlbench.yml
- environment/pip_freeze_rlbench.txt
Upstream Repos Used
Exact revisions are recorded in environment/repo_revisions.txt.
The local run used:
- markusgrotz/RLBench
- markusgrotz/PyRep
- markusgrotz/peract_bimanual
- markusgrotz/YARR
Those source snapshots are included under external/.
Reproducing On The Same Hardware Class
- Read environment/dataset_notes.txt.
- Run environment/setup_same_hardware.sh /workspace.
- Source environment/activate_rlbench_runtime.sh /workspace.
- Run the dense study:
python /workspace/VLAdaptorBench_upload/code/scripts/run_oven_label_study.py \
--dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
--result-dir /workspace/tmp_run \
--max-episodes 1 \
--checkpoint-stride 16 \
--template-episode-index 0 \
--independent-replay
- If you want to repair suspicious frames in parallel with the new batched path:
python /workspace/VLAdaptorBench_upload/code/scripts/repair_oven_episode_dense.py \
--dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
--episode-dir /workspace/data/bimanual_take_tray_out_of_oven_train_128/all_variations/episodes/episode0 \
--input-dense-csv /workspace/tmp_run/episode0.dense.csv \
--output-dir /workspace/tmp_run_repaired \
--checkpoint-stride 16 \
--num-workers 4 \
--base-display 170
Important Note
The full 100-episode independent-replay run is not yet the authoritative artifact in this upload. The current repository state documents the repaired metric code, the exact snapshot/restore fixes, and the episode-0 independent validation that is required before scaling to the full study.
Dataset Note
The RLBench demonstration dataset itself is not re-uploaded here. This repository contains the study code and generated artifacts only. The expected dataset path is documented in environment/dataset_notes.txt.
CoppeliaSim binaries are not included. The setup helpers expect a local extraction at /workspace/coppelia_sim.