Multi-Task DiT Policy – Block Tower (Config Fix)

Diffusion Transformer (DiT) policy for robotic block stacking, trained on villekuosmanen/build_block_tower plus DAgger rounds 1.0.0–1.4.0. The training config (batch size, learning rate, resize, horizon) follows the repo author's recommendations, based on what worked in his runs.

Training Details

| Parameter | Value |
| --- | --- |
| Architecture | DiT with CLIP ViT-B/16 vision encoder + CLIP text conditioning |
| Dataset | build_block_tower + 5 DAgger rounds |
| State/Action dim | 17D – joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1) |
| Delta actions | All dims except 6D rotation (absolute) |
| Normalization | Ramen (q02/q98 percentile, per-timestep, per-dim, clipped to [-1.5, 1.5]); 6D rotation exempt |
| Batch size | 80 per GPU, 320 global (4 GPUs) |
| Training steps | 40,000 of 50,000 (in progress) |
| Learning rate | 3e-4, cosine schedule, 500 warmup steps |
| Diffusion | DDIM, 100 train timesteps, 20 inference steps |
| Horizon | 32 |
| Action steps | 32 |
| Obs steps | 2 |
| Vision resize | 224x224, no crop |
| Mixed precision | AMP |
| Optimizer | Adam, grad clip 1.0 |
| Hardware | 1 node, 4x NVIDIA GH200 (Isambard-AI AIP2) |
| Training time | ~42h across 2 runs (24h + 18h resume) |
| Final loss | ~0.008–0.012 |
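
To make the action-space conventions above concrete, the sketch below reimplements the delta-action and Ramen normalization steps as described in the table. It is a minimal, hypothetical reconstruction: the function names, the dimension indices (derived from the 17D layout above), and the epsilon guard are illustrative assumptions, not the repo's actual code.

```python
import numpy as np

# 17D layout: joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1)
ROT6D = slice(10, 16)          # kept absolute and unnormalized
NORM = np.r_[0:10, 16]         # dims that are delta'd and normalized

def to_delta(actions, state):
    """actions: (T, 17) chunk, state: (17,) current robot state.
    All dims become deltas from the current state except rot6d."""
    out = actions.copy()
    out[:, NORM] -= state[NORM]
    return out

def fit_stats(delta_actions):
    """delta_actions: (N, T, 17). Per-timestep, per-dim q02/q98 percentiles."""
    return {
        "q02": np.percentile(delta_actions, 2, axis=0),   # (T, 17)
        "q98": np.percentile(delta_actions, 98, axis=0),  # (T, 17)
    }

def normalize(delta_actions, stats, clip=1.5):
    """Map the q02..q98 range to [-1, 1] and clip to [-1.5, 1.5]; rot6d passes through."""
    lo, hi = stats["q02"], stats["q98"]
    out = 2.0 * (delta_actions - lo) / (hi - lo + 1e-8) - 1.0
    out = np.clip(out, -clip, clip)
    out[..., ROT6D] = delta_actions[..., ROT6D]   # rotation exempt
    return out
```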

Checkpoints

| Checkpoint | Steps | sha256 (model.safetensors) |
| --- | --- | --- |
| checkpoint_40000 | 40k | 455f0f6f...1f7c29 |

Each checkpoint contains:

  • model.safetensors β€” model weights (~1.3GB)
  • config.json β€” model configuration
  • ramen_stats.pt β€” normalization statistics (required for inference)

Task

Stack a coloured wooden block on top of an existing block tower.

Usage

```python
from multitask_dit_policy.model import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.load("pravsels/dit_block_tower_config_fix/checkpoint_40000")
```
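
The loaded policy can then be rolled out on a robot. The loop below is a minimal sketch only: `env`, the observation keys, and the `policy.predict` method are illustrative assumptions, not the package's documented API. It mirrors the config above: 224x224 resize with no crop, 2 observation steps, 32-step action chunks, 20 DDIM inference steps.

```python
import cv2

MAX_STEPS = 300                      # arbitrary episode cap for this sketch

def preprocess(rgb):
    # resize to 224x224 with no crop, matching the training config
    return cv2.resize(rgb, (224, 224))

obs = env.reset()                    # `env` is a stand-in for your robot interface
history = []
for _ in range(MAX_STEPS):
    history.append({"image": preprocess(obs["image"]), "state": obs["state"]})
    history = history[-2:]           # the policy conditions on 2 observation steps
    # hypothetical call: denoises with 20 DDIM steps, returns a (32, 17) action chunk
    chunk = policy.predict(history, task="stack the block on the tower")
    for action in chunk:             # execute the full 32-step chunk before re-planning
        obs = env.step(action)
```

Since action steps equal the horizon (32), this sketch queries the policy once per chunk rather than re-planning every step.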