# Multi-Task DiT Policy – Block Tower (Config Fix)
Diffusion Transformer (DiT) policy trained on villekuosmanen/build_block_tower plus DAgger rounds 1.0.0–1.4.0 for robotic block stacking. The training configuration (batch size, learning rate, image resize, horizon) follows the repo author's recommendations from his own runs.
## Training Details
| Parameter | Value |
|---|---|
| Architecture | DiT with CLIP ViT-B/16 vision encoder + CLIP text conditioning |
| Dataset | build_block_tower + 5 DAgger rounds |
| State/Action dim | 17D: joint_pos(7) + eef_xyz(3) + rot6d(6) + gripper(1) |
| Delta actions | All dims except 6D rotation (kept absolute) |
| Normalization | Ramen (q02/q98 percentile, per-timestep, per-dim, clipped [-1.5, 1.5]); 6D rotation exempt |
| Batch size | 80 per GPU, 320 global (4x GPUs) |
| Training steps | 40,000 of 50,000 planned (in progress) |
| Learning rate | 3e-4, cosine schedule, 500 warmup steps |
| Diffusion | DDIM, 100 train timesteps, 20 inference steps |
| Horizon | 32 |
| Action steps | 32 |
| Obs steps | 2 |
| Vision resize | 224x224, no crop |
| Mixed precision | AMP |
| Optimizer | Adam, grad clip 1.0 |
| Hardware | 1 node, 4x NVIDIA GH200 (Isambard-AI AIP2) |
| Training time | ~42h across 2 runs (24h + 18h resume) |
| Final loss | ~0.008–0.012 |
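For reference, here is a minimal NumPy sketch of the Ramen normalization described in the table: scale each dim to roughly [-1, 1] via its q02/q98 range, then clip to [-1.5, 1.5], with the 6D rotation passed through untouched. The per-timestep stats layout is an assumption (not the repo's exact `ramen_stats.pt` format); the rot6d slice follows the 17D ordering given above.

```python
import numpy as np

# Assumed stats layout: per-timestep, per-dim 2nd/98th percentiles over the
# training set, each of shape (horizon, action_dim), e.g. from ramen_stats.pt.
ROT6D = slice(10, 16)  # rot6d block, per the 17D ordering in the table

def ramen_normalize(actions: np.ndarray, q02: np.ndarray, q98: np.ndarray) -> np.ndarray:
    """Map each dim to ~[-1, 1] via its q02/q98 range, then clip to [-1.5, 1.5]."""
    scaled = 2.0 * (actions - q02) / (q98 - q02 + 1e-8) - 1.0
    scaled = np.clip(scaled, -1.5, 1.5)
    # 6D rotation is exempt: it stays absolute and unnormalized.
    scaled[..., ROT6D] = actions[..., ROT6D]
    return scaled
```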
## Checkpoints
| Checkpoint | Steps | sha256 (model.safetensors) |
|---|---|---|
| checkpoint_40000 | 40k | 455f0f6f...1f7c29 |
Each checkpoint contains:
- `model.safetensors` – model weights (~1.3 GB)
- `config.json` – model configuration
- `ramen_stats.pt` – normalization statistics (required for inference)
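To verify a download against the table above, you can hash `model.safetensors` with Python's standard `hashlib`; the local path below is a placeholder.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weight files never sit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the sha256 listed in the checkpoints table.
print(sha256_of("checkpoint_40000/model.safetensors"))
```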
## Task
Stack a coloured wooden block on top of an existing block tower.
## W&B
## Usage
```python
from multitask_dit_policy.model import MultiTaskDiTPolicy

policy = MultiTaskDiTPolicy.load("pravsels/dit_block_tower_config_fix/checkpoint_40000")
```
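The inference API beyond `load` isn't documented here, so the following is a hypothetical call pattern only: the `predict` method name and observation keys are illustrative assumptions, while the shapes mirror the training table (obs_steps=2, 224x224 images, 17D state, horizon 32).

```python
import numpy as np

# Hypothetical observation dict; keys and predict() are assumptions, not the
# repo's confirmed API. Shapes follow the training configuration above.
obs = {
    "image": np.zeros((2, 224, 224, 3), dtype=np.uint8),  # obs_steps=2, resized to 224x224
    "state": np.zeros((2, 17), dtype=np.float32),         # 17D proprioceptive state
    "task": "stack the red block on the tower",           # CLIP text conditioning
}

actions = policy.predict(obs)  # expected: a (32, 17) action chunk (horizon=32)
```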