End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Abstract
Resampling Forcing is introduced as a teacher-free framework to train autoregressive video diffusion models with improved temporal consistency using self-resampling and history routing.
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
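The abstract names two mechanisms concrete enough for a short illustration: the self-resampling scheme that degrades history frames with the model's own sampler, and history routing, a parameter-free top-k selection of history frames per query. The PyTorch snippet below is a minimal sketch under assumed interfaces and schedules (`denoise_step`, the intermediate noise level, frame-pooled relevance scores); it is not the paper's implementation.

```python
# Hedged sketch of two mechanisms named in the abstract. All function and
# argument names here are illustrative assumptions, not the paper's API.
import torch


@torch.no_grad()
def self_resample(denoise_step, clean_history, num_steps=4, t0=0.8):
    """Simulate inference-time errors on history frames during training.

    Instead of conditioning on ground-truth history, re-noise it to an
    assumed intermediate level t0 and denoise it back with the current
    model, so the training-time conditioning resembles inference output.

    denoise_step:  callable (x, t) -> x, one denoising update (assumed)
    clean_history: (B, F, C, H, W) ground-truth history frames (shape assumed)
    """
    x = (1 - t0) * clean_history + t0 * torch.randn_like(clean_history)
    for t in torch.linspace(t0, 0.0, num_steps + 1)[:-1]:
        x = denoise_step(x, t)
    return x  # degraded histories used as conditioning


def route_history(q, history_k, history_v, top_k=4):
    """Parameter-free top-k "history routing" over cached history frames.

    q:         (B, Tq, D)     query tokens of the frame being denoised
    history_k: (B, F, Tk, D)  keys of F cached history frames
    history_v: (B, F, Tk, D)  values of F cached history frames
    """
    B, F, Tk, D = history_k.shape
    top_k = min(top_k, F)
    # Relevance score per history frame: dot-product between the pooled
    # query and each frame's pooled key (one simple parameter-free choice).
    scores = torch.einsum("bd,bfd->bf", q.mean(dim=1), history_k.mean(dim=2))
    idx = scores.topk(top_k, dim=-1).indices                 # (B, top_k)
    idx = idx[:, :, None, None].expand(-1, -1, Tk, D)
    k_sel = history_k.gather(1, idx).reshape(B, top_k * Tk, D)
    v_sel = history_v.gather(1, idx).reshape(B, top_k * Tk, D)
    # Standard attention restricted to the routed history tokens.
    attn = torch.softmax(q @ k_sel.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ v_sel
```

Pooling keys per frame keeps the routing score parameter-free and cheap relative to attending over all history tokens; only the top-k selected frames participate in the actual attention.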
Community
Hi!
The method is proposed to overcome limitations of Self-Forcing (reliance on a teacher model, GAN loss, etc.). Why does it still need a warmup that adopts the Self-Forcing objective? What do the results look like without this warmup?
The abstract claims the framework enables training AR models from scratch. Are there any results without pretrained weights?