arxiv:2604.25819

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Published on Apr 28 · Submitted by zhou on Apr 29

Abstract

AI-generated summary: Mutual Forcing enables efficient autoregressive audio-video generation through a unified model that combines few-step and multi-step training modes with shared parameters for improved consistency and reduced overhead.

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates the historical context used during training, improving training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than from a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
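To make the weight-sharing mechanism concrete, the sketch below walks through one hypothetical Mutual Forcing training step in PyTorch. It is a minimal illustration based only on the abstract: the toy Denoiser, the sample() helper, the 4/50 step counts, and the exact loss terms are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of one Mutual Forcing training step. The toy Denoiser,
# sample(), step counts, and losses are illustrative assumptions only.
import torch
import torch.nn.functional as F

class Denoiser(torch.nn.Module):
    """Toy stand-in for the joint audio-video denoiser."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Linear(2 * dim + 1, dim)

    def forward(self, x, ctx, t):
        t_feat = torch.full_like(x[..., :1], t)  # scalar timestep feature
        return self.net(torch.cat([x, ctx, t_feat], dim=-1))

def sample(model, ctx, noise, num_steps):
    """Few-step and multi-step modes run the SAME weights; only num_steps differs."""
    x = noise
    for i in reversed(range(num_steps)):
        x = model(x, ctx, (i + 1) / num_steps)
    return x

def mutual_forcing_step(model, ctx, real_chunk, opt):
    # 1) The few-step mode rolls out the historical context, so training
    #    conditions on the same kind of history inference will produce.
    with torch.no_grad():
        history = sample(model, ctx, torch.randn_like(ctx), num_steps=4)

    # 2) Few-step (student) prediction from the self-generated history.
    noise = torch.randn_like(real_chunk)
    student = sample(model, history, noise, num_steps=4)

    # 3) The multi-step mode, with the same parameters, acts as the teacher;
    #    it is detached so only the few-step path receives gradients
    #    (self-distillation).
    with torch.no_grad():
        teacher = sample(model, history, noise, num_steps=50)

    # 4) Distill the multi-step output into the few-step mode, plus a term
    #    on real paired data -- no separate bidirectional teacher model.
    loss = F.mse_loss(student, teacher) + F.mse_loss(student, real_chunk)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage (shapes are illustrative):
model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ctx, chunk = torch.randn(8, 64), torch.randn(8, 64)
print(mutual_forcing_step(model, ctx, chunk, opt))
```

The design point the sketch tries to capture is that teacher and student are the same parameters called with different step counts, so the self-distillation gradient and the self-generated training context both flow into one model.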

Community

Recent fast streaming generation methods often rely on a complicated pipeline, typically starting from a non-causal bidirectional diffusion model and requiring additional steps such as ODE initialization and DMD distillation. Mutual Forcing explores a simpler alternative: directly training an autoregressive model and using self-distillation for acceleration. This design removes unnecessary complexity and leads to a more efficient training recipe.

Why Mutual Forcing?

  • Teacher-free training
  • Direct learning from real data
  • Flexible training sequence lengths
  • Few-step and multi-step generation in one model (a streaming sketch follows this list)
  • Self-distillation through shared weights
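
As a companion to the training sketch above (and reusing its hypothetical Denoiser and sample()), here is what few-step streaming inference could look like: each audio-video chunk is produced in 4 steps and immediately becomes the history that conditions the next chunk, which is where a 4 to 8 step budget wins on latency against roughly 50-step baselines.

```python
# Hypothetical few-step streaming loop (reuses Denoiser/sample from the
# training sketch above; chunking and conditioning are illustrative only).
import torch

@torch.no_grad()
def stream_generate(model, first_ctx, num_chunks, few_steps=4):
    history, chunks = first_ctx, []
    for _ in range(num_chunks):
        chunk = sample(model, history, torch.randn_like(history), few_steps)
        chunks.append(chunk)
        history = chunk  # the new chunk conditions the next one
    return torch.stack(chunks)

# Usage: stream 10 chunks with the few-step mode.
video = stream_generate(Denoiser(), torch.randn(8, 64), num_chunks=10)
```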

Audio-Video Joint Generation Demo


[Teaser video: mutual_forcing_teaser]

Get this paper in your agent:

hf papers read 2604.25819
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
