Reconstruction Alignment Improves Unified Multimodal Models Paper • 2509.07295 • Published Sep 8 • 39
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions Paper • 2509.06951 • Published Sep 8 • 31
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward Paper • 2509.06818 • Published Sep 8 • 29
Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling Paper • 2509.01624 • Published Sep 1 • 7
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference Paper • 2509.06942 • Published Sep 8 • 16
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation Paper • 2509.15185 • Published about 1 month ago • 28
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis Paper • 2509.10441 • Published Sep 12 • 30
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning Paper • 2509.08519 • Published Sep 10 • 125
MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement Paper • 2509.01977 • Published Sep 2 • 12
GenCompositor: Generative Video Compositing with Diffusion Transformer Paper • 2509.02460 • Published Sep 2 • 25
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning Paper • 2508.20751 • Published Aug 28 • 89
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published 30 days ago • 52
Lynx: Towards High-Fidelity Personalized Video Generation Paper • 2509.15496 • Published about 1 month ago • 12
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models Paper • 2509.17627 • Published 27 days ago • 64
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation Paper • 2509.19244 • Published 26 days ago • 11
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation Paper • 2509.18824 • Published 26 days ago • 21
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation Paper • 2510.05094 • Published 13 days ago • 34
Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs Paper • 2509.25771 • Published 19 days ago • 10
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation Paper • 2510.01284 • Published 19 days ago • 30
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation Paper • 2510.02283 • Published 17 days ago • 88