ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Abstract
ImagerySearch, a prompt-guided adaptive test-time search strategy, enhances video generation in imaginative scenarios by dynamically adjusting search spaces and reward functions, outperforming existing methods on a new benchmark, LDT-Bench.
Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. Such prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside the training distribution. Existing methods typically apply test-time scaling to improve video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and the reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
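The abstract's core idea, widening the candidate search space and reweighting the reward as the semantic distance between prompt concepts grows, can be illustrated with a minimal sketch. The paper's actual algorithm, reward models, and distance metric are not given here, so every function below (`semantic_distance`, the `fidelity`/`coherence` reward split, the candidate-count schedule) is a hypothetical stand-in for illustration only.

```python
def semantic_distance(concept_a, concept_b):
    # Hypothetical stand-in: a real system would use embedding
    # similarity; here we score by word-overlap (Jaccard) distance.
    a, b = set(concept_a.split()), set(concept_b.split())
    return 1.0 - len(a & b) / len(a | b)

def adaptive_search(prompt_concepts, generate, reward,
                    base_candidates=4, max_candidates=16):
    """Best-of-N test-time search whose budget and reward
    weighting both adapt to the prompt's semantic distance."""
    d = semantic_distance(*prompt_concepts)
    # Distant concept pairs get a wider search space (more samples).
    n = min(max_candidates,
            max(base_candidates, round(base_candidates * (1 + 3 * d))))
    candidates = [generate() for _ in range(n)]
    # Reward weighting also shifts with distance: distant pairs
    # emphasize cross-concept coherence over per-frame fidelity.
    def score(c):
        return (1 - d) * reward["fidelity"](c) + d * reward["coherence"](c)
    return max(candidates, key=score)

# Toy usage: candidates are just integers standing in for videos.
counter = iter(range(100))
picked = adaptive_search(
    ("glass elephant", "burning ocean"),          # no shared words: d = 1.0
    generate=lambda: next(counter),
    reward={"fidelity": lambda c: c, "coherence": lambda c: -c},
)
```

With fully disjoint concepts the distance saturates at 1.0, so the sketch draws the maximum 16 candidates and scores them purely by the coherence term, mirroring the paper's claim that both the search space and the reward adapt per prompt.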
Community
The following papers were recommended by the Semantic Scholar API
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation (2025)
- GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation (2025)
- Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation (2025)
- SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation (2025)
- UniVid: The Open-Source Unified Video Model (2025)
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration (2025)
- World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge (2025)