Small Vision-Language Models are Smart Compressors for Long Video Understanding Paper • 2604.08120 • Published 4 days ago • 15
Caption Anything: Interactive Image Description with Diverse Multimodal Controls Paper • 2305.02677 • Published May 4, 2023 • 1
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents Paper • 2411.16740 • Published Nov 23, 2024 • 2
WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation Paper • 2503.19065 • Published Mar 24, 2025 • 11
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning Paper • 2307.16525 • Published Jul 31, 2023 • 1