- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872, published Aug 16, 2024)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (arXiv:2408.12590, published Aug 22, 2024)
- SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant (arXiv:2403.11299, published Mar 17, 2024)
- xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs (arXiv:2410.16267, published Oct 21, 2024)
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models (arXiv:2411.15024, published Nov 22, 2024)
- Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding (arXiv:2502.11492, published Feb 17, 2025)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (arXiv:2505.09568, published May 14, 2025)
- HoliTom: Holistic Token Merging for Fast Video Large Language Models (arXiv:2505.21334, published May 27, 2025)
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents (arXiv:2507.04590, published Jul 7, 2025)
- When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios (arXiv:2507.20198, published Jul 27, 2025)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG (arXiv:2510.03663, published Oct 2025)
- Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models (arXiv:2503.16257, published Mar 20, 2025)