VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting Paper • 2510.21817 • Published Oct 21, 2025 • 42
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? Paper • 2509.03516 • Published Sep 3, 2025 • 12
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension Paper • 2503.08689 • Published Mar 11, 2025 • 4
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension Paper • 2411.13093 • Published Nov 20, 2024 • 2
meta-llama/Meta-Llama-3-8B-Instruct Text Generation • 8B • Updated Jun 18, 2025 • 1.51M • 4.36k