Submitted by Hennara 124 Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR Misraj Ai 9
Submitted by taesiri 48 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe · 34 authors 22.1k 4
Submitted by Silin-Chen 36 SWE-QA: Can Language Models Answer Repository-level Code Questions? · 6 authors 28 2
Submitted by Two-hot 28 How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective · 18 authors 13 2
Submitted by lhmd 23 VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction · 10 authors 117 4
Submitted by taesiri 22 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation · 13 authors 536 4
Submitted by Yunzhen 22 What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT · 5 authors 2
Submitted by taesiri 22 Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation · 7 authors 2
Submitted by jbarrow 18 CommonForms: A Large, Diverse Dataset for Form Field Detection · 1 author 823 2
Submitted by ZipW 7 HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis · 2 authors 58 2
Submitted by MinhDucBui 7 Large Language Models Discriminate Against Speakers of German Dialects · 5 authors 2
Submitted by ultra7chen 6 CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching · 10 authors 2
Submitted by emilia-wisnios 3 OpenGVL - Benchmarking Visual Temporal Progress for Data Curation · 6 authors 2
Submitted by conan1024hao 2 VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction · 14 authors 2 2
Submitted by Fictionary 2 GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction · 7 authors 123 2
Submitted by spapi 2 Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation · 4 authors 2
Submitted by taesiri 1 Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications · 7 authors 2
Submitted by jesbu1 1 PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies · 9 authors 3 2
Submitted by abhilekhborah - DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture · 9 authors 1 2