-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections
Discover the best community collections!
Collections including paper arxiv:2504.13180
-
FocusedAD: Character-centric Movie Audio Description
Paper • 2504.12157 • Published • 9 -
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper • 2504.10465 • Published • 27 -
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Paper • 2504.13180 • Published • 17 -
OS-Copilot/OS-Atlas-Base-7B
Image-Text-to-Text • 8B • Updated • 1.63k • 40
-
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper • 2504.16064 • Published • 14 -
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Paper • 2504.14032 • Published • 4 -
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 159 -
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper • 2504.17192 • Published • 114
-
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Paper • 2504.09081 • Published • 17 -
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Paper • 2504.13180 • Published • 17 -
Sekai: A Video Dataset towards World Exploration
Paper • 2506.15675 • Published • 65 -
WorldVLA: Towards Autoregressive Action World Model
Paper • 2506.21539 • Published • 39
-
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper • 2409.02097 • Published • 35 -
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
Paper • 2409.11406 • Published • 28 -
Diffusion Models Are Real-Time Game Engines
Paper • 2408.14837 • Published • 127 -
Segment Anything with Multiple Modalities
Paper • 2408.09085 • Published • 23
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper • 2504.16064 • Published • 14 -
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Paper • 2504.14032 • Published • 4 -
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 159 -
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper • 2504.17192 • Published • 114
-
FocusedAD: Character-centric Movie Audio Description
Paper • 2504.12157 • Published • 9 -
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper • 2504.10465 • Published • 27 -
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Paper • 2504.13180 • Published • 17 -
OS-Copilot/OS-Atlas-Base-7B
Image-Text-to-Text • 8B • Updated • 1.63k • 40
-
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Paper • 2504.09081 • Published • 17 -
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Paper • 2504.13180 • Published • 17 -
Sekai: A Video Dataset towards World Exploration
Paper • 2506.15675 • Published • 65 -
WorldVLA: Towards Autoregressive Action World Model
Paper • 2506.21539 • Published • 39
-
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper • 2409.02097 • Published • 35 -
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
Paper • 2409.11406 • Published • 28 -
Diffusion Models Are Real-Time Game Engines
Paper • 2408.14837 • Published • 127 -
Segment Anything with Multiple Modalities
Paper • 2408.09085 • Published • 23