foundation model
• DreamLLM: Synergistic Multimodal Comprehension and Creation (arXiv:2309.11499)
• An Introduction to Vision-Language Modeling (arXiv:2405.17247)
• Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
• No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding (arXiv:2405.08344)
• KAN or MLP: A Fairer Comparison (arXiv:2407.16674)
• OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces (arXiv:2407.11895)
• VILA^2: VILA Augmented VILA (arXiv:2407.17453)
• Improving 2D Feature Representations by 3D-Aware Fine-Tuning (arXiv:2407.20229)
• POA: Pre-training Once for Models of All Sizes (arXiv:2408.01031)
• Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning (arXiv:2408.00690)
• LongVILA: Scaling Long-Context Visual Language Models for Long Videos (arXiv:2408.10188)
• Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv:2408.11039)
• Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (arXiv:2408.12528)
• TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models (arXiv:2408.11318)