Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models Paper • 2504.15271 • Published Apr 21, 2025 • 66
Token-Efficient Long Video Understanding for Multimodal LLMs Paper • 2503.04130 • Published Mar 6, 2025 • 95
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion Paper • 2302.12251 • Published Feb 23, 2023
Prismer: A Vision-Language Model with An Ensemble of Experts Paper • 2303.02506 • Published Mar 4, 2023 • 2
SSCBench: Monocular 3D Semantic Scene Completion Benchmark in Street Views Paper • 2306.09001 • Published Jun 15, 2023
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Paper • 2105.15203 • Published May 31, 2021 • 1
Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers Paper • 2109.03814 • Published Sep 8, 2021
FB-BEV: BEV Representation from Forward-Backward View Transformations Paper • 2308.02236 • Published Aug 4, 2023
FocalFormer3D: Focusing on Hard Instance for 3D Object Detection Paper • 2308.04556 • Published Aug 8, 2023 • 9
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching Paper • 2402.14167 • Published Feb 21, 2024 • 12
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions Paper • 2205.13803 • Published May 27, 2022
LITA: Language Instructed Temporal-Localization Assistant Paper • 2403.19046 • Published Mar 27, 2024 • 20
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models Paper • 2209.07511 • Published Sep 15, 2022
X-VILA: Cross-Modality Alignment for Large Language Model Paper • 2405.19335 • Published May 29, 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28, 2024 • 88