Collections including paper arxiv:2509.11986

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 82
- LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation
  Paper • 2509.05263 • Published • 10
- Symbolic Graphics Programming with Large Language Models
  Paper • 2509.05208 • Published • 45
- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
  Paper • 2509.12201 • Published • 103

- Large Language Models are Locally Linear Mappings
  Paper • 2505.24293 • Published • 14
- Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
  Paper • 2507.07104 • Published • 45
- KV Cache Steering for Inducing Reasoning in Small Language Models
  Paper • 2507.08799 • Published • 40
- A Survey of Reinforcement Learning for Large Reasoning Models
  Paper • 2509.08827 • Published • 183

- CoRAG: Collaborative Retrieval-Augmented Generation
  Paper • 2504.01883 • Published • 9
- SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
  Paper • 2504.08600 • Published • 31
- Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL
  Paper • 2503.23157 • Published • 10
- AI Agents: Evolution, Architecture, and Real-World Applications
  Paper • 2503.12687 • Published • 2

- Teaching Large Language Models to Reason with Reinforcement Learning
  Paper • 2403.04642 • Published • 50
- How Far Are We from Intelligent Visual Deductive Reasoning?
  Paper • 2403.04732 • Published • 23
- Common 7B Language Models Already Possess Strong Math Capabilities
  Paper • 2403.04706 • Published • 20
- DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
  Paper • 2405.14333 • Published • 41

- Vision language models are blind
  Paper • 2407.06581 • Published • 83
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
  Paper • 2504.13169 • Published • 39
- Lost in Embeddings: Information Loss in Vision-Language Models
  Paper • 2509.11986 • Published • 27

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 38
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 34
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 43
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
  Paper • 2508.20072 • Published • 29

- End-to-End Vision Tokenizer Tuning
  Paper • 2505.10562 • Published • 22
- Global and Local Entailment Learning for Natural World Imagery
  Paper • 2506.21476 • Published • 1
- DINOv3
  Paper • 2508.10104 • Published • 273
- Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
  Paper • 2509.01363 • Published • 57

- Analyzing The Language of Visual Tokens
  Paper • 2411.05001 • Published • 24
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
  Paper • 2411.14982 • Published • 19
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
  Paper • 2411.17686 • Published • 20
- On the Limitations of Vision-Language Models in Understanding Image Transforms
  Paper • 2503.09837 • Published • 10
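
For reference, a listing like the one above can also be retrieved programmatically with the `huggingface_hub` client. The sketch below is a minimal, unofficial example: the `papers/2509.11986` item identifier format is an assumption, and listed collections may only expose a truncated preview of their items.

```python
from huggingface_hub import list_collections

# Assumed item identifier format for papers ("papers/<arxiv-id>");
# adjust if the Hub API expects a different prefix.
PAPER_ITEM = "papers/2509.11986"

# List community collections that contain this paper. Each returned
# Collection object carries a title, a slug, and a (possibly truncated)
# preview of its items; use get_collection(slug) for the full item list.
for collection in list_collections(item=PAPER_ITEM, limit=20):
    print(f"{collection.title} | https://huggingface.co/collections/{collection.slug}")
```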