LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 30, 2024 • 51
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 89
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Paper • 2501.11733 • Published Jan 20 • 29
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Paper • 2503.12533 • Published Mar 16 • 69
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks Paper • 2503.21696 • Published Mar 27 • 23
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning Paper • 2503.21620 • Published Mar 27 • 63
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions Paper • 2505.06111 • Published May 9 • 25
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Paper • 2505.15517 • Published May 21 • 4
Interactive Post-Training for Vision-Language-Action Models Paper • 2505.17016 • Published May 22 • 6
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction Paper • 2505.10887 • Published May 16 • 10
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 27 • 108
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection Paper • 2505.20289 • Published May 26 • 10
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Paper • 2505.23747 • Published May 29 • 68
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents Paper • 2505.24878 • Published May 30 • 23
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published Jun 2 • 126
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks Paper • 2506.00411 • Published May 31 • 31
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments Paper • 2506.02387 • Published Jun 3 • 57
Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework Paper • 2506.02454 • Published Jun 3 • 5
SAFE: Multitask Failure Detection for Vision-Language-Action Models Paper • 2506.09937 • Published Jun 11 • 9
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts Paper • 2506.10357 • Published Jun 12 • 22
VideoDeepResearch: Long Video Understanding With Agentic Tool Using Paper • 2506.10821 • Published Jun 12 • 20
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Paper • 2506.07961 • Published Jun 9 • 12
EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models Paper • 2506.10100 • Published Jun 11 • 10
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models Paper • 2506.09930 • Published Jun 11 • 8
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective Paper • 2507.01925 • Published Jul 2 • 35
PresentAgent: Multimodal Agent for Presentation Video Generation Paper • 2507.04036 • Published Jul 5 • 10
A Survey on Vision-Language-Action Models for Autonomous Driving Paper • 2506.24044 • Published Jun 30 • 14
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning Paper • 2507.16815 • Published Jul 22 • 35
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents Paper • 2507.22827 • Published Jul 30 • 88
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models Paper • 2507.23682 • Published Jul 31 • 22
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning Paper • 2503.15558 • Published Mar 18 • 51
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation Paper • 2507.17520 • Published Jul 23 • 12
RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems Paper • 2508.01415 • Published Aug 2 • 6