Ming Chen

ChenMing-thu14

AI & ML interests

3D Human Pose Estimation

Recent Activity

upvoted a paper 6 days ago

In-Video Instructions: Visual Signals as Generative Control

Paper • 2511.19401 • Published 6 days ago • 28

upvoted 2 papers 7 days ago

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Paper • 2511.15705 • Published 11 days ago • 88

SAM 3: Segment Anything with Concepts

Paper • 2511.16719 • Published 10 days ago • 94

upvoted a paper 10 days ago

SAM 3D: 3Dfy Anything in Images

Paper • 2511.16624 • Published 10 days ago • 100

upvoted 2 papers 11 days ago

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Paper • 2511.14993 • Published 12 days ago • 218

MHR: Momentum Human Rig

Paper • 2511.15586 • Published 11 days ago • 13

upvoted a paper 20 days ago

Cambrian-S: Towards Spatial Supersensing in Video

Paper • 2511.04670 • Published 24 days ago • 35

upvoted a paper 27 days ago

The End of Manual Decoding: Towards Truly End-to-End Language Models

Paper • 2510.26697 • Published Oct 30 • 113

upvoted 6 papers about 1 month ago

Emu3.5: Native Multimodal Models are World Learners

Paper • 2510.26583 • Published Oct 30 • 104

LongCat-Video Technical Report

Paper • 2510.22200 • Published Oct 25 • 25

Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

Paper • 2510.20187 • Published Oct 23 • 18

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Paper • 2510.09608 • Published Oct 10 • 50

RL makes MLLMs see better than SFT

Paper • 2510.16333 • Published Oct 18 • 47

Glyph: Scaling Context Windows via Visual-Text Compression

Paper • 2510.17800 • Published Oct 20 • 67

authored a paper about 1 month ago

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Paper • 2509.09595 • Published Sep 11 • 48

upvoted 5 papers about 1 month ago

Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset

Paper • 2510.16258 • Published Oct 17 • 7

FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published Oct 20 • 67

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

Paper • 2510.12276 • Published Oct 14 • 144

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Paper • 2510.15870 • Published Oct 17 • 88

Latent Diffusion Model without Variational Autoencoder

Paper • 2510.15301 • Published Oct 17 • 48