Building on HF

merve PRO

merve

https://github.com/merveenoyan/smol-vision

mervenoyann
merveenoyan
merve.bsky.social

AI & ML interests

I love this website. VLMs, vision & co.

Recent Activity

updated a dataset 3 days ago

merve/agent_traces

published a dataset 3 days ago

merve/agent_traces

liked a dataset 4 days ago

badlogicgames/pi-mono

merve's collections 90

Apr 3 Releases

netflix/void-model

Video-to-Video • Updated 8 days ago • 798
arcee-ai/Trinity-Large-Thinking

Text Generation • 399B • Updated 5 days ago • 15.1k • • 152
KRAFTON/Raon-VisionEncoder

Feature Extraction • Updated 13 days ago • 502 • 18
KRAFTON/Raon-SpeechChat-9B

Audio-to-Audio • 10B • Updated about 21 hours ago • 760 • 27

Multimodal tool calling datasets

AgoraX/OpenImage-FNCall-50k

Viewer • Updated Feb 14, 2024 • 53.3k • 49 • 3
ScaleAI/VisualToolBench

Viewer • Updated Dec 16, 2025 • 1.2k • 527 • 4
internlm/ARM-Thinker-Data

Preview • Updated Feb 13 • 56 • 7

Jan 19 Releases

Nemotron ColEmbed V2

Collection

State-of-the-Art Late Interaction Vision-Language Embedding Models • 3 items • Updated 7 days ago • 12
Qwen/Qwen3-TTS-12Hz-1.7B-Base

Updated Jan 23 • 1.4M • 368
fal/flux-2-klein-4B-outpaint-lora

Image-to-Image • Updated Jan 21 • • 67
Qwen/Qwen3-TTS-Tokenizer-12Hz

Audio-to-Audio • Updated Jan 29 • 62.1k • 57

YOLO26 Models

YOLO26 models: detection, segmentation, classification, pose, and OBB variants with demos and ONNX variants.

Running

26

YOLO26

💙

26

Process images with advanced object detection and segmentation
Running

Featured

65

YOLO26 WebGPU

🏆

65

Real-time object detection & pose estimation in your browser
onnx-community/yolo26x-ONNX

Updated Jan 18 • 27 • 5
openvision/yoloe26-n-seg

Zero-Shot Object Detection • Updated Jan 15 • 27 • 2

Dec 30 Releases

Wuli-art/Qwen-Image-2512-Turbo-LoRA

Text-to-Image • Updated Jan 8 • 9.92k • 213
miromind-ai/MiroThinker-v1.5-235B

Text Generation • 235B • Updated 25 days ago • 122 • 254
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover

Image-to-Image • Updated Jan 4 • 8.71k • • 56
tencent/Youtu-LLM-2B-Base

Text Generation • Updated Feb 24 • 3.44k • 42

Dec 12 Releases

openai/circuit-sparsity

Text Generation • 0.4B • Updated Dec 12, 2025 • 1.36k • 204
FunAudioLLM/Fun-CosyVoice3-0.5B-2512

Text-to-Speech • Updated Feb 3 • 73.7k • 511
DiffSynth-Studio/Qwen-Image-i2L

Updated Dec 16, 2025 • 255
Aratako/T5Gemma-TTS-2b-2b

Text-to-Speech • 5B • Updated 11 days ago • 590 • 114

SAM3

facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 1.94M • 1.86k
Running on Zero

Featured

112

SAM3 Video Segmentation

🐠

112

Track and label objects in videos using text prompts or clicks
onnx-community/sam3-tracker-ONNX

Mask Generation • Updated Nov 19, 2025 • 1.19k • 30
Running

30

SAM3 Tracker WebGPU

🎯

30

Segment images with click points and download cutouts

Oct 6 Releases

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 36 • 157
LiquidAI/LFM2-8B-A1B

Text Generation • 8B • Updated 15 days ago • 49.8k • 351
yanolja/YanoljaNEXT-Rosetta-12B-2510

Translation • 12B • Updated Nov 2, 2025 • 115 • 29
NeuML/colbert-muvera-femto

Sentence Similarity • 243k • Updated Dec 12, 2025 • 4 • 20

Sep 23 Releases

ByteDance/lynx

Image-to-Video • Updated Sep 27, 2025 • • 138
tencent/HunyuanImage-3.0

Text-to-Image • Updated Jan 28 • 18.4k • • 662
meituan-longcat/LongCat-Flash-Thinking

Text Generation • Updated Sep 24, 2025 • 137 • 146
Qwen/Qwen3Guard-Gen-4B

Text Generation • 4B • Updated Nov 7, 2025 • 4.65k • 40

Sep 11 Releases

bytedance-research/HuMo

Image-to-Video • Updated Sep 18, 2025 • 85 • 216
facebook/MobileLLM-R1-950M

Text Generation • 0.9B • Updated Sep 30, 2025 • 521 • 283
tencent/POINTS-Reader

Image-Text-to-Text • 4B • Updated Sep 12, 2025 • 185 • 102
baidu/ERNIE-4.5-21B-A3B-Thinking

Text Generation • 22B • Updated Nov 26, 2025 • 712 • 777

August 29 Releases

microsoft/VibeVoice-1.5B

Text-to-Speech • 3B • Updated Jan 22 • 95.9k • 2.32k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview

Image-Text-to-Text • 0.4B • Updated Aug 29, 2025 • 46.5k • 82
apple/FastVLM-1.5B

Text Generation • 2B • Updated Sep 3, 2025 • 2.05k • 80
stepfun-ai/Step-Audio-2-mini

Any-to-Any • Updated Feb 14 • 2k • 254

Releases August 9

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26, 2025 • 3.47M • • 4.68k
openai/gpt-oss-20b

Text Generation • 22B • Updated Aug 26, 2025 • 6.01M • • 4.53k
openai/BrowseCompLongContext

Viewer • Updated Aug 9, 2025 • 295 • 657 • 50
baichuan-inc/Baichuan-M2-32B

Text Generation • 33B • Updated Dec 24, 2025 • 92.3k • 120

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated Aug 7, 2025 • 8.72k • • 678
allenai/olmOCR-7B-0725

Image-Text-to-Text • 8B • Updated Aug 26, 2025 • 511 • 64
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated Aug 7, 2025 • 16.4k • • 451
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • Updated Aug 17, 2025 • 79.6k • • 404

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated Sep 10, 2025 • 1.08M • 930
moonshotai/Kimi-K2-Instruct

Text Generation • 1T • Updated Jan 30 • 175k • • 2.34k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated Jul 7, 2025 • 105 • • 53
Alibaba-NLP/WebSailor-3B

3B • Updated Jul 10, 2025 • 19 • 74

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3, 2025 • 19.4k • 129
google/gemma-3n-E4B-it

Image-Text-to-Text • Updated Jul 14, 2025 • 39k • • 899
ByteDance/XVerse

Text-to-Image • Updated Jul 1, 2025 • 34 • 89
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • Updated Feb 4 • 310 • 74

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated 4 days ago • 1.65k • 14.4k • 80
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 24.9k • 1.59k
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 302 • 515
Running on Zero

MCP

Featured

142

Multimodal OCR2

💻

142

FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Jun 9, 2025 • 746k • 124
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 302 • 515
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Oct 24, 2025 • 1.43k • 283
arcee-ai/Homunculus

Text Generation • Updated Jun 3, 2025 • 21 • 99

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 11.1k • 1.19k
mistralai/Devstral-Small-2505

24B • Updated Aug 18, 2025 • 95.1k • 866
ByteDance/Dolphin

Image-Text-to-Text • Updated Jul 16, 2025 • 450 • 515
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22, 2025 • 11.5k • 68

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6, 2025 • 191
stepfun-ai/Step1X-3D

Updated May 13, 2025 • 106
cognition-ai/Kevin-32B

33B • Updated May 6, 2025 • 131 • 164
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • Updated Nov 10, 2025 • 536 • 126

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30, 2025 • 3 • 31
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • Updated about 17 hours ago • 168k • 1.46k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • Updated Apr 30, 2025 • 1.31k • • 825
Qwen/Qwen2.5-Omni-3B

Any-to-Any • Updated Apr 30, 2025 • 446k • 332

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16, 2025 • 136 • 46 • 12
Junfeng5/Liquid_V1_7B

Any-to-Any • Updated Mar 20, 2025 • 2.41k • 94

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jan 30 • 104k • 447
agentica-org/DeepCoder-14B-Preview

Text Generation • Updated May 11, 2025 • 412 • • 680
HiDream-ai/HiDream-I1-Full

Text-to-Image • Updated Jul 17, 2025 • 23.1k • • 989
OpenGVLab/InternVL3-78B

Image-Text-to-Text • Updated Sep 11, 2025 • 40.2k • 234

March 21 Releases

docling-project/SmolDocling-256M-preview

Image-Text-to-Text • Updated Sep 17, 2025 • 51.9k • 1.61k
sesame/csm-1b

Text-to-Speech • Updated Dec 1, 2025 • 154k • 2.36k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

Updated Dec 22, 2025 • 532k • 1.36k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated Oct 17, 2025 • 14.7k • 129

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 13.9k • 89
AIDC-AI/Ovis2-34B

Image-Text-to-Text • 35B • Updated Aug 15, 2025 • 301 • 142
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28, 2025 • 35 • • 54
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1, 2025 • 2.03M • 465

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • Updated Feb 10, 2025 • 2.11k • 111
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6, 2025 • 108k • • 606
mistralai/Mistral-Small-24B-Instruct-2501

Updated Jul 28, 2025 • 114k • 950
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 64k • 3.57k

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • Updated Jan 19, 2025 • 330 • 481
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17, 2025 • 148 • 73
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8, 2025 • 48.5k • 192
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27, 2025 • 3.51M • • 13.2k

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated Sep 23, 2025 • 2.67M • 1.4k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17, 2025 • 1.4k • 159
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 609 • 29
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11, 2025 • 18 • 56

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 29.6k • 583
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12, 2025 • 8.07k • • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Nov 26, 2025 • 434 • 157
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 61 • 55

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 72 • 61
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8, 2025 • 87 • 11
PleIAs/common_corpus

Viewer • Updated Feb 19 • 69.9k • 242k • 390
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12, 2025 • 1.13M • • 2k

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 74
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 48
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 121

LOTUS 🪷

Runtime error

Featured

101

Lotus Normal

🌍

101

Official Demo of Lotus (https://lotus3d.github.io/)
Runtime error

78

Lotus Depth

🚀

78

Official Demo of Lotus (https://lotus3d.github.io/)
jingheya/lotus-depth-g-v1-0

Depth Estimation • Updated Oct 5, 2024 • 9.74k • 27
jingheya/lotus-depth-d-v1-0

Depth Estimation • Updated Oct 5, 2024 • 420 • 5

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.33M • 105
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 94.8k • 532
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 45.4k • 11
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 2.99k • 22

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 87.6M • Updated Sep 12, 2023 • 190k • • 8
facebook/convnextv2-base-1k-224

Image Classification • 88.7M • Updated Feb 17, 2025 • 5.43k • • 4
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 6.24k • • 33
google/vit-base-patch32-384

Image Classification • 88.3M • Updated Sep 11, 2023 • 4.49k • • 23

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 408 • 27
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 3.75M • Updated Jan 14, 2024 • 568k • • 184
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 43M • Updated Sep 11, 2023 • 48 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 38.1k • • 41

Image-to-Image Models 🎨

Collection of image-to-image editing, image enhancement (SR, deblur, brighten), and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 40k • 1.17k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • Updated Sep 7, 2023 • 3.55k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • Updated Sep 8, 2023 • 4.54k • 75
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • Updated May 19, 2023 • 895 • 576

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.52M • 1.47k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 2.26M • 847
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11, 2025 • 154k • 489
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 4.01k • 105

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

120

Grounding DINO Demo

💻

120

Cutting edge open-vocabulary object detection app
Running

Featured

103

Owlv2

👀

103

State-of-the-art Zero-shot Object Detection
Configuration error

Featured

41

BLIP2 with transformers

🌖

41

BLIP2 (cutting edge image captioning) in 🤗transformers
Build error

Featured

377

IDEFICS Playground

🐨

377

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

Featured

103

Owlv2

👀

103

State-of-the-art Zero-shot Object Detection
Runtime error

Featured

64

Owl Tracking

⚡

64

Powerful foundation model for zero-shot object tracking
Running

26

Search and Detect (CLIP/OWL-ViT)

🦉

26

Search and detect objects in images using text queries
Running on Zero

Featured

110

OWLSAM

😻

110

State-of-the-art open-vocabulary image segmentation ⚡️

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Runtime error

Featured

84

UDOP

🏃

84

Generate text from document images
Configuration error

40

Pix2struct

📚

40

Play with all the pix2struct variants in this d
Running

26

Compare Docvqa Models

🦀

26

Compare different visual question answering
Runtime error

Featured

290

DocQuery — Document Query Engine

🦉

290

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 49
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 11
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 27

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 167k • 148
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 11.8k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 13.3k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 25.9k • 30

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 10.2k • 77
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 96.9k • 154
Running on Zero

656

Depth Anything V2

🌖

656

Generate depth maps from your photos
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 361 • 17

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

203

Vidore Leaderboard

🥇

203

Browse and compare visual document retrieval model scores
Running on CPU Upgrade

1.01k

Open VLM Leaderboard

🌎

1.01k

VLMEvalKit Evaluation Results Collection
Running

Featured

562

Vision Arena (Testing VLMs side-by-side)

🖼

562

Explore Vision Arena’s computer‑vision tools online
Build error

Featured

85

SEED-Bench Leaderboard

🏆

85

Submit model evaluation results to leaderboard

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 28
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 29 • 2
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 32 • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 50

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14, 2025 • 30.4k • 112
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.2M • 1.27k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • Updated Jan 12, 2025 • 2.36M • 498
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 70.1k • • 718

super cool vision language datasets

ServiceNow/ui-vision

Viewer • Updated May 7, 2025 • 1.46k • 6.51k • 21
xxxllz/Chart2Code-160k

Updated Jul 7, 2025 • 228 • 11
ReCAP-Agent/ReCAP-187k-SFT

Viewer • Updated 18 days ago • 188k • 40 • 6
allenai/MolmoPoint-GUISyn

Viewer • Updated 11 days ago • 37k • 906 • 10

Jan 26 Releases

robbyant/lingbot-world-base-cam

Image-to-Video • Updated Feb 2 • 330
nvidia/C-RADIOv4-H

Feature Extraction • Updated Jan 30 • 2.98k • 66
deepseek-ai/DeepSeek-OCR-2

Image-Text-to-Text • 3B • Updated Feb 3 • 1.27M • 899
arcee-ai/Trinity-Large-Base

Text Generation • 399B • Updated 13 days ago • 398 • 56

Jan 12 Releases

google/translategemma-27b-it

Image-Text-to-Text • Updated Jan 28 • 14k • 363
kakaocorp/kanana-2-30b-a3b-mid-2601

Text Generation • 31B • Updated Jan 15 • 9 • 30
black-forest-labs/FLUX.2-klein-base-4B

Image-to-Image • Updated Feb 24 • 1.67M • • 121
google/translategemma-12b-it

Image-Text-to-Text • Updated Jan 28 • 13.5k • 288

Jan 5 Releases

LiquidAI/LFM2.5-VL-1.6B

Image-Text-to-Text • 2B • Updated 15 days ago • 124k • 270
openbmb/AgentCPM-Explore

Text Generation • 4B • Updated Jan 18 • 147 • 328
Phr00t/LTX2-Rapid-Merges

Image-Text-to-Video • Updated Feb 12 • 348
LiquidAI/LFM2.5-1.2B-Base

Text Generation • 1B • Updated 15 days ago • 21.9k • 122

Dec 19 Releases

nvidia/NitroGen

Reinforcement Learning • Updated Feb 5 • 523
google/gemma-scope-2

Updated Dec 19, 2025 • 81
FunAudioLLM/Fun-ASR-MLT-Nano-2512

Updated Dec 23, 2025 • 149 • 43
facebook/map-anything-v1

Image-to-3D • 0.6B • Updated Dec 19, 2025 • 2.63k • 26

Real-time Vision Models

A collection of real-time detectors.

PekingU/rtdetr_v2_r50vd

Object Detection • 43M • Updated Feb 6, 2025 • 17.7k • 28
ustc-community/dfine-xlarge-obj365

Object Detection • 63.4M • Updated May 5, 2025 • 500 • 5
PekingU/rtdetr_v2_r101vd

Object Detection • 76.8M • Updated Feb 6, 2025 • 3.03k • 14
Running on T4

132

RF-DETR

🔥

132

SOTA real-time object detection model

MetaCLIP2 Multilingual

facebook/metaclip-2-worldwide-s16

Zero-Shot Image Classification • 0.4B • Updated Nov 12, 2025 • 102 • 9
facebook/metaclip-2-worldwide-m16

Zero-Shot Image Classification • 0.5B • Updated Nov 12, 2025 • 791 • 4
facebook/metaclip-2-worldwide-l14

Zero-Shot Image Classification • 1B • Updated Nov 12, 2025 • 415 • 13
facebook/metaclip-2-worldwide-b32

Zero-Shot Image Classification • 0.6B • Updated Nov 12, 2025 • 259 • 7

Sep 30 Releases

deepseek-ai/DeepSeek-V3.2-Exp

Text Generation • Updated Nov 18, 2025 • 171k • • 981
Qwen3-VL

Collection

37 items • Updated Dec 31, 2025 • 689
SDLM

Collection

Sequential Diffusion Language Models • 4 items • Updated Mar 2 • 8
Ming-V2

Collection

Ming is the multi-modal series of any-to-any models developed by Ant Ling team. • 14 items • Updated 20 days ago • 35

Sep 16 Releases

ibm-granite/granite-docling-258M

Image-Text-to-Text • Updated Sep 23, 2025 • 54.3k • 1.15k
XiaomiMiMo/MiMo-Audio-7B-Base

Any-to-Any • 8B • Updated Sep 23, 2025 • 58 • 48
decart-ai/Lucy-Edit-Dev

Video-to-Video • Updated Nov 20, 2025 • 416 • 335
OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 108 • 11

Sep 1 Releases

openbmb/MiniCPM4.1-8B

Text Generation • Updated Oct 24, 2025 • 24.3k • 386
tencent/Hunyuan-MT-7B

Translation • 8B • Updated Dec 30, 2025 • 24.8k • 553
google/embeddinggemma-300m

Sentence Similarity • 0.3B • Updated Sep 25, 2025 • 1.01M • • 1.59k
moonshotai/Kimi-K2-Instruct-0905

Text Generation • 1T • Updated Jan 30 • 240k • • 698

Aug 22 Releases

Qwen/Qwen-Image-Edit

Image-to-Image • Updated Aug 25, 2025 • 80.2k • • 2.37k
internlm/Intern-S1-mini

Image-Text-to-Text • 9B • Updated 16 days ago • 2.39k • 114
xai-org/grok-2

Updated Nov 5, 2025 • 21.7k • 1.06k
ByteDance-Seed/Seed-OSS-36B-Instruct

Text Generation • Updated Aug 26, 2025 • 13.7k • 495

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated Jan 29 • 142k • 166
nunchaku-ai/nunchaku-flux.1-krea-dev

Text-to-Image • Updated Nov 16, 2025 • 5.67k • 120
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated Aug 26, 2025 • 9.68k • • 67
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • Updated Aug 9, 2025 • 73.4k • 118

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated Sep 16, 2025 • 138k • • 123
ByteDance-Seed/Seed-X-RM-7B

Translation • Updated Jul 31, 2025 • 66 • 30
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated Aug 4, 2025 • 23.5k • 282
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated Jul 17, 2025 • 522 • 93

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated Dec 8, 2025 • 1.73k • 316
BAAI/MTVCraft

Text-to-Video • Updated Jul 7, 2025 • 13 • 36
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated Sep 11, 2025 • 34.9k • 374
apple/DiffuCoder-7B-Base

8B • Updated Dec 8, 2025 • 673 • 29

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Jan 30 • 65.1k • 355
mistralai/Mistral-Small-3.2-24B-Instruct-2506

Updated Dec 22, 2025 • 1.03M • 579
kyutai/stt-1b-en_fr

Automatic Speech Recognition • Updated Nov 18, 2025 • 124
google/magenta-realtime

Updated Aug 29, 2025 • 225 • 546

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12, 2025 • 62.2k • 65
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 205
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 24.9k • 1.59k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated Oct 17, 2025 • 38.4k • 890

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29, 2025 • 785k • • 2.42k
Running on Zero

Featured

216

BAGEL

🚀

216

Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27, 2025 • 75
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5, 2025 • 435 • 44

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17, 2025 • 17 • 82
Paused

MCP

Featured

1.49k

LTX Video Fast

🎥

1.49k

ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26, 2025 • 27.2M • 4.16k • 59
BLIP3o/BLIP3o-Model-8B

Updated Jun 4, 2025 • 626 • 101

Any-to-Any Models, Datasets, Spaces

Runtime error

Featured

83

MMaDA

🌍

83

Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

Featured

216

BAGEL

🚀

216

Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • Updated May 24, 2025 • 895 • 89
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 11.1k • 1.19k

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23, 2025 • 188k • 10
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23, 2025 • 8.7k • 3
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23, 2025 • 51.2k • 9
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23, 2025 • 5.44k

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18, 2025 • 4 • 11
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26, 2025 • 20.1k • 68
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 6.18k • 56
llamaindex/vdr-2b-multi-v1

Image-Text-to-Text • 2B • Updated 6 days ago • 1.57k • 128

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27, 2025 • 537k • • 3.1k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • Updated Apr 30, 2025 • 458k • 1.89k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10, 2025 • 95 • 59
Running

Featured

371

Qwen2.5 Omni 7B Demo

🏆

371

Chat with AI using text, audio, images, and video

Türkçe VLMler

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.2M • 1.27k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • Updated Jan 12, 2025 • 2.36M • 498
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated Jan 9 • 82.2k • 317
CohereLabs/aya-vision-32b

Image-Text-to-Text • Updated Jan 9 • 60 • • 224

Feb 7 Releases 🧣

lerobot/pi0_old

Robotics • 4B • Updated Sep 19, 2025 • 5.29k • 307
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28, 2025 • 24 • 61
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • Updated Mar 30, 2025 • 1.49k • • 358
adyen/DABstep

Viewer • Updated about 2 hours ago • 460 • 4.15k • 44

Models, Jan 27

Running on Zero

266

Qwen2-VL-7B

🔥

266

Answer questions about your images
Running

66

UI-TARS

🌖

66

Find click coordinates on images based on instructions
Running

100

Qwen2.5-1M Demo

💻

100

Ask questions about your uploaded documents
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29, 2025 • 4.51k • • 332

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 111k • 1.29k
MiniMaxAI/MiniMax-Text-01

Text Generation • Updated Jul 3, 2025 • 11.8k • 652
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • Updated Apr 24, 2025 • 132 • 108
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14, 2025 • 16.4k • 3.17k • 186

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 445k • • 2.7k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 286 • 80
google/paligemma2-3b-pt-224

Image-Text-to-Text • Updated Dec 5, 2024 • 18.6k • 167
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 1.27k • • 2.15k

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated Jul 28, 2025 • 282 • 432
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 1.08k • 460
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Nov 16, 2025 • 5.25k • 158
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated 6 days ago • 56.8k • 330

Nov 1 Releases

Running on Zero

88

LongVU

🌖

88

Generate responses to video or image inputs
facebook/MobileLLM-1B

Text Generation • Updated May 5, 2025 • 222 • 122
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 189 • 76
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28, 2025 • 2 • 6

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • Updated Dec 19, 2024 • 23k • 206
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 5.05k • 49
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated Jan 9 • 17.8k • 424
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated Jan 9 • 8.11k • • 291

New Depth Models

Recent depth models

Running on Zero

Featured

207

DepthCrafter

🦀

207

a super consistent video depth model
Paused

Featured

223

Depth Pro

🚀

223

Generate an inverse depth map from an image
Runtime error

78

Lotus Depth

🚀

78

Official Demo of Lotus (https://lotus3d.github.io/)
apple/DepthPro

Depth Estimation • Updated Feb 28, 2025 • 2.72k • 507

Computer Vision Backbones 🧩

Collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • Updated Feb 13, 2024 • 247k • • 490
google/vit-base-patch16-224-in21k

Image Feature Extraction • 86.4M • Updated Feb 5, 2024 • 6.05M • 404
google/vit-base-patch32-224-in21k

Image Feature Extraction • 88M • Updated Dec 8, 2022 • 10.6k • 19
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.33M • 105

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 41.6M • Updated Apr 10, 2024 • 222k • • 943
facebook/detr-resnet-101-dc5

Object Detection • 60.7M • Updated Sep 6, 2023 • 2.02k • 19
facebook/detr-resnet-50-dc5

Object Detection • 41.6M • Updated Sep 7, 2023 • 3.93k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 167k • 148

Zero-shot Image Classification Models 🖼️

This is a collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 28.7M • 1.99k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 20.7M • 910
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22, 2025 • 70.9k • 309
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 13.6k • 31

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 233k • 109
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 16.1k • 43
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 19.8k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 16.4k • 11

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • Updated Oct 30, 2023 • 1.93M • • 7.61k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 171 • 176
Deci/DeciDiffusion-v1-0

Text-to-Image • Updated Feb 15, 2024 • 17 • 140
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • Updated Sep 25, 2023 • 246k • 2.03k

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 192k • 192
facebook/sam-vit-base

Mask Generation • 93.7M • Updated Jan 11, 2024 • 386k • 165
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 22.3k • 33
Runtime error

43

Grounded SAM

💩

43

SigLIP

A collection dedicated to SigLIP applications

Running on Zero

Featured

73

Draw To Search Art

🐠

73

Draw/upload image and search among WikiART using SigLIP
Running on CPU Upgrade

23

Compare Clip Siglip

🏃

23

Compare strong zero-shot image classification models
Running on Zero

13

Multilingual Zero Shot Image Clf

🏢

13

Comparing powerful multilingual zero-shot image clf models
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 63 • 48

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated Feb 22, 2024 • 20.8k • 5
BAAI/SegGPT

Updated Apr 21, 2023 • 19

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 167k • 148
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 11.8k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 13.3k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 25.9k • 30

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 167k • 148
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 11.8k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 13.3k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 25.9k • 30

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Video Language Models

A collection of video-language models

Paused

21

Video Llava

🐨

21

Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 87.8k • 121
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 398 • 11
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Nov 11, 2025 • 215 • 8

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 19 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 13 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 18 • 26
Runtime error

64

Eagle X5 13B Chat

🚀

64

Combine text and images to generate responses

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
Runtime error

Featured

60

OWLSAM2

🏃

60

super cool vision language datasets

ServiceNow/ui-vision

Viewer • Updated May 7, 2025 • 1.46k • 6.51k • 21
xxxllz/Chart2Code-160k

Updated Jul 7, 2025 • 228 • 11
ReCAP-Agent/ReCAP-187k-SFT

Viewer • Updated 18 days ago • 188k • 40 • 6
allenai/MolmoPoint-GUISyn

Viewer • Updated 11 days ago • 37k • 906 • 10

Jan 26 Releases

robbyant/lingbot-world-base-cam

Image-to-Video • Updated Feb 2 • 330
nvidia/C-RADIOv4-H

Feature Extraction • Updated Jan 30 • 2.98k • 66
deepseek-ai/DeepSeek-OCR-2

Image-Text-to-Text • 3B • Updated Feb 3 • 1.27M • 899
arcee-ai/Trinity-Large-Base

Text Generation • 399B • Updated 13 days ago • 398 • 56

Jan 12 Releases

google/translategemma-27b-it

Image-Text-to-Text • Updated Jan 28 • 14k • 363
kakaocorp/kanana-2-30b-a3b-mid-2601

Text Generation • 31B • Updated Jan 15 • 9 • 30
black-forest-labs/FLUX.2-klein-base-4B

Image-to-Image • Updated Feb 24 • 1.67M • • 121
google/translategemma-12b-it

Image-Text-to-Text • Updated Jan 28 • 13.5k • 288

Jan 5 Releases

LiquidAI/LFM2.5-VL-1.6B

Image-Text-to-Text • 2B • Updated 15 days ago • 124k • 270
openbmb/AgentCPM-Explore

Text Generation • 4B • Updated Jan 18 • 147 • 328
Phr00t/LTX2-Rapid-Merges

Image-Text-to-Video • Updated Feb 12 • 348
LiquidAI/LFM2.5-1.2B-Base

Text Generation • 1B • Updated 15 days ago • 21.9k • 122

Dec 30 Releases

Wuli-art/Qwen-Image-2512-Turbo-LoRA

Text-to-Image • Updated Jan 8 • 9.92k • 213
miromind-ai/MiroThinker-v1.5-235B

Text Generation • 235B • Updated 25 days ago • 122 • 254
prithivMLmods/Qwen-Image-Edit-2511-Object-Remover

Image-to-Image • Updated Jan 4 • 8.71k • • 56
tencent/Youtu-LLM-2B-Base

Text Generation • Updated Feb 24 • 3.44k • 42

Dec 19 Releases

nvidia/NitroGen

Reinforcement Learning • Updated Feb 5 • 523
google/gemma-scope-2

Updated Dec 19, 2025 • 81
FunAudioLLM/Fun-ASR-MLT-Nano-2512

Updated Dec 23, 2025 • 149 • 43
facebook/map-anything-v1

Image-to-3D • 0.6B • Updated Dec 19, 2025 • 2.63k • 26

Dec 12 Releases

openai/circuit-sparsity

Text Generation • 0.4B • Updated Dec 12, 2025 • 1.36k • 204
FunAudioLLM/Fun-CosyVoice3-0.5B-2512

Text-to-Speech • Updated Feb 3 • 73.7k • 511
DiffSynth-Studio/Qwen-Image-i2L

Updated Dec 16, 2025 • 255
Aratako/T5Gemma-TTS-2b-2b

Text-to-Speech • 5B • Updated 11 days ago • 590 • 114

Real-time Vision Models

A collection of real-time detectors.

PekingU/rtdetr_v2_r50vd

Object Detection • 43M • Updated Feb 6, 2025 • 17.7k • 28
ustc-community/dfine-xlarge-obj365

Object Detection • 63.4M • Updated May 5, 2025 • 500 • 5
PekingU/rtdetr_v2_r101vd

Object Detection • 76.8M • Updated Feb 6, 2025 • 3.03k • 14
Running on T4

132

RF-DETR

🔥

132

SOTA real-time object detection model

SAM3

facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 1.94M • 1.86k
Running on Zero

Featured

112

SAM3 Video Segmentation

🐠

112

Track and label objects in videos using text prompts or clicks
onnx-community/sam3-tracker-ONNX

Mask Generation • Updated Nov 19, 2025 • 1.19k • 30
Running

30

SAM3 Tracker WebGPU

🎯

30

Segment images with click points and download cutouts

MetaCLIP2 Multilingual

facebook/metaclip-2-worldwide-s16

Zero-Shot Image Classification • 0.4B • Updated Nov 12, 2025 • 102 • 9
facebook/metaclip-2-worldwide-m16

Zero-Shot Image Classification • 0.5B • Updated Nov 12, 2025 • 791 • 4
facebook/metaclip-2-worldwide-l14

Zero-Shot Image Classification • 1B • Updated Nov 12, 2025 • 415 • 13
facebook/metaclip-2-worldwide-b32

Zero-Shot Image Classification • 0.6B • Updated Nov 12, 2025 • 259 • 7

Oct 6 Releases

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 36 • 157
LiquidAI/LFM2-8B-A1B

Text Generation • 8B • Updated 15 days ago • 49.8k • 351
yanolja/YanoljaNEXT-Rosetta-12B-2510

Translation • 12B • Updated Nov 2, 2025 • 115 • 29
NeuML/colbert-muvera-femto

Sentence Similarity • 243k • Updated Dec 12, 2025 • 4 • 20

Sep 30 Releases

deepseek-ai/DeepSeek-V3.2-Exp

Text Generation • Updated Nov 18, 2025 • 171k • • 981
Qwen3-VL

Collection

37 items • Updated Dec 31, 2025 • 689
SDLM

Collection

Sequential Diffusion Language Models • 4 items • Updated Mar 2 • 8
Ming-V2

Collection

Ming is the multi-modal series of any-to-any models developed by Ant Ling team. • 14 items • Updated 20 days ago • 35

Sep 23 Releases

ByteDance/lynx

Image-to-Video • Updated Sep 27, 2025 • • 138
tencent/HunyuanImage-3.0

Text-to-Image • Updated Jan 28 • 18.4k • • 662
meituan-longcat/LongCat-Flash-Thinking

Text Generation • Updated Sep 24, 2025 • 137 • 146
Qwen/Qwen3Guard-Gen-4B

Text Generation • 4B • Updated Nov 7, 2025 • 4.65k • 40

Sep 16 Releases

ibm-granite/granite-docling-258M

Image-Text-to-Text • Updated Sep 23, 2025 • 54.3k • 1.15k
XiaomiMiMo/MiMo-Audio-7B-Base

Any-to-Any • 8B • Updated Sep 23, 2025 • 58 • 48
decart-ai/Lucy-Edit-Dev

Video-to-Video • Updated Nov 20, 2025 • 416 • 335
OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 108 • 11

Sep 11 Releases

bytedance-research/HuMo

Image-to-Video • Updated Sep 18, 2025 • 85 • 216
facebook/MobileLLM-R1-950M

Text Generation • 0.9B • Updated Sep 30, 2025 • 521 • 283
tencent/POINTS-Reader

Image-Text-to-Text • 4B • Updated Sep 12, 2025 • 185 • 102
baidu/ERNIE-4.5-21B-A3B-Thinking

Text Generation • 22B • Updated Nov 26, 2025 • 712 • 777

Sep 1 Releases

openbmb/MiniCPM4.1-8B

Text Generation • Updated Oct 24, 2025 • 24.3k • 386
tencent/Hunyuan-MT-7B

Translation • 8B • Updated Dec 30, 2025 • 24.8k • 553
google/embeddinggemma-300m

Sentence Similarity • 0.3B • Updated Sep 25, 2025 • 1.01M • • 1.59k
moonshotai/Kimi-K2-Instruct-0905

Text Generation • 1T • Updated Jan 30 • 240k • • 698

August 29 Releases

microsoft/VibeVoice-1.5B

Text-to-Speech • 3B • Updated Jan 22 • 95.9k • 2.32k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview

Image-Text-to-Text • 0.4B • Updated Aug 29, 2025 • 46.5k • 82
apple/FastVLM-1.5B

Text Generation • 2B • Updated Sep 3, 2025 • 2.05k • 80
stepfun-ai/Step-Audio-2-mini

Any-to-Any • Updated Feb 14 • 2k • 254

Aug 22 Releases

Qwen/Qwen-Image-Edit

Image-to-Image • Updated Aug 25, 2025 • 80.2k • • 2.37k
internlm/Intern-S1-mini

Image-Text-to-Text • 9B • Updated 16 days ago • 2.39k • 114
xai-org/grok-2

Updated Nov 5, 2025 • 21.7k • 1.06k
ByteDance-Seed/Seed-OSS-36B-Instruct

Text Generation • Updated Aug 26, 2025 • 13.7k • 495

Releases August 9

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26, 2025 • 3.47M • • 4.68k
openai/gpt-oss-20b

Text Generation • 22B • Updated Aug 26, 2025 • 6.01M • • 4.53k
openai/BrowseCompLongContext

Viewer • Updated Aug 9, 2025 • 295 • 657 • 50
baichuan-inc/Baichuan-M2-32B

Text Generation • 33B • Updated Dec 24, 2025 • 92.3k • 120

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated Jan 29 • 142k • 166
nunchaku-ai/nunchaku-flux.1-krea-dev

Text-to-Image • Updated Nov 16, 2025 • 5.67k • 120
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated Aug 26, 2025 • 9.68k • • 67
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • Updated Aug 9, 2025 • 73.4k • 118

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated Aug 7, 2025 • 8.72k • • 678
allenai/olmOCR-7B-0725

Image-Text-to-Text • 8B • Updated Aug 26, 2025 • 511 • 64
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated Aug 7, 2025 • 16.4k • • 451
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • Updated Aug 17, 2025 • 79.6k • • 404

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated Sep 16, 2025 • 138k • • 123
ByteDance-Seed/Seed-X-RM-7B

Translation • Updated Jul 31, 2025 • 66 • 30
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated Aug 4, 2025 • 23.5k • 282
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated Jul 17, 2025 • 522 • 93

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated Sep 10, 2025 • 1.08M • 930
moonshotai/Kimi-K2-Instruct

Text Generation • 1T • Updated Jan 30 • 175k • • 2.34k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated Jul 7, 2025 • 105 • • 53
Alibaba-NLP/WebSailor-3B

3B • Updated Jul 10, 2025 • 19 • 74

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated Dec 8, 2025 • 1.73k • 316
BAAI/MTVCraft

Text-to-Video • Updated Jul 7, 2025 • 13 • 36
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated Sep 11, 2025 • 34.9k • 374
apple/DiffuCoder-7B-Base

8B • Updated Dec 8, 2025 • 673 • 29

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3, 2025 • 19.4k • 129
google/gemma-3n-E4B-it

Image-Text-to-Text • Updated Jul 14, 2025 • 39k • • 899
ByteDance/XVerse

Text-to-Image • Updated Jul 1, 2025 • 34 • 89
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • Updated Feb 4 • 310 • 74

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Jan 30 • 65.1k • 355
mistralai/Mistral-Small-3.2-24B-Instruct-2506

Updated Dec 22, 2025 • 1.03M • 579
kyutai/stt-1b-en_fr

Automatic Speech Recognition • Updated Nov 18, 2025 • 124
google/magenta-realtime

Updated Aug 29, 2025 • 225 • 546

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated 4 days ago • 1.65k • 14.4k • 80
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 24.9k • 1.59k
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 302 • 515
Running on Zero

MCP

Featured

142

Multimodal OCR2

💻

142

FireRed / Nanonets / Monkey / Thyme / Typhoon / SmolDocling

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12, 2025 • 62.2k • 65
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 205
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 24.9k • 1.59k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated Oct 17, 2025 • 38.4k • 890

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Jun 9, 2025 • 746k • 124
echo840/MonkeyOCR

Image-Text-to-Text • Updated Mar 3 • 302 • 515
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Oct 24, 2025 • 1.43k • 283
arcee-ai/Homunculus

Text Generation • Updated Jun 3, 2025 • 21 • 99

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29, 2025 • 785k • • 2.42k
Running on Zero

Featured

216

BAGEL

🚀

216

Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27, 2025 • 75
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5, 2025 • 435 • 44

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 11.1k • 1.19k
mistralai/Devstral-Small-2505

24B • Updated Aug 18, 2025 • 95.1k • 866
ByteDance/Dolphin

Image-Text-to-Text • Updated Jul 16, 2025 • 450 • 515
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22, 2025 • 11.5k • 68

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17, 2025 • 17 • 82
Paused

MCP

Featured

1.49k

LTX Video Fast

🎥

1.49k

ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26, 2025 • 27.2M • 4.16k • 59
BLIP3o/BLIP3o-Model-8B

Updated Jun 4, 2025 • 626 • 101

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6, 2025 • 191
stepfun-ai/Step1X-3D

Updated May 13, 2025 • 106
cognition-ai/Kevin-32B

33B • Updated May 6, 2025 • 131 • 164
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • Updated Nov 10, 2025 • 536 • 126

Any-to-Any Models, Datasets, Spaces

Runtime error

Featured

83

MMaDA

🌍

83

Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

Featured

216

BAGEL

🚀

216

Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • Updated May 24, 2025 • 895 • 89
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jan 9 • 11.1k • 1.19k

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30, 2025 • 3 • 31
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • Updated about 17 hours ago • 168k • 1.46k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • Updated Apr 30, 2025 • 1.31k • • 825
Qwen/Qwen2.5-Omni-3B

Any-to-Any • Updated Apr 30, 2025 • 446k • 332

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23, 2025 • 188k • 10
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23, 2025 • 8.7k • 3
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23, 2025 • 51.2k • 9
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23, 2025 • 5.44k

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16, 2025 • 136 • 46 • 12
Junfeng5/Liquid_V1_7B

Any-to-Any • Updated Mar 20, 2025 • 2.41k • 94

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18, 2025 • 4 • 11
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26, 2025 • 20.1k • 68
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 6.18k • 56
llamaindex/vdr-2b-multi-v1

Image-Text-to-Text • 2B • Updated 6 days ago • 1.57k • 128

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jan 30 • 104k • 447
agentica-org/DeepCoder-14B-Preview

Text Generation • Updated May 11, 2025 • 412 • • 680
HiDream-ai/HiDream-I1-Full

Text-to-Image • Updated Jul 17, 2025 • 23.1k • • 989
OpenGVLab/InternVL3-78B

Image-Text-to-Text • Updated Sep 11, 2025 • 40.2k • 234

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27, 2025 • 537k • • 3.1k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • Updated Apr 30, 2025 • 458k • 1.89k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10, 2025 • 95 • 59
Running

Featured

371

Qwen2.5 Omni 7B Demo

🏆

371

Chat with AI using text, audio, images, and video

March 21 Releases

docling-project/SmolDocling-256M-preview

Image-Text-to-Text • Updated Sep 17, 2025 • 51.9k • 1.61k
sesame/csm-1b

Text-to-Speech • Updated Dec 1, 2025 • 154k • 2.36k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

Updated Dec 22, 2025 • 532k • 1.36k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated Oct 17, 2025 • 14.7k • 129

Türkçe VLMler (Turkish VLMs)

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.2M • 1.27k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • Updated Jan 12, 2025 • 2.36M • 498
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated Jan 9 • 82.2k • 317
CohereLabs/aya-vision-32b

Image-Text-to-Text • Updated Jan 9 • 60 • • 224

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 13.9k • 89
AIDC-AI/Ovis2-34B

Image-Text-to-Text • 35B • Updated Aug 15, 2025 • 301 • 142
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28, 2025 • 35 • • 54
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1, 2025 • 2.03M • 465

Feb 7 Releases 🧣

lerobot/pi0_old

Robotics • 4B • Updated Sep 19, 2025 • 5.29k • 307
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28, 2025 • 24 • 61
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • Updated Mar 30, 2025 • 1.49k • • 358
adyen/DABstep

Viewer • Updated about 2 hours ago • 460 • 4.15k • 44

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • Updated Feb 10, 2025 • 2.11k • 111
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6, 2025 • 108k • • 606
mistralai/Mistral-Small-24B-Instruct-2501

Updated Jul 28, 2025 • 114k • 950
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 64k • 3.57k

Models, Jan 27

Running on Zero

266

Qwen2-VL-7B

🔥

266

Answer questions about your images
Running

66

UI-TARS

🌖

66

Find click coordinates on images based on instructions
Running

100

Qwen2.5-1M Demo

💻

100

Ask questions about your uploaded documents
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29, 2025 • 4.51k • • 332

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • Updated Jan 19, 2025 • 330 • 481
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17, 2025 • 148 • 73
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8, 2025 • 48.5k • 192
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27, 2025 • 3.51M • • 13.2k

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 111k • 1.29k
MiniMaxAI/MiniMax-Text-01

Text Generation • Updated Jul 3, 2025 • 11.8k • 652
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • Updated Apr 24, 2025 • 132 • 108
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14, 2025 • 16.4k • 3.17k • 186

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated Sep 23, 2025 • 2.67M • 1.4k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17, 2025 • 1.4k • 159
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 609 • 29
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11, 2025 • 18 • 56

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 445k • • 2.7k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 286 • 80
google/paligemma2-3b-pt-224

Image-Text-to-Text • Updated Dec 5, 2024 • 18.6k • 167
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 1.27k • • 2.15k

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 29.6k • 583
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12, 2025 • 8.07k • • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Nov 26, 2025 • 434 • 157
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 61 • 55

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated Jul 28, 2025 • 282 • 432
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 1.08k • 460
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Nov 16, 2025 • 5.25k • 158
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated 6 days ago • 56.8k • 330

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 72 • 61
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8, 2025 • 87 • 11
PleIAs/common_corpus

Viewer • Updated Feb 19 • 69.9k • 242k • 390
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12, 2025 • 1.13M • • 2k

Nov 1 Releases

Running on Zero

88

LongVU

🌖

88

Generate responses to video or image inputs
facebook/MobileLLM-1B

Text Generation • Updated May 5, 2025 • 222 • 122
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 189 • 76
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28, 2025 • 2 • 6

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 74
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 48
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 121

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • Updated Dec 19, 2024 • 23k • 206
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 5.05k • 49
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated Jan 9 • 17.8k • 424
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated Jan 9 • 8.11k • • 291

LOTUS 🪷

Runtime error

Featured

101

Lotus Normal

🌍

101

Official Demo of Lotus (https://lotus3d.github.io/)
Runtime error

78

Lotus Depth

🚀

78

Official Demo of Lotus (https://lotus3d.github.io/)
jingheya/lotus-depth-g-v1-0

Depth Estimation • Updated Oct 5, 2024 • 9.74k • 27
jingheya/lotus-depth-d-v1-0

Depth Estimation • Updated Oct 5, 2024 • 420 • 5

New Depth Models

Recent depth models

Running on Zero

Featured

207

DepthCrafter

🦀

207

a super consistent video depth model
Paused

Featured

223

Depth Pro

🚀

223

Generate an inverse depth map from an image
Runtime error

78

Lotus Depth

🚀

78

Official Demo of Lotus (https://lotus3d.github.io/)
apple/DepthPro

Depth Estimation • Updated Feb 28, 2025 • 2.72k • 507

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.33M • 105
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 94.8k • 532
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 45.4k • 11
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 2.99k • 22

Computer Vision Backbones 🧩

A collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • Updated Feb 13, 2024 • 247k • • 490
google/vit-base-patch16-224-in21k

Image Feature Extraction • 86.4M • Updated Feb 5, 2024 • 6.05M • 404
google/vit-base-patch32-224-in21k

Image Feature Extraction • 88M • Updated Dec 8, 2022 • 10.6k • 19
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 1.33M • 105

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 87.6M • Updated Sep 12, 2023 • 190k • • 8
facebook/convnextv2-base-1k-224

Image Classification • 88.7M • Updated Feb 17, 2025 • 5.43k • • 4
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 6.24k • • 33
google/vit-base-patch32-384

Image Classification • 88.3M • Updated Sep 11, 2023 • 4.49k • • 23

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 41.6M • Updated Apr 10, 2024 • 222k • • 943
facebook/detr-resnet-101-dc5

Object Detection • 60.7M • Updated Sep 6, 2023 • 2.02k • 19
facebook/detr-resnet-50-dc5

Object Detection • 41.6M • Updated Sep 7, 2023 • 3.93k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 167k • 148

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 408 • 27
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 3.75M • Updated Jan 14, 2024 • 568k • • 184
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 43M • Updated Sep 11, 2023 • 48 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 38.1k • • 41

Zero-shot Image Classification Models 🖼️

This is a collection for models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 28.7M • 1.99k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 20.7M • 910
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22, 2025 • 70.9k • 309
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 13.6k • 31

Image-to-Image Models 🎨

Collection of image to image editing, image enhancement (SR, deblur, brighten) and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 40k • 1.17k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • Updated Sep 7, 2023 • 3.55k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • Updated Sep 8, 2023 • 4.54k • 75
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • Updated May 19, 2023 • 895 • 576

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 233k • 109
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 16.1k • 43
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 19.8k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 16.4k • 11

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.52M • 1.47k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 2.26M • 847
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11, 2025 • 154k • 489
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 4.01k • 105

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

120

Grounding DINO Demo

💻

120

Cutting edge open-vocabulary object detection app
Running

Featured

103

Owlv2

👀

103

State-of-the-art Zero-shot Object Detection
Configuration error

Featured

41

BLIP2 with transformers

🌖

41

BLIP2 (cutting edge image captioning) in 🤗transformers
Build error

Featured

377

IDEFICS Playground

🐨

377

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

Featured

103

Owlv2

👀

103

State-of-the-art Zero-shot Object Detection
Runtime error

Featured

64

Owl Tracking

⚡

64

Powerful foundation model for zero-shot object tracking
Running

26

Search and Detect (CLIP/OWL-ViT)

🦉

26

Search and detect objects in images using text queries
Running on Zero

Featured

110

OWLSAM

😻

110

State-of-the-art open-vocabulary image segmentation ⚡️

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Runtime error

Featured

84

UDOP

🏃

84

Generate text from document images
Configuration error

40

Pix2struct

📚

40

Play with all the pix2struct variants in this d
Running

26

Compare Docvqa Models

🦀

26

Compare different visual question answering
Runtime error

Featured

290

DocQuery — Document Query Engine

🦉

290

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 49
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 11
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 27

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 167k • 148
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 11.8k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 13.3k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 25.9k • 30

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 10.2k • 77
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 96.9k • 154
Running on Zero

656

Depth Anything V2

🌖

Generate depth maps from your photos
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 361 • 17

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

203

Vidore Leaderboard

🥇

Browse and compare visual document retrieval model scores
Running on CPU Upgrade

1.01k

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

Featured

562

Vision Arena (Testing VLMs side-by-side)

🖼

Explore Vision Arena’s computer‑vision tools online
Build error

Featured

85

SEED-Bench Leaderboard

🏆

Submit model evaluation results to leaderboard

Video Language Models

A collection of video-language models

Paused

21

Video Llava

🐨

Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 87.8k • 121
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 398 • 11
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Nov 11, 2025 • 215 • 8

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 28
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 29 • 2
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 32 • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 50

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 19 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 13 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 18 • 26
Runtime error

64

Eagle X5 13B Chat

🚀

Combine text and images to generate responses

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14, 2025 • 30.4k • 112
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.2M • 1.27k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • Updated Jan 12, 2025 • 2.36M • 498
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 70.1k • 718

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
Runtime error

Featured

60

OWLSAM2

🏃

YOLO26

YOLO26 WebGPU

SAM3 Video Segmentation

SAM3 Tracker WebGPU

Multimodal OCR2

Lotus Normal

Lotus Depth

Grounding DINO Demo

Owlv2

BLIP2 with transformers

IDEFICS Playground