🖼️➡️📚 Image-Text-to-Text - a Susant-Achary Collection

Susant-Achary 's Collections

🛩️Qwen3-VL

<7B Best of MoE 🧠

🍎 MLX-Quantized Models (3/4/5/6-bit) Mac & iOS

🖼️ Vision Backbones & Image Embeddings

Feature Extraction with 🧠 Text Embeddings

🧊Sept 25 <Image-to-3D> [Top Releases]

🪶 Sept’25 <Text Generation Language Models >(Top Releases)

🎬 ✍️ Sept 25 <Video & Text2Video> (Top Releases)

🖼️ **Text2Image, i2i ** September ’25 (Top Releases)

Top Apache 2.0 License

📄➡️🔊 Text-to-Speech (TTS)

✍️➡️🎬 Text-to-Video

📚➡️🎨Text-to-Image

🖌️ Image-to-Image

🎨➡️✍️ Image-to-Text

🖼️➡️📚 Image-Text-to-Text

🌀 Any-to-Any Multimodal Models

✍️ Text Generation

👨‍💻Mathematical Reasoning 🧮

🧠General Purpose Dataset < 10M samples

🧩 Long-Context Models (≥128k) CODING

🍎 MLX-Ready LLMs

🧩 Long-Context Models (≥128k) under 8B

📱 OnDevice -Ready SLMs (≤4B)

Qwen3

GPT2-JungleBook-from-Scratch-Models

🖼️➡️📚 Image-Text-to-Text

updated 22 days ago

Multimodal models that take image + text as input and produce natural language output. Use cases: chart QA, visual document reasoning, VQA.