Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
VLV Captioner (Qwen 2.5 3B)
This repository hosts the 3-billion-parameter Vision-Language-Vision (VLV) Captioner, a captioning model distilled from diffusion models and built on top of Qwen 2.5 3B.
Checkpoint URL: https://huggingface.co/lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B
1 · Install Dependencies
```bash
# inside your virtualenv / conda env
pip install -r requirements.txt
```
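Before loading the 3-billion-parameter checkpoint, it is worth confirming that PyTorch can see a GPU; the usage example below falls back to CPU, but inference there will be slow. A minimal check, assuming `torch` is installed via `requirements.txt`:

```python
import torch

# the example in section 2 selects "cuda" automatically when this prints True
print("CUDA available:", torch.cuda.is_available())
```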
2 · Example Usage
```python
from transformers import AutoModel
from PIL import Image
import numpy as np
import torch

MODEL_NAME = "lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B"
device = "cuda" if torch.cuda.is_available() else "cpu"

# ────── load model ──────
model = (
    AutoModel.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True,
        low_cpu_mem_usage=False,
    )
    .to(device)
    .eval()
)

# ────── helpers ──────
def _trim_tail(text: str) -> str:
    """Remove an incomplete trailing sentence fragment, if any."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not text.rstrip().endswith("."):
        sentences = sentences[:-1]  # drop the dangling fragment
    return ". ".join(sentences) + ("." if sentences else "")


def caption_image(img: Image.Image, max_len: int = 77) -> str:
    """Generate a caption for one PIL image."""
    with torch.no_grad():
        raw = model([img], max_len).generated_text[0]
    return _trim_tail(raw)


def caption_from_numpy(arr: np.ndarray, max_len: int = 77) -> str:
    """
    Wrapper for H x W x 3 NumPy arrays.
    Accepts uint8 [0, 255] or float [0, 1] ranges.
    """
    if arr.dtype != np.uint8:
        arr = (np.clip(arr, 0, 1) * 255).astype(np.uint8)  # rescale floats to uint8
    return caption_image(Image.fromarray(arr, mode="RGB"), max_len)
```
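Because the forward pass already takes a list of images, several images can be captioned in one call. The sketch below assumes the model returns one `generated_text` entry per input image; that signature is inferred from the single-image call above and may need adjusting:

```python
def caption_batch(imgs: list[Image.Image], max_len: int = 77) -> list[str]:
    """Caption several PIL images in a single forward pass (sketch)."""
    with torch.no_grad():
        raws = model(imgs, max_len).generated_text  # assumed: one string per image
    return [_trim_tail(raw) for raw in raws]

# usage: captions = caption_batch([img_a, img_b, img_c])
```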
3 · Quick Test
```python
# caption a remote sample image (cat photo) in one cell
import io

import requests
from PIL import Image
from IPython.display import display  # Jupyter/Colab only

IMG_URL = "https://huggingface.co/datasets/huggingface/cats-image/resolve/main/cats_image.jpeg"

# download & open (raise_for_status surfaces HTTP errors early)
resp = requests.get(IMG_URL, timeout=10)
resp.raise_for_status()
img = Image.open(io.BytesIO(resp.content)).convert("RGB")

display(img)               # show the image
print(caption_image(img))  # generate and print the caption
```
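The same helper scales to a whole directory, e.g. to dump captions for a local image set. A minimal sketch, where `images/` and `captions.jsonl` are hypothetical paths:

```python
import json
from pathlib import Path

from PIL import Image

IMAGE_DIR = Path("images")         # hypothetical input directory
OUT_FILE = Path("captions.jsonl")  # hypothetical output file

with OUT_FILE.open("w") as f:
    for path in sorted(IMAGE_DIR.glob("*.jpg")):
        img = Image.open(path).convert("RGB")
        record = {"file": path.name, "caption": caption_image(img)}  # helper from section 2
        f.write(json.dumps(record) + "\n")
```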
4 · Citation
```bibtex
@article{zhang2025vision,
  title   = {Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author  = {Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan and Wei, Chen and Xiao, Junfei},
  journal = {arXiv preprint arXiv:2507.07104},
  year    = {2025}
}
```