Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models


VLV Captioner (Qwen 2.5 3B)

This repository hosts the 3-billion-parameter Vision-Language-Vision (VLV) Captioner, a captioning model trained via scalable knowledge distillation from diffusion models and built on top of Qwen 2.5 3B.
Checkpoint URL: https://huggingface.co/lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B


1 · Install Dependencies

# inside your virtualenv / conda env
pip install -r requirements.txt
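
If you are not working from a full repository checkout, the snippets below only need a handful of packages. The list here is an assumption read off the imports in this card, not the pinned requirements.txt:

# minimal set (assumed from the imports below, not from requirements.txt)
pip install torch transformers pillow numpy requests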

2 · Example Usage

from transformers import AutoModel
from PIL import Image
import torch, numpy as np

MODEL_NAME = "lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B"
device = "cuda" if torch.cuda.is_available() else "cpu"

# ────── load model ──────
model = (
    AutoModel.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True,
        low_cpu_mem_usage=False,
    )
    .to(device)
    .eval()
)
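
# Optional (untested assumption, not part of the original card): on GPUs with bf16
# support you could try a lower-precision load to reduce memory use, e.g.
# model = (
#     AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True, torch_dtype=torch.bfloat16)
#     .to(device)
#     .eval()
# )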

# ────── helpers ──────
def _trim_tail(text: str) -> str:
    """Remove an incomplete trailing sentence fragment, if any."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not text.rstrip().endswith("."):
        sentences = sentences[:-1]            # drop dangling fragment
    return ". ".join(sentences) + ("." if sentences else "")

def caption_image(img: Image.Image, max_len: int = 77) -> str:
    """Generate a caption for one PIL image."""
    with torch.no_grad():
        raw = model([img], max_len).generated_text[0]
    return _trim_tail(raw)

def caption_from_numpy(arr: np.ndarray, max_len: int = 77) -> str:
    """
    Wrapper for NumPy arrays.
    Accepts uint8 [0, 255] or float [0, 1] ranges.
    """
    if arr.dtype != np.uint8:
        arr = (np.clip(arr, 0, 1) * 255).astype(np.uint8)
    return caption_image(Image.fromarray(arr, mode="RGB"), max_len)

3 · Quick Test

# caption a remote sample image (cat photo) in one cell

import io, requests
from PIL import Image
from IPython.display import display  # Jupyter/Colab only

IMG_URL = "https://huggingface.co/datasets/huggingface/cats-image/resolve/main/cats_image.jpeg"

# download & open
img = Image.open(io.BytesIO(requests.get(IMG_URL, timeout=10).content)).convert("RGB")

display(img)                    # show the image
print(caption_image(img))       # generate and print the caption
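
To caption your own files instead of the remote sample, the same helper works on local images (the paths below are placeholders):

# caption local images in a loop; replace the paths with your own files
for path in ["photo_1.jpg", "photo_2.jpg"]:
    img = Image.open(path).convert("RGB")
    print(f"{path}: {caption_image(img)}")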

4 · Citation

@article{zhang2025vision,
  title   = {Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author  = {Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan and Wei, Chen and Xiao, Junfei},
  journal = {arXiv preprint arXiv:2507.07104},
  year    = {2025}
}