---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---
# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. The model combines a Stable Diffusion-based image encoder with the Qwen2.5-3B language model to generate descriptive captions from images.
## Model Description
The VLV Captioner is a multimodal model that:
- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images
## Model Architecture
- Vision Encoder: Stable Diffusion-based image encoder with Florence2 components
- Language Model: Qwen2.5-3B transformer model
- Image Size: 384x384 pixels
- Max Caption Length: 300 tokens
- Precision: Mixed precision (bfloat16/float32)
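
To make the data flow concrete, here is a minimal conceptual sketch of how an image becomes a caption. It is illustrative only: the attribute names (`preprocess`, `vision_encoder`, `language_model`, `tokenizer`) are assumptions for exposition, not the actual module layout of the released code.

```python
import torch

def caption_image(model, image, max_length=300, num_beams=4):
    # 1. Resize/normalize the image to 384x384 (hypothetical preprocessing helper)
    pixel_values = model.preprocess(image)

    # 2. The diffusion-based vision encoder distills the image into
    #    77 learnable tokens (continuous embeddings)
    vision_tokens = model.vision_encoder(pixel_values)  # (1, 77, hidden_dim)

    # 3. Qwen2.5-3B decodes those embeddings into a natural-language caption
    output_ids = model.language_model.generate(
        inputs_embeds=vision_tokens,
        max_length=max_length,
        num_beams=num_beams,
    )
    return model.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```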
## Usage

### Method 1: Load from Hugging Face Hub
```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: Set custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: Avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, try clearing cache and redownloading
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)
    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```
### Method 2: Load from original checkpoint
```python
import torch
from PIL import Image

from VLV_stage2 import VLV_MODEL

# Load from original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
print(captions.generated_text[0])  # Generated caption
```
## Model Details
- Model Type: Vision-Language Model
- Architecture: VLV_decoder
- Language Backbone: Qwen/Qwen2.5-3B
- Vision Backbone: Stable Diffusion + Florence2
- Training Data: Image-caption datasets, including Conceptual Captions and COCO
- Framework: PyTorch, Transformers
## Training Configuration
- Batch Size: 1 (inference)
- Learnable Token Length: 77
- Guidance Scale: 7.5
- Inference Steps: 50
- Beam Search: 4 beams
## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```
## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues
If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:
```python
import torch
from transformers import AutoModel

# ❌ Don't use this - can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```
#### 2. Cache Issues
If you experience corrupted cache files, clear the cache and redownload:
```python
import os
import shutil

from transformers import AutoModel

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```
#### 3. Authentication Issues
Make sure your Hugging Face token is properly set:
```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```
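
You can also authenticate programmatically with `huggingface_hub` (installed alongside `transformers`):

```python
from huggingface_hub import login

# Stores the token so subsequent from_pretrained calls are authenticated
login(token="your_token_here")
```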
#### 4. Memory Issues
For large models, use a custom cache directory with sufficient space:
```python
import os
from transformers import AutoModel

cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```
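
If the float32 weights still do not fit in memory, loading in bfloat16 can roughly halve the footprint. The model uses mixed bfloat16/float32 precision, but whether every custom submodule runs cleanly in bfloat16 is an assumption you should verify:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumption: the custom code tolerates bf16; fall back to float32 on dtype errors
    low_cpu_mem_usage=True,
)
```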
## Advanced Usage

### Batch Processing with the Original Inference Script
For large-scale inference, you can use the inference script from the original training repository:
```bash
python Caption_inference.py \
    --input_path /path/to/images \
    --output_path captions.json \
    --clip_decoder_checkpoint /path/to/model.pt \
    --qwen_model Qwen/Qwen2.5-3B \
    --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
    --florence2_model_path microsoft/Florence-2-large \
    --batch_size 4 \
    --max_length 300 \
    --num_beams 4 \
    --image_size 384 \
    --guidance_scale 7.5 \
    --use_text_encoder \
    --distributed  # For multi-GPU inference
```
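
If you prefer to stay in Python rather than using the CLI script, a minimal batched loop over a directory of images could look like the sketch below. It reuses the Hub-loaded `model` from Method 1 and assumes the forward call accepts a list with more than one image, which is not shown explicitly in the examples above.

```python
import json
import os

import torch
from PIL import Image

image_dir = "/path/to/images"   # same folder you would pass as --input_path
batch_size = 4

paths = sorted(
    p for p in os.listdir(image_dir)
    if p.lower().endswith((".jpg", ".jpeg", ".png"))
)

results = {}
for i in range(0, len(paths), batch_size):
    batch_paths = paths[i:i + batch_size]
    images = [Image.open(os.path.join(image_dir, p)).convert("RGB") for p in batch_paths]
    with torch.no_grad():
        captions = model(images, max_length=300)  # assumes the list call from Method 1 accepts >1 image
    texts = captions.generated_text if hasattr(captions, "generated_text") else list(captions)
    results.update(dict(zip(batch_paths, texts)))

with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)
```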
### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)
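
These values come from the remote configuration. A quick, hedged way to check what the loaded config actually exposes (the attribute names mirror the list above but are assumptions about the remote code):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("your-username/vlv-captioner", trust_remote_code=True)

# getattr with a default avoids AttributeError if a field is named differently
for name in ("image_size", "guidance_scale", "learnable_token_length",
             "max_length", "num_beams", "use_text_encoder"):
    print(name, "=", getattr(config, name, "not present"))
```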
## Citation
```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint},
  year={2024}
}
```
## License
This model is released under the Apache 2.0 license.