---
license: apache-2.0
tags:
  - image-captioning
  - multimodal
  - vision-language
  - diffusion
  - pytorch
  - transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
  - conceptual_captions
  - coco
model_type: VLV_decoder
---

# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. It combines a Stable Diffusion-based image encoder with the Qwen2.5-3B language model to generate descriptive captions from images.

## Model Description

The VLV Captioner is a multimodal model that:

- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images

## Model Architecture

- Vision Encoder: Stable Diffusion-based image encoder with Florence2 components
- Language Model: Qwen2.5-3B transformer model
- Image Size: 384x384 pixels (see the preprocessing sketch below)
- Max Caption Length: 300 tokens
- Precision: Mixed precision (bfloat16/float32)
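
As a minimal illustration of the 384x384 input size and mixed precision listed above, the sketch below resizes a PIL image and casts it to bfloat16. Note that the usage examples pass PIL images to the model directly; the exact resize and normalization applied internally by this model are an assumption, not taken from its code.

```python
# Preprocessing sketch (assumption: plain resize + tensor conversion;
# the model's own processor may crop/normalize differently).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),  # matches the documented 384x384 input size
    transforms.ToTensor(),          # HWC uint8 -> CHW float in [0, 1]
])

image = Image.open("path/to/your/image.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)   # shape: (1, 3, 384, 384)
pixel_values = pixel_values.to(torch.bfloat16)  # mixed precision, per the card
print(pixel_values.shape, pixel_values.dtype)
```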

## Usage

### Method 1: Load from Hugging Face Hub

```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: Set custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner", 
    trust_remote_code=True, 
    token=token, 
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner", 
        trust_remote_code=True, 
        token=token, 
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: Avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")
    
    # Load and process an image
    image = Image.open("path/to/your/image.jpg")
    
    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")
    
    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)
        
        # Handle different possible output formats
        if hasattr(captions, 'generated_text'):
            print("Generated caption:", captions.generated_text[0])
        elif isinstance(captions, list):
            print("Generated caption:", captions[0])
        else:
            print("Generated caption:", captions)
            
except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, try clearing cache and redownloading
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)
    
    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner", 
        trust_remote_code=True, 
        token=token, 
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```
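
Once loaded, the same model object can caption several images in one call. This sketch reuses the list-based call shown above with hypothetical file names; the return value is handled defensively because, as in the example above, it may be a list of strings or an object with a `generated_text` attribute.

```python
# Caption several images in one call, reusing the `model` loaded above.
from PIL import Image
import torch

image_paths = ["cat.jpg", "beach.jpg", "street.jpg"]  # hypothetical paths
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    outputs = model(images, max_length=300)

# Normalize the output format, as in the example above
texts = outputs.generated_text if hasattr(outputs, "generated_text") else outputs
for path, text in zip(image_paths, texts):
    print(f"{path}: {text}")
```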

### Method 2: Load from original checkpoint

```python
from VLV_stage2 import VLV_MODEL
from PIL import Image
import torch

# Load from original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
    print(captions.generated_text[0])  # Generated caption
```

## Model Details

- Model Type: Vision-Language Model
- Architecture: VLV_decoder
- Language Backbone: Qwen/Qwen2.5-3B
- Vision Backbone: Stable Diffusion + Florence2
- Training Data: Various image-caption datasets
- Framework: PyTorch, Transformers

## Training Configuration

- Batch Size: 1 (inference)
- Learnable Token Length: 77
- Guidance Scale: 7.5
- Inference Steps: 50
- Beam Search: 4 beams

## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```
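
A quick sanity check after installation (standard imports only; this card does not pin version requirements):

```python
# Verify the key dependencies import and report whether a GPU is visible.
import torch
import transformers
import diffusers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())
```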

## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues

If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:

```python
# ❌ Don't use this - can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```

#### 2. Cache Issues

If you experience corrupted cache files, clear the cache and redownload:

```python
import shutil
import os
from transformers import AutoModel

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```

#### 3. Authentication Issues

Make sure your Hugging Face token is properly set:

```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```
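
You can also log in from Python with `huggingface_hub` (installed as a dependency of `transformers`); the token value here is a placeholder:

```python
# Programmatic login; the token string is a placeholder, not a real credential.
import os
from huggingface_hub import login

login(token=os.getenv("HUGGINGFACE_TOKEN"))  # or pass your token string directly
```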

#### 4. Memory Issues

For large models, use a custom cache directory with sufficient space:

```python
import os
from transformers import AutoModel

cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```
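
If GPU memory is tight, loading the weights in bfloat16 roughly halves the footprint compared to float32. This follows the mixed-precision (bfloat16/float32) note above, though whether every submodule of this model runs correctly in pure bfloat16 is an assumption; fall back to float32 if you see numerical issues.

```python
# Load in bfloat16 to reduce memory (assumption: the model tolerates bf16 weights,
# consistent with the mixed-precision note in this card).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/vlv-captioner",  # repo id placeholder used throughout this card
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")
```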

## Advanced Usage

### Batch Processing with Original Inference Script

For large-scale inference, you can use the inference script from the original training code:

```bash
python Caption_inference.py \
  --input_path /path/to/images \
  --output_path captions.json \
  --clip_decoder_checkpoint /path/to/model.pt \
  --qwen_model Qwen/Qwen2.5-3B \
  --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
  --florence2_model_path microsoft/Florence-2-large \
  --batch_size 4 \
  --max_length 300 \
  --num_beams 4 \
  --image_size 384 \
  --guidance_scale 7.5 \
  --use_text_encoder \
  --distributed  # For multi-GPU inference
```
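
After the script finishes, the generated captions can be inspected from the output file. The layout assumed below (a dict keyed by image path, or a list of image/caption records) is a guess about `captions.json`, not documented here, so adjust the access pattern to the actual structure:

```python
# Inspect the generated captions (assumption: captions.json maps image paths to
# caption strings, or is a list of {"image": ..., "caption": ...} records).
import json

with open("captions.json") as f:
    results = json.load(f)

if isinstance(results, dict):
    items = results.items()
else:  # list of records
    items = ((r.get("image"), r.get("caption")) for r in results)

for image, caption in list(items)[:5]:
    print(image, "->", caption)
```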

### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)

## Citation

```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint},
  year={2024}
}
```

## License

This model is released under the Apache 2.0 license.