BayesVLM-Base
Project: https://aaltoml.github.io/BayesVLM/
Paper: https://arxiv.org/abs/2412.06014
GitHub: https://github.com/AaltoML/BayesVLM
Model summary
BayesVLM-Base is a post-hoc probabilistic version of CLIP ViT-B/32. It augments the standard CLIP image and text embeddings with estimated covariances and returns the mean, variance, and standard deviation of both embeddings and logits. The base CLIP weights are unchanged; the model only adds covariance buffers.
Model details
- Architecture: CLIP ViT-B/32 with projection dimension 512 and BayesVLM covariance buffers for text and vision projections.
- Base model: `laion/CLIP-ViT-B-32-laion2B-s34B-b79K`.
- Outputs: `text_embeds`, `image_embeds`, `logits_per_image`, plus corresponding variance and standard deviation fields.
- Library: `transformers` with custom model code.
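Once the model is loaded (see Usage below), the added covariance buffers can be inspected with standard `torch` tooling. This is a minimal sketch; the `A_inv`/`B_inv` name filter is an assumption based on the factor names described in the next section:

```python
# Minimal inspection sketch: list the covariance buffers that BayesVLM adds
# on top of the frozen CLIP weights. The "A_inv"/"B_inv" name filter is an
# assumption based on the Kronecker factors described in "How it works".
for name, buf in model.named_buffers():
    if "A_inv" in name or "B_inv" in name:
        print(name, tuple(buf.shape))
```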
How it works
BayesVLM uses a post-hoc Laplace-style approximation around the projection layers of the CLIP model. The covariance is represented with Kronecker-factorized terms (stored as `A_inv` and `B_inv`), and inference uses these buffers to estimate uncertainty over the embedding and logit outputs.
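To make the Kronecker factorization concrete, the sketch below (an illustrative simplification, not the repository's implementation) propagates a posterior covariance Cov[vec(W)] ≈ A_inv ⊗ B_inv through a linear projection f = W x, which gives the closed-form per-dimension variance (xᵀ A_inv x) · diag(B_inv):

```python
import torch

def projection_variance(x, A_inv, B_inv):
    """Illustrative sketch: variance of f = W x when
    vec(W) ~ N(vec(W_bar), A_inv ⊗ B_inv), with A over the input
    dimensions and B over the output dimensions (an assumed convention).

    x:     (d_in,)        input activation
    A_inv: (d_in, d_in)   input-side Kronecker factor
    B_inv: (d_out, d_out) output-side Kronecker factor
    Returns the per-dimension variance of f, shape (d_out,).
    """
    # Cov[f] = (x^T A_inv x) * B_inv, so the marginal variances are the
    # diagonal of B_inv scaled by an input-dependent quadratic form.
    scale = x @ A_inv @ x
    return scale * torch.diagonal(B_inv)
```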
Intended use
- Zero-shot image classification and retrieval with uncertainty estimates.
- Uncertainty-aware downstream pipelines that consume CLIP embeddings.
Training data
This model is derived from the CLIP base model `laion/CLIP-ViT-B-32-laion2B-s34B-b79K`. BayesVLM itself is post-hoc and does not retrain the CLIP weights; the covariance buffers are computed from Hessian-based approximations, as described in the BayesVLM paper.
Evaluation
Evaluation results are reported in the BayesVLM paper. This model card does not include additional benchmarking.
Usage
```python
from transformers import AutoModel, CLIPProcessor
import torch

model = AutoModel.from_pretrained("aalto-ml/BayesVLM-Base", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("aalto-ml/BayesVLM-Base")

inputs = processor(
    text=["a photo of a dog", "a photo of a cat"],
    images=[...],  # a list of PIL images
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

logits = outputs.logits_per_image          # posterior mean logits
logits_var = outputs.logits_per_image_var  # per-logit variance
```
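A common way to consume the logit variances (a standard trick, not necessarily the paper's exact predictive) is the probit approximation, which tempers the softmax by the per-logit variance:

```python
import math

# Uncertainty-adjusted class probabilities via the probit approximation:
# logits with higher variance are pulled toward a more uniform prediction.
probs = torch.softmax(
    logits / torch.sqrt(1.0 + (math.pi / 8.0) * logits_var),
    dim=-1,
)
```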
License
MIT.
Citation
```bibtex
@inproceedings{baumann2026bayesvlm,
  title     = {Post-hoc Probabilistic Vision-Language Models},
  author    = {Baumann, Anton and Li, Rui and Klasson, Marcus and Mentu, Santeri and Karthik, Shyamgopal and Akata, Zeynep and Solin, Arno and Trapp, Martin},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}
```