BayesVLM-Base

Project: https://aaltoml.github.io/BayesVLM/
Paper: https://arxiv.org/abs/2412.06014
GitHub: https://github.com/AaltoML/BayesVLM

Model summary

BayesVLM-Base is a post-hoc probabilistic version of CLIP ViT-B/32. It augments the standard CLIP image and text embeddings with estimated covariances and returns the mean, variance, and standard deviation of both embeddings and logits. The base CLIP weights are unchanged; only additional covariance buffers are added.

Model details

  • Architecture: CLIP ViT-B/32 with projection dimension 512 and BayesVLM covariance buffers for the text and vision projections (see the sketch after this list).
  • Base model: laion/CLIP-ViT-B-32-laion2B-s34B-b79K.
  • Outputs: text_embeds, image_embeds, logits_per_image, plus corresponding variance and standard deviation fields.
  • Library: transformers with custom model code.
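
The extra state can be inspected directly. A minimal sketch, assuming the Kronecker factors are registered as buffers whose names contain A_inv and B_inv (see How it works below); the exact names may differ in the repository's modeling code:

from transformers import AutoModel

model = AutoModel.from_pretrained("aalto-ml/BayesVLM-Base", trust_remote_code=True)

# Print the covariance buffers added on top of the frozen CLIP weights.
for name, buf in model.named_buffers():
    if "A_inv" in name or "B_inv" in name:
        print(name, tuple(buf.shape))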

How it works

BayesVLM uses a post-hoc Laplace-style approximation around the projection layers of the CLIP model. The covariance is represented with Kronecker-factorized terms (stored as A_inv and B_inv), and inference uses these buffers to estimate uncertainty over embedding and logit outputs.
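
Concretely, if the posterior over a projection weight W has covariance Cov(vec(W)) ≈ A_inv ⊗ B_inv, with A_inv the input-side factor and B_inv the output-side factor, then the output covariance of y = W x has the closed form Cov(y) = (xᵀ A_inv x) · B_inv. Below is a minimal sketch of that propagation, using placeholder factors and assuming this Kronecker convention; it is illustrative, not the repository's code:

import torch

def projection_variance(x, A_inv, B_inv):
    # Cov(W x) = (x^T A_inv x) * B_inv when Cov(vec(W)) = A_inv kron B_inv,
    # so the per-coordinate variance is that scalar times diag(B_inv).
    scale = x @ A_inv @ x                # scalar x^T A_inv x
    return scale * torch.diagonal(B_inv)

in_dim, out_dim = 768, 512               # ViT-B/32 trunk width -> projection dim
x = torch.randn(in_dim)
A_inv = 1e-3 * torch.eye(in_dim)         # illustrative factors only
B_inv = 1e-3 * torch.eye(out_dim)
var = projection_variance(x, A_inv, B_inv)   # shape (out_dim,)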

Intended use

  • Zero-shot image classification and retrieval with uncertainty estimates.
  • Uncertainty-aware downstream pipelines that consume CLIP embeddings.

Training data

This model is derived from the CLIP base model laion/CLIP-ViT-B-32-laion2B-s34B-b79K. BayesVLM itself is post-hoc and does not retrain CLIP weights. Covariance buffers are computed from Hessian-based approximations as described in the BayesVLM paper.

Evaluation

Evaluation results are reported in the BayesVLM paper. This model card does not include additional benchmarking.

Usage

from transformers import AutoModel, CLIPProcessor
import torch

# The covariance buffers and probabilistic outputs come from custom model code,
# so trust_remote_code=True is required.
model = AutoModel.from_pretrained("aalto-ml/BayesVLM-Base", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("aalto-ml/BayesVLM-Base")

inputs = processor(
    text=["a photo of a dog", "a photo of a cat"],
    images=[...],  # one or more PIL images
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

logits = outputs.logits_per_image          # mean logits, shape (n_images, n_texts)
logits_var = outputs.logits_per_image_var  # matching per-logit variance
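
To turn the logit mean and variance into a single uncertainty-aware prediction, one standard reduction is the probit approximation to the expected softmax, which shrinks each logit by its own variance. A minimal sketch; the exact predictive reduction used in the paper may differ:

import math
import torch

def probit_softmax(logit_mean, logit_var):
    # MacKay-style probit approximation: scale each logit by
    # 1 / sqrt(1 + pi/8 * var) so uncertain logits contribute less.
    kappa = 1.0 / torch.sqrt(1.0 + (math.pi / 8.0) * logit_var)
    return torch.softmax(kappa * logit_mean, dim=-1)

probs = probit_softmax(logits, logits_var)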

License

MIT.

Citation

@inproceedings{baumann2026bayesvlm,
  title     = {Post-hoc Probabilistic Vision-Language Models},
  author    = {Baumann, Anton and Li, Rui and Klasson, Marcus and Mentu, Santeri and Karthik, Shyamgopal and Akata, Zeynep and Solin, Arno and Trapp, Martin},
  booktitle = {International Conference on Learning Representations {(ICLR)}},
  year      = {2026},
}