BayesVLM-Base
Project: https://aaltoml.github.io/BayesVLM/
Paper: https://arxiv.org/abs/2412.06014
GitHub: https://github.com/AaltoML/BayesVLM
Model summary
BayesVLM-Base is a post-hoc probabilistic version of CLIP ViT-B/32. It augments the standard CLIP image and text embeddings with estimated covariances and returns the mean, variance, and standard deviation of both embeddings and logits. The base CLIP weights are unchanged; the model only adds covariance buffers.
Model details
- Architecture: CLIP ViT-B/32 with projection dimension 512 and BayesVLM covariance buffers for text and vision projections.
- Base model: `laion/CLIP-ViT-B-32-laion2B-s34B-b79K`.
- Outputs: `text_embeds`, `image_embeds`, `logits_per_image`, plus corresponding variance and standard deviation fields.
- Library: `transformers` with custom model code.
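Once the model is loaded (see Usage below), the added covariance buffers can be inspected with standard `torch` tooling. This is a minimal sketch; the `A_inv`/`B_inv` name filter is an assumption based on the factor names described in the next section:

```python
# Minimal inspection sketch: list the covariance buffers that BayesVLM adds
# on top of the frozen CLIP weights. The "A_inv"/"B_inv" name filter is an
# assumption based on the Kronecker factors described in "How it works".
for name, buf in model.named_buffers():
    if "A_inv" in name or "B_inv" in name:
        print(name, tuple(buf.shape))
```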
How it works
BayesVLM uses a post-hoc Laplace-style approximation around the projection layers of the CLIP model. The covariance is represented with Kronecker-factorized terms (stored as `A_inv` and `B_inv`), and inference uses these buffers to estimate uncertainty over the embedding and logit outputs.
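To make the Kronecker factorization concrete, the sketch below (an illustrative simplification, not the repository's implementation) propagates a posterior covariance Cov[vec(W)] ≈ A_inv ⊗ B_inv through a linear projection f = W x, which gives the closed-form per-dimension variance (xᵀ A_inv x) · diag(B_inv):

```python
import torch

def projection_variance(x, A_inv, B_inv):
    """Illustrative sketch: variance of f = W x when
    vec(W) ~ N(vec(W_bar), A_inv ⊗ B_inv), with A over the input
    dimensions and B over the output dimensions (an assumed convention).

    x:     (d_in,)        input activation
    A_inv: (d_in, d_in)   input-side Kronecker factor
    B_inv: (d_out, d_out) output-side Kronecker factor
    Returns the per-dimension variance of f, shape (d_out,).
    """
    # Cov[f] = (x^T A_inv x) * B_inv, so the marginal variances are the
    # diagonal of B_inv scaled by an input-dependent quadratic form.
    scale = x @ A_inv @ x
    return scale * torch.diagonal(B_inv)
```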
Intended use
- Zero-shot image classification and retrieval with uncertainty estimates.
- Uncertainty-aware downstream pipelines that consume CLIP embeddings.
Training data
This model is derived from the CLIP base model `laion/CLIP-ViT-B-32-laion2B-s34B-b79K`. BayesVLM itself is post-hoc and does not retrain the CLIP weights; the covariance buffers are computed from Hessian-based approximations, as described in the BayesVLM paper.
Evaluation
Evaluation results are reported in the BayesVLM paper. This model card does not include additional benchmarking.
Usage
```python
from transformers import AutoModel, CLIPProcessor
import torch

model = AutoModel.from_pretrained("aalto-ml/BayesVLM-Base", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("aalto-ml/BayesVLM-Base")

inputs = processor(
    text=["a photo of a dog", "a photo of a cat"],
    images=[...],  # a list of PIL images
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

logits = outputs.logits_per_image          # posterior mean logits
logits_var = outputs.logits_per_image_var  # per-logit variance
```
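A common way to consume the logit variances (a standard trick, not necessarily the paper's exact predictive) is the probit approximation, which tempers the softmax by the per-logit variance:

```python
import math

# Uncertainty-adjusted class probabilities via the probit approximation:
# logits with higher variance are pulled toward a more uniform prediction.
probs = torch.softmax(
    logits / torch.sqrt(1.0 + (math.pi / 8.0) * logits_var),
    dim=-1,
)
```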
License
MIT.
Citation
```bibtex
@inproceedings{baumann2026bayesvlm,
  title     = {Post-hoc Probabilistic Vision-Language Models},
  author    = {Baumann, Anton and Li, Rui and Klasson, Marcus and Mentu, Santeri and Karthik, Shyamgopal and Akata, Zeynep and Solin, Arno and Trapp, Martin},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}
```