X-CLIP (base, patch 32): acaua mirror

An MIT-licensed mirror hosted under the CondadosAI/ namespace for use with the acaua computer vision library.

This is a safetensors-only mirror of the upstream Microsoft weights at the pinned commit shown below. Upstream ships a legacy pytorch_model.bin (pickle format) alongside model.safetensors; it has been deliberately removed here for security hygiene, since loading a pickle can execute arbitrary code. Because transformers automatically prefers safetensors when both formats are present, removing the .bin has no functional impact on downstream users.
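
Downstream loaders can also opt in to safetensors-only loading explicitly. A minimal sketch using the standard transformers use_safetensors flag; against this mirror it is a no-op safeguard:

from transformers import XCLIPModel

# With use_safetensors=True, loading fails loudly if a safetensors
# checkpoint were ever missing, instead of silently falling back to pickle.
model = XCLIPModel.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    use_safetensors=True,
)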

X-CLIP is a zero-shot video classification model: you supply a list of candidate text labels at inference time and the model ranks them by similarity to the video clip. It is not a closed-set softmax classifier, and it is not registered under AutoModelForVideoClassification; load it with XCLIPModel as shown below.
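
Concretely, "ranking by similarity" means scoring one text embedding per candidate label against a single video embedding. A minimal sketch, assuming a loaded XCLIPModel and processor outputs (pixel_values, input_ids, attention_mask) as produced in the usage sections below; it omits X-CLIP's video-conditioned prompt refinement, which the full forward pass applies:

import torch

with torch.no_grad():
    video_emb = model.get_video_features(pixel_values=pixel_values)  # (1, d)
    text_embs = model.get_text_features(
        input_ids=input_ids, attention_mask=attention_mask
    )  # (num_labels, d)

# Cosine similarity: dot products of L2-normalized embeddings,
# yielding one score per candidate label.
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
scores = (video_emb @ text_embs.T).squeeze(0)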

Provenance

  • Upstream repo: microsoft/xclip-base-patch32
  • Upstream commit SHA: a2e27a78a2b5d802e894b8a1ef14f3a8ce490963
  • Upstream commit date: 2024-02-04
  • Declared license: MIT
  • Paper: Ni et al., "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV 2022, arXiv:2208.02816
  • Official code: microsoft/VideoX (MIT)
  • Mirrored on: 2026-04-23
  • Mirrored by: CondadosAI/acaua

Usage via acaua

import acaua

model = acaua.Model.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    allow_non_apache=True,  # weights are MIT, not Apache-2.0
)
result = model.predict(
    "dance.mp4",
    labels=["dancing", "cooking", "running", "sleeping", "walking"],
    top_k=3,
)
for label, score in zip(result.labels, result.scores.tolist()):
    print(f"{label}: {score:.3f}")

Usage via 🤗 Transformers

This mirror is drop-in compatible with the upstream repo.

from transformers import XCLIPModel, XCLIPProcessor

processor = XCLIPProcessor.from_pretrained("CondadosAI/xclip_base_patch32")
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32")
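
The snippet above only loads the weights. An end-to-end sketch follows, with random placeholder frames standing in for a real decoded clip and a label list chosen purely for illustration:

import numpy as np
import torch

# 8 RGB frames of 224x224; replace with frames decoded from a real video.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
labels = ["dancing", "cooking", "running"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One probability per candidate label.
probs = outputs.logits_per_video.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))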

Expected input

  • Frames: 8 uniformly-sampled frames per clip (vision_config.num_frames=8); see the sampling sketch after this list.
  • Resolution: 224 × 224 after resize and center-crop.
  • Normalization: ImageNet mean/std (handled by XCLIPProcessor).
  • Text prompts: supplied at inference time; any natural-language strings.
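
A minimal sketch of the uniform frame sampling assumed above, using pure index arithmetic; decoding the video itself is left to whatever frame reader you use:

import numpy as np

def sample_frame_indices(total_frames: int, clip_len: int = 8) -> np.ndarray:
    # Evenly spaced frame indices spanning the whole clip, endpoints included.
    return np.linspace(0, total_frames - 1, num=clip_len).round().astype(int)

sample_frame_indices(300)
# array([  0,  43,  85, 128, 171, 214, 256, 299])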

License and attribution

Redistributed under MIT, consistent with the upstream declaration. See NOTICE for required attribution.

Citation

@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and Peng, Houwen and Chen, Minghao and Zhang, Songyang and Meng, Gaofeng and Fu, Jianlong and Xiang, Shiming and Ling, Haibin},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={1--18},
  year={2022},
  publisher={Springer}
}