---
license: mit
language:
- uk
base_model:
- laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
library_name: open_clip
datasets:
- turuta/Multi30k-uk
pipeline_tag: zero-shot-image-classification
---

## Model Details

- **Base Model:** [`laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k`](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)
- **Architecture:** Vision Transformer (ViT-H/14) image encoder + XLM-RoBERTa Large text encoder
- **Languages:** Multilingual, with a focus on Ukrainian
- **Developed by:** Yurii Laba and Volodymyr Mudriy, in affiliation with the Ukrainian Catholic University

This is an OpenCLIP model fine-tuned with synonym-augmented captions to improve embedding stability for Ukrainian text-to-image retrieval under synonym substitution.

### Data & Training

Fine-tuning was performed on the Multi30K-Ukrainian training set, extended with synonym-augmented captions, and optimized with the CLIP contrastive loss.

- **Augmentation:** Each caption was expanded into several variants by substituting exactly one noun with a context-aware synonym.
- **Synonym Generation:** Synonyms were produced with GPT-4o, ensuring semantic, morphological, and grammatical correctness.
- **Images:** The paired image remained unchanged.
- **Final Corpus:** Original Multi30K-Ukrainian training set + synonym-augmented captions.

A minimal sketch of how such augmented training pairs can be assembled is shown after the evaluation section below.

### Evaluation: Ukrainian Text-to-Image Retrieval (Multi30K-Ukrainian Test Set)

| Model | Unpert. (UA) | Unpert. (ENG) | SSA-Dict | SSA-GPT-4o | SSA-Hybrid |
| :------------: | :---------------: | :---------------: | :---------------: | :---------------: | :---------------: |
| OpenCLIP | 32.1 / 54.3 | 41.6 / 65.7 | 7.6 / 39.3 | 10.9 / 44.0 | 16.8 / 49.0 |
| Synonym FT | 39.07 / 63.76 | 45.77 / 69.79 | 19.78 / 51.57 | 25.14 / 56.36 | 28.08 / 58.94 |

We evaluated the model on the Multi30K-Ukrainian test set, comparing the baseline OpenCLIP model with our synonym fine-tuned variant. Performance is reported as **HIT@1 / HIT@5** (higher is better): the proportion of queries for which the correct image appears among the top-1 and top-5 retrieved results (a minimal HIT@K sketch follows the list below).

- **Unperturbed (UA):** Original Ukrainian captions.
- **Unperturbed (ENG):** Original English captions (baseline).
- **SSA-Dict:** Synonym Substitution Attack using dictionary-based synonyms.
- **SSA-GPT-4o:** Synonym Substitution Attack using GPT-4o-generated synonyms.
- **SSA-Hybrid:** Mixed attack combining dictionary and GPT-4o synonyms.
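For concreteness, here is a minimal, illustrative sketch of how HIT@K can be computed for text-to-image retrieval from L2-normalized embeddings. The actual evaluation script is not part of this card; the function name `hit_at_k` and the one-caption-per-image assumption are illustrative only.

```python
import torch

def hit_at_k(text_embeds: torch.Tensor, image_embeds: torch.Tensor, k: int) -> float:
    """HIT@K for text-to-image retrieval.

    Assumes row i of `text_embeds` describes row i of `image_embeds`
    and that both tensors are L2-normalized (as in the usage example below).
    """
    # Cosine similarity between every caption and every candidate image
    sims = text_embeds @ image_embeds.T                   # (N_text, N_image)
    topk = sims.topk(k, dim=-1).indices                   # k most similar images per caption
    targets = torch.arange(sims.size(0)).unsqueeze(-1)    # ground-truth image index per caption
    return (topk == targets).any(dim=-1).float().mean().item()

# Example: HIT@1 and HIT@5 over pre-computed embeddings
# hit1 = hit_at_k(text_features, image_features, k=1)
# hit5 = hit_at_k(text_features, image_features, k=5)
```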
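As described in Data & Training above, each original caption is paired with several synonym-substituted variants while its image stays fixed. The sketch below shows one way such (image, caption) training pairs could be assembled; the helper `generate_noun_synonym_variants` is hypothetical and stands in for the GPT-4o step that replaces exactly one noun with a context-aware synonym.

```python
from dataclasses import dataclass

@dataclass
class TrainPair:
    image_path: str
    caption: str

def build_augmented_corpus(samples, generate_noun_synonym_variants, n_variants=3):
    """Expand each (image, caption) pair with synonym-substituted captions.

    `samples` is an iterable of (image_path, caption) tuples from the original
    Multi30K-Ukrainian training set; `generate_noun_synonym_variants` is a
    hypothetical callable returning caption variants in which exactly one noun
    has been replaced by a context-aware synonym.
    """
    corpus = []
    for image_path, caption in samples:
        # Keep the original pair ...
        corpus.append(TrainPair(image_path, caption))
        # ... and add the synonym-augmented variants; the image is unchanged.
        for variant in generate_noun_synonym_variants(caption, n_variants):
            corpus.append(TrainPair(image_path, variant))
    return corpus
```

The resulting corpus is then used for standard CLIP contrastive fine-tuning, where each caption (original or augmented) is contrasted against the images in its batch.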
### Usage Example

To use this model:

1. **Download the checkpoint:** [`ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt`](https://huggingface.co/lang-uk/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/tree/main)
2. **Load it in OpenCLIP:**

```python
import torch
from PIL import Image
import open_clip

# Path to the downloaded checkpoint
pretrained_path = "ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt"

# Load model & preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-large-ViT-H-14', pretrained=pretrained_path
)
model.eval()
tokenizer = open_clip.get_tokenizer('xlm-roberta-large-ViT-H-14')

# Example inputs
image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

# Encode & normalize
with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity (softmax over the text candidates)
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

## Citation

If you use this model in your work, please cite:

TODO: will be added after EMNLP 2025