---
license: mit
language:
- uk
base_model:
- laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
library_name: open_clip
datasets:
- turuta/Multi30k-uk
pipeline_tag: zero-shot-image-classification
---

## Model Details

- **Base Model:** [`laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k`](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)
- **Architecture:** Vision Transformer (ViT-H/14) image encoder + XLM-RoBERTa Large text encoder
- **Languages:** Multilingual, with a focus on Ukrainian
- **Developed by:** Yurii Laba and Volodymyr Mudriy, in affiliation with the Ukrainian Catholic University

This is an OpenCLIP model fine-tuned with synonym-augmented captions to improve embedding stability for Ukrainian text-to-image retrieval under synonym substitution.

### Data & Training

Fine-tuning was performed on the Multi30K-Ukrainian training set, extended with synonym-augmented captions, and optimized with the CLIP contrastive loss.

- **Augmentation:** Each caption was expanded into several variants by substituting exactly one noun with a context-aware synonym.
- **Synonym Generation:** Synonyms were produced with GPT-4o, ensuring semantic, morphological, and grammatical correctness.
- **Images:** The paired image remained unchanged.
- **Final Corpus:** Original Multi30K-Ukrainian training set + synonym-augmented captions.

A minimal sketch of how such augmented training pairs can be assembled is shown after the evaluation section below.

### Evaluation: Ukrainian Text-to-Image Retrieval (Multi30K-Ukrainian Test Set)

| Model | Unpert. (UA) | Unpert. (ENG) | SSA-Dict | SSA-GPT-4o | SSA-Hybrid |
| :------------: | :---------------: | :---------------: | :---------------: | :---------------: | :---------------: |
| OpenCLIP | 32.1 / 54.3 | 41.6 / 65.7 | 7.6 / 39.3 | 10.9 / 44.0 | 16.8 / 49.0 |
| Synonym FT | 39.07 / 63.76 | 45.77 / 69.79 | 19.78 / 51.57 | 25.14 / 56.36 | 28.08 / 58.94 |

We evaluated the model on the Multi30K-Ukrainian test set, comparing the baseline OpenCLIP model with our synonym fine-tuned variant. Performance is reported as **HIT@1 / HIT@5** (higher is better): the proportion of queries for which the correct image appears among the top-1 and top-5 retrieved results (a minimal HIT@K sketch follows the list below).

- **Unperturbed (UA):** Original Ukrainian captions.
- **Unperturbed (ENG):** Original English captions (baseline).
- **SSA-Dict:** Synonym Substitution Attack using dictionary-based synonyms.
- **SSA-GPT-4o:** Synonym Substitution Attack using GPT-4o-generated synonyms.
- **SSA-Hybrid:** Mixed attack combining dictionary and GPT-4o synonyms.
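For concreteness, here is a minimal, illustrative sketch of how HIT@K can be computed for text-to-image retrieval from L2-normalized embeddings. The actual evaluation script is not part of this card; the function name `hit_at_k` and the one-caption-per-image assumption are illustrative only.

```python
import torch

def hit_at_k(text_embeds: torch.Tensor, image_embeds: torch.Tensor, k: int) -> float:
    """HIT@K for text-to-image retrieval.

    Assumes row i of `text_embeds` describes row i of `image_embeds`
    and that both tensors are L2-normalized (as in the usage example below).
    """
    # Cosine similarity between every caption and every candidate image
    sims = text_embeds @ image_embeds.T                   # (N_text, N_image)
    topk = sims.topk(k, dim=-1).indices                   # k most similar images per caption
    targets = torch.arange(sims.size(0)).unsqueeze(-1)    # ground-truth image index per caption
    return (topk == targets).any(dim=-1).float().mean().item()

# Example: HIT@1 and HIT@5 over pre-computed embeddings
# hit1 = hit_at_k(text_features, image_features, k=1)
# hit5 = hit_at_k(text_features, image_features, k=5)
```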
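As described in Data & Training above, each original caption is paired with several synonym-substituted variants while its image stays fixed. The sketch below shows one way such (image, caption) training pairs could be assembled; the helper `generate_noun_synonym_variants` is hypothetical and stands in for the GPT-4o step that replaces exactly one noun with a context-aware synonym.

```python
from dataclasses import dataclass

@dataclass
class TrainPair:
    image_path: str
    caption: str

def build_augmented_corpus(samples, generate_noun_synonym_variants, n_variants=3):
    """Expand each (image, caption) pair with synonym-substituted captions.

    `samples` is an iterable of (image_path, caption) tuples from the original
    Multi30K-Ukrainian training set; `generate_noun_synonym_variants` is a
    hypothetical callable returning caption variants in which exactly one noun
    has been replaced by a context-aware synonym.
    """
    corpus = []
    for image_path, caption in samples:
        # Keep the original pair ...
        corpus.append(TrainPair(image_path, caption))
        # ... and add the synonym-augmented variants; the image is unchanged.
        for variant in generate_noun_synonym_variants(caption, n_variants):
            corpus.append(TrainPair(image_path, variant))
    return corpus
```

The resulting corpus is then used for standard CLIP contrastive fine-tuning, where each caption (original or augmented) is contrasted against the images in its batch.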
### Usage Example

To use this model:

1. **Download the checkpoint:** [`ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt`](https://huggingface.co/lang-uk/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/tree/main)
2. **Load it in OpenCLIP:**

```python
import torch
from PIL import Image
import open_clip

# Path to the downloaded checkpoint
pretrained_path = "ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/ukr-clip-vit-h-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.pt"

# Load model & preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-large-ViT-H-14', pretrained=pretrained_path
)
model.eval()
tokenizer = open_clip.get_tokenizer('xlm-roberta-large-ViT-H-14')

# Example inputs
image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

# Encode & normalize
with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity (softmax over the text candidates)
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

## Citation

If you use this model in your work, please cite:

TODO: will be added after EMNLP 2025