DistillCLIP

This model is a distilled version of CLIP-ViT-B/32, trained by distillation on Conceptual Captions 3M. It achieves the following results on the evaluation set:

  • Loss: 0.0064
  • Intra-modal Loss: 0.0056
  • Inter-modal Loss: 0.0008

Model description

DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was CLIP-ViT-B/32.

The knowledge distillation scheme is as follows:

CLIP is distilled with two losses, $L_{inter}$ and $L_{intra}$. These respectively distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps with MSE losses. The final distillation loss is the sum of the two: $L = L_{inter} + L_{intra}$.
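As a minimal sketch of these losses (PyTorch; the function and variable names here are hypothetical, and embeddings are assumed L2-normalized so that dot products form cosine-similarity maps):

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_img, t_txt, s_img, s_txt):
    """Teacher (t_*) and student (s_*) embeddings of shape (batch, dim),
    assumed L2-normalized so X @ Y.T is a cosine-similarity map."""
    # Inter-modal loss: match the teacher's image-text similarity map
    l_inter = F.mse_loss(s_img @ s_txt.T, t_img @ t_txt.T)
    # Intra-modal loss: match image-image and text-text similarity maps
    l_intra = (F.mse_loss(s_img @ s_img.T, t_img @ t_img.T)
               + F.mse_loss(s_txt @ s_txt.T, t_txt @ t_txt.T))
    return l_inter + l_intra  # L = L_inter + L_intra

# Toy check with random embeddings: teacher and student dims may differ,
# since only the batch sizes of the similarity maps must match.
t_img = F.normalize(torch.randn(8, 512), dim=-1)
t_txt = F.normalize(torch.randn(8, 512), dim=-1)
s_img = F.normalize(torch.randn(8, 384), dim=-1)
s_txt = F.normalize(torch.randn(8, 384), dim=-1)
loss = distillation_loss(t_img, t_txt, s_img, s_txt)
```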

The image encoder is a ViT-S/16, while the text encoder is a 6-layer Transformer encoder. At the start of training, the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher's text encoder (layers zero-indexed).
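A sketch of this text-encoder initialization, using transformers' CLIPTextModel for the teacher (the block-copying shown here is an assumption about the setup, not the authors' exact code):

```python
import copy
from transformers import CLIPTextModel

teacher = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
layers = teacher.text_model.encoder.layers  # 12 transformer blocks, zero-indexed

# Every odd-indexed block (1, 3, 5, 7, 9, 11) seeds the 6-layer student.
student_blocks = [copy.deepcopy(layers[i]) for i in range(1, 12, 2)]
```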

Intended uses & limitations

Primary intended uses

Research on vision-language models, e.g. natural-language-supervised image classification, visual question answering, and text-to-image synthesis.
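For instance, zero-shot image classification could look like the sketch below, assuming the checkpoint is loadable with transformers' standard CLIP classes; the image path and label prompts are illustrative:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: the checkpoint follows transformers' CLIP architecture.
model = CLIPModel.from_pretrained("Ramos-Ramos/distillclip")
processor = CLIPProcessor.from_pretrained("Ramos-Ramos/distillclip")

image = Image.open("example.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # label probabilities
```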

Primary intended users

Researchers in the field of vision-language representation learning

Out-of-scope use cases

In-the-wild applications, e.g. industrial deployment

Training and evaluation data

The model was trained and evaluated on Conceptual Captions 3M.
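For reference, one way to obtain the data (not necessarily the authors' pipeline) is via the datasets library, which hosts Conceptual Captions as image URLs and captions:

```python
from datasets import load_dataset

# ~3.3M training examples; images must be fetched from the URLs separately.
cc3m = load_dataset("conceptual_captions", split="train")
print(cc3m[0])  # {'image_url': 'http://...', 'caption': '...'}
```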

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 3e-05
  • train_batch_size: 84
  • eval_batch_size: 84
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 10000
  • training_steps: 33513
  • mixed_precision_training: Native AMP
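A hypothetical reconstruction of this optimization setup in PyTorch/transformers, with a placeholder model standing in for the DistillCLIP student:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(1, 1)  # placeholder for the DistillCLIP student
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5,
                             betas=(0.9, 0.98), eps=1e-6)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=33_513)
scaler = torch.cuda.amp.GradScaler()  # Native AMP mixed precision
```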

Training results

Training Loss  Epoch  Step   Validation Loss  Intra-modal Loss  Inter-modal Loss
0.0259         0.01   500    0.0223           0.0194            0.0029
0.0197         0.03   1000   0.0178           0.0152            0.0026
0.017          0.04   1500   0.0153           0.0129            0.0023
0.0153         0.06   2000   0.0133           0.0112            0.0021
0.0142         0.07   2500   0.0135           0.0116            0.0019
0.0134         0.09   3000   0.0138           0.0119            0.0018
0.0127         0.1    3500   0.0117           0.0099            0.0018
0.012          0.12   4000   0.0116           0.0099            0.0017
0.0115         0.13   4500   0.0113           0.0097            0.0016
0.0111         0.15   5000   0.0112           0.0098            0.0014
0.0108         0.16   5500   0.0112           0.0097            0.0015
0.0106         0.18   6000   0.0107           0.0093            0.0014
0.0105         0.19   6500   0.0102           0.0089            0.0013
0.0101         0.21   7000   0.0100           0.0087            0.0013
0.0098         0.22   7500   0.0101           0.0089            0.0013
0.0098         0.24   8000   0.0100           0.0088            0.0013
0.0098         0.25   8500   0.0100           0.0089            0.0012
0.0094         0.27   9000   0.0095           0.0084            0.0011
0.0092         0.28   9500   0.0092           0.0080            0.0011
0.0091         0.3    10000  0.0097           0.0086            0.0011
0.0091         0.31   10500  0.0098           0.0087            0.0011
0.0087         0.33   11000  0.0090           0.0079            0.0011
0.0085         0.34   11500  0.0089           0.0079            0.0010
0.0088         0.36   12000  0.0086           0.0075            0.0010
0.0082         0.37   12500  0.0084           0.0075            0.0010
0.0082         0.39   13000  0.0080           0.0070            0.0009
0.008          0.4    13500  0.0080           0.0071            0.0010
0.008          0.42   14000  0.0088           0.0078            0.0010
0.0078         0.43   14500  0.0086           0.0076            0.0010
0.0077         0.45   15000  0.0081           0.0071            0.0010
0.0076         0.46   15500  0.0077           0.0068            0.0009
0.0075         0.48   16000  0.0076           0.0067            0.0009
0.0074         0.49   16500  0.0075           0.0066            0.0009
0.0072         0.51   17000  0.0070           0.0061            0.0009
0.0072         0.52   17500  0.0075           0.0066            0.0009
0.0071         0.54   18000  0.0072           0.0063            0.0009
0.0071         0.55   18500  0.0071           0.0063            0.0009
0.007          0.57   19000  0.0076           0.0067            0.0009
0.0069         0.58   19500  0.0074           0.0065            0.0009
0.0068         0.6    20000  0.0067           0.0059            0.0009
0.0069         0.61   20500  0.0067           0.0058            0.0008
0.0067         0.63   21000  0.0069           0.0061            0.0008
0.0067         0.64   21500  0.0071           0.0062            0.0008
0.0065         0.66   22000  0.0069           0.0061            0.0008
0.0065         0.67   22500  0.0066           0.0058            0.0008
0.0065         0.69   23000  0.0070           0.0062            0.0008
0.0064         0.7    23500  0.0068           0.0059            0.0008
0.0064         0.72   24000  0.0064           0.0056            0.0008
0.0063         0.73   24500  0.0066           0.0058            0.0008
0.0063         0.75   25000  0.0065           0.0057            0.0008
0.0062         0.76   25500  0.0066           0.0058            0.0008
0.0062         0.78   26000  0.0064           0.0056            0.0008
0.0062         0.79   26500  0.0065           0.0057            0.0008
0.0061         0.81   27000  0.0065           0.0057            0.0008
0.0061         0.82   27500  0.0063           0.0055            0.0008
0.0059         0.84   28000  0.0064           0.0057            0.0008
0.006          0.85   28500  0.0064           0.0056            0.0008
0.006          0.87   29000  0.0065           0.0057            0.0008
0.006          0.88   29500  0.0065           0.0057            0.0008
0.006          0.9    30000  0.0065           0.0057            0.0008
0.006          0.91   30500  0.0064           0.0056            0.0008
0.0059         0.93   31000  0.0064           0.0056            0.0008
0.006          0.94   31500  0.0064           0.0056            0.0008
0.0059         0.95   32000  0.0064           0.0056            0.0008
0.0058         0.97   32500  0.0064           0.0056            0.0008
0.0059         0.98   33000  0.0064           0.0056            0.0008
0.0059         1.0    33500  0.0064           0.0056            0.0008

Framework versions

  • Transformers 4.29.2
  • Pytorch 2.0.0
  • Datasets 2.13.1
  • Tokenizers 0.13.3