DistillCLIP

This model is a distilled version of CLIP-ViT-B/32, trained by distillation on Conceptual Captions 3M. It achieves the following results on the evaluation set:

  • Loss: 0.0064
  • Intra-modal Loss: 0.0056
  • Inter-modal Loss: 0.0008

Model description

DistillCLIP is a distilled version of CLIP. Specifically, the teacher model was CLIP-ViT-B/32.

The knowledge distillation scheme is as follows:

CLIP is distilled with two losses, $L_{inter}$ and $L_{intra}$. These respectively distill the inter-modal (image-text) and intra-modal (image-image, text-text) similarity maps with MSE losses. The final distillation loss is the sum of the two: $L = L_{inter} + L_{intra}$.
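As a minimal sketch of these losses (PyTorch; the function and variable names here are hypothetical, and embeddings are assumed L2-normalized so that dot products form cosine-similarity maps):

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_img, t_txt, s_img, s_txt):
    """Teacher (t_*) and student (s_*) embeddings of shape (batch, dim),
    assumed L2-normalized so X @ Y.T is a cosine-similarity map."""
    # Inter-modal loss: match the teacher's image-text similarity map
    l_inter = F.mse_loss(s_img @ s_txt.T, t_img @ t_txt.T)
    # Intra-modal loss: match image-image and text-text similarity maps
    l_intra = (F.mse_loss(s_img @ s_img.T, t_img @ t_img.T)
               + F.mse_loss(s_txt @ s_txt.T, t_txt @ t_txt.T))
    return l_inter + l_intra  # L = L_inter + L_intra

# Toy check with random embeddings: teacher and student dims may differ,
# since only the batch sizes of the similarity maps must match.
t_img = F.normalize(torch.randn(8, 512), dim=-1)
t_txt = F.normalize(torch.randn(8, 512), dim=-1)
s_img = F.normalize(torch.randn(8, 384), dim=-1)
s_txt = F.normalize(torch.randn(8, 384), dim=-1)
loss = distillation_loss(t_img, t_txt, s_img, s_txt)
```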

The image encoder is a ViT-S/16, while the text encoder is a 6-layer Transformer encoder. At the start of training, the image encoder was initialized with ImageNet-21K pretrained weights, while the text encoder was initialized with every odd-indexed layer of the teacher's text encoder (layers zero-indexed).
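A sketch of this text-encoder initialization, using transformers' CLIPTextModel for the teacher (the block-copying shown here is an assumption about the setup, not the authors' exact code):

```python
import copy
from transformers import CLIPTextModel

teacher = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
layers = teacher.text_model.encoder.layers  # 12 transformer blocks, zero-indexed

# Every odd-indexed block (1, 3, 5, 7, 9, 11) seeds the 6-layer student.
student_blocks = [copy.deepcopy(layers[i]) for i in range(1, 12, 2)]
```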

Intended uses & limitations

Primary intended uses

Research on vision-language models, e.g. natural-language-supervised image classification, visual question answering, and text-to-image synthesis.
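For instance, zero-shot image classification could look like the sketch below, assuming the checkpoint is loadable with transformers' standard CLIP classes; the image path and label prompts are illustrative:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: the checkpoint follows transformers' CLIP architecture.
model = CLIPModel.from_pretrained("Ramos-Ramos/distillclip")
processor = CLIPProcessor.from_pretrained("Ramos-Ramos/distillclip")

image = Image.open("example.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # label probabilities
```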

Primary intended users

Researchers in the field of vision-language representation learning

Out-of-scope use cases

In-the-wild applications, e.g. industrial deployment

Training and evaluation data

The model was trained and evaluated on Conceptual Captions 3M.
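For reference, one way to obtain the data (not necessarily the authors' pipeline) is via the datasets library, which hosts Conceptual Captions as image URLs and captions:

```python
from datasets import load_dataset

# ~3.3M training examples; images must be fetched from the URLs separately.
cc3m = load_dataset("conceptual_captions", split="train")
print(cc3m[0])  # {'image_url': 'http://...', 'caption': '...'}
```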

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 3e-05
  • train_batch_size: 84
  • eval_batch_size: 84
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 10000
  • training_steps: 33513
  • mixed_precision_training: Native AMP
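A hypothetical reconstruction of this optimization setup in PyTorch/transformers, with a placeholder model standing in for the DistillCLIP student:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(1, 1)  # placeholder for the DistillCLIP student
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5,
                             betas=(0.9, 0.98), eps=1e-6)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=33_513)
scaler = torch.cuda.amp.GradScaler()  # Native AMP mixed precision
```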

Training results

Training Loss  Epoch  Step   Validation Loss  Intra-modal Loss  Inter-modal Loss
0.0259         0.01   500    0.0223           0.0194            0.0029
0.0197         0.03   1000   0.0178           0.0152            0.0026
0.017          0.04   1500   0.0153           0.0129            0.0023
0.0153         0.06   2000   0.0133           0.0112            0.0021
0.0142         0.07   2500   0.0135           0.0116            0.0019
0.0134         0.09   3000   0.0138           0.0119            0.0018
0.0127         0.1    3500   0.0117           0.0099            0.0018
0.012          0.12   4000   0.0116           0.0099            0.0017
0.0115         0.13   4500   0.0113           0.0097            0.0016
0.0111         0.15   5000   0.0112           0.0098            0.0014
0.0108         0.16   5500   0.0112           0.0097            0.0015
0.0106         0.18   6000   0.0107           0.0093            0.0014
0.0105         0.19   6500   0.0102           0.0089            0.0013
0.0101         0.21   7000   0.0100           0.0087            0.0013
0.0098         0.22   7500   0.0101           0.0089            0.0013
0.0098         0.24   8000   0.0100           0.0088            0.0013
0.0098         0.25   8500   0.0100           0.0089            0.0012
0.0094         0.27   9000   0.0095           0.0084            0.0011
0.0092         0.28   9500   0.0092           0.0080            0.0011
0.0091         0.3    10000  0.0097           0.0086            0.0011
0.0091         0.31   10500  0.0098           0.0087            0.0011
0.0087         0.33   11000  0.0090           0.0079            0.0011
0.0085         0.34   11500  0.0089           0.0079            0.0010
0.0088         0.36   12000  0.0086           0.0075            0.0010
0.0082         0.37   12500  0.0084           0.0075            0.0010
0.0082         0.39   13000  0.0080           0.0070            0.0009
0.008          0.4    13500  0.0080           0.0071            0.0010
0.008          0.42   14000  0.0088           0.0078            0.0010
0.0078         0.43   14500  0.0086           0.0076            0.0010
0.0077         0.45   15000  0.0081           0.0071            0.0010
0.0076         0.46   15500  0.0077           0.0068            0.0009
0.0075         0.48   16000  0.0076           0.0067            0.0009
0.0074         0.49   16500  0.0075           0.0066            0.0009
0.0072         0.51   17000  0.0070           0.0061            0.0009
0.0072         0.52   17500  0.0075           0.0066            0.0009
0.0071         0.54   18000  0.0072           0.0063            0.0009
0.0071         0.55   18500  0.0071           0.0063            0.0009
0.007          0.57   19000  0.0076           0.0067            0.0009
0.0069         0.58   19500  0.0074           0.0065            0.0009
0.0068         0.6    20000  0.0067           0.0059            0.0009
0.0069         0.61   20500  0.0067           0.0058            0.0008
0.0067         0.63   21000  0.0069           0.0061            0.0008
0.0067         0.64   21500  0.0071           0.0062            0.0008
0.0065         0.66   22000  0.0069           0.0061            0.0008
0.0065         0.67   22500  0.0066           0.0058            0.0008
0.0065         0.69   23000  0.0070           0.0062            0.0008
0.0064         0.7    23500  0.0068           0.0059            0.0008
0.0064         0.72   24000  0.0064           0.0056            0.0008
0.0063         0.73   24500  0.0066           0.0058            0.0008
0.0063         0.75   25000  0.0065           0.0057            0.0008
0.0062         0.76   25500  0.0066           0.0058            0.0008
0.0062         0.78   26000  0.0064           0.0056            0.0008
0.0062         0.79   26500  0.0065           0.0057            0.0008
0.0061         0.81   27000  0.0065           0.0057            0.0008
0.0061         0.82   27500  0.0063           0.0055            0.0008
0.0059         0.84   28000  0.0064           0.0057            0.0008
0.006          0.85   28500  0.0064           0.0056            0.0008
0.006          0.87   29000  0.0065           0.0057            0.0008
0.006          0.88   29500  0.0065           0.0057            0.0008
0.006          0.9    30000  0.0065           0.0057            0.0008
0.006          0.91   30500  0.0064           0.0056            0.0008
0.0059         0.93   31000  0.0064           0.0056            0.0008
0.006          0.94   31500  0.0064           0.0056            0.0008
0.0059         0.95   32000  0.0064           0.0056            0.0008
0.0058         0.97   32500  0.0064           0.0056            0.0008
0.0059         0.98   33000  0.0064           0.0056            0.0008
0.0059         1.0    33500  0.0064           0.0056            0.0008

Framework versions

  • Transformers 4.29.2
  • Pytorch 2.0.0
  • Datasets 2.13.1
  • Tokenizers 0.13.3