Contrastive Language-Image Pre-training (CLIP) is a training method and a class of models in which pairs of images and natural-language captions are fed into a dual-encoder model to teach it to associate images with language. A trained CLIP model can recognize which image and caption belong together.
And by dual-encoder, we mean:
- It has a vision transformer for its visual encoder
- It has a transformer text encoder for its text encoder
Training
CLIP is trained on batches of image-text pairs. For each batch:
- You compute all image embeddings
- You compute all text embeddings
- You L2-normalize both sets of embeddings so they all have unit length
- You compute a similarity matrix (cosine similarity)
Cosine similarity has a tiny dynamic range of [−1, 1], so the similarities are multiplied by a single learned scaling factor (the exponentiated logit_scale, acting as a temperature). It is learned so that the model can calibrate the sharpness of the similarity distribution and confidently distinguish matching image-caption pairs from non-matching pairs.
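The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not OpenAI's code: the function name and the fixed `scale` argument are mine, and in the real model the scale is `exp(logit_scale)` and is trained along with everything else.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, scale):
    """Symmetric cross-entropy over a batch of paired embeddings."""
    # Normalize to unit length so the dot products below are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = scale * img @ txt.T

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target:
        # caption i is the correct match for image i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With perfectly matched embeddings and a large scale the loss approaches zero; unrelated random embeddings give a loss near log(batch size).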
CLIP ViT-L/14
Looking at OpenAI’s original model config:1
ViT-L/14
The vision encoder (ViT-L/14) has the following learned parameters:
- Patch embedding (Conv2d, 3 → 1024, kernel 14×14, no bias): 3 × 1024 × 14 × 14 = 602,112
- Class token: 1,024
- Position embeddings (256 patches + 1 class token = 257 positions): 257 × 1024 = 263,168
- Pre-encoder LayerNorm: 2 × 1024 = 2,048
- Per transformer layer (hidden=1024, intermediate=4096):
  - LayerNorm × 2: 2 × 2 × 1024 = 4,096
  - Attention (Q/K/V/O, each 1024 × 1024 + bias): 4 × (1024² + 1024) = 4,198,400
  - MLP FC1 (1024 → 4096): 1024 × 4096 + 4096 = 4,198,400
  - MLP FC2 (4096 → 1024): 4096 × 1024 + 1024 = 4,195,328
  - Per layer total: 12,596,224
- 24 layers: 24 × 12,596,224 = 302,309,376
- Final LayerNorm: 2 × 1024 = 2,048
- Visual projection (1024 → 768, no bias): 1024 × 768 = 786,432
- Vision subtotal: 303,966,208
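The arithmetic above can be re-derived mechanically. This is a quick sanity script; the variable names are mine, not from any CLIP codebase.

```python
# Re-derive the ViT-L/14 vision parameter count from the shapes above.
hidden, intermediate, n_layers = 1024, 4096, 24

patch_embed = 3 * hidden * 14 * 14          # Conv2d 3 -> 1024, 14x14 kernel, no bias
class_token = hidden
pos_embed = 257 * hidden                    # 256 patches + 1 class token
pre_ln = 2 * hidden                         # LayerNorm weight + bias

per_layer = (
    2 * 2 * hidden                          # two LayerNorms
    + 4 * (hidden * hidden + hidden)        # Q, K, V, O projections with bias
    + hidden * intermediate + intermediate  # MLP FC1
    + intermediate * hidden + hidden        # MLP FC2
)                                           # = 12,596,224

final_ln = 2 * hidden
projection = hidden * 768                   # visual projection, no bias

vision_total = (patch_embed + class_token + pos_embed + pre_ln
                + n_layers * per_layer + final_ln + projection)
print(f"{vision_total:,}")                  # 303,966,208
```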
Text encoder (no fancy name)
The text encoder has the following learned parameters:
- Token embeddings (vocab 49,408): 49,408 × 768 = 37,945,344
- Position embeddings (77 positions): 77 × 768 = 59,136
- Per transformer layer (hidden=768, intermediate=3072):
  - LayerNorm × 2: 2 × 2 × 768 = 3,072
  - Attention (Q/K/V/O, each 768 × 768 + bias): 4 × (768² + 768) = 2,362,368
  - MLP FC1 (768 → 3072): 768 × 3072 + 3072 = 2,362,368
  - MLP FC2 (3072 → 768): 3072 × 768 + 768 = 2,360,064
  - Per layer total: 7,087,872
- 12 layers: 12 × 7,087,872 = 85,054,464
- Final LayerNorm: 2 × 768 = 1,536
- Text projection (768 → 768, no bias): 768 × 768 = 589,824
- Text subtotal: 123,650,304
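The same kind of check works for the text side (again a sketch with my own variable names):

```python
# Re-derive the text encoder parameter count from the shapes above.
hidden, intermediate, n_layers = 768, 3072, 12
vocab, max_len = 49408, 77

token_embed = vocab * hidden                # 37,945,344
pos_embed = max_len * hidden                # 59,136

per_layer = (
    2 * 2 * hidden                          # two LayerNorms
    + 4 * (hidden * hidden + hidden)        # Q, K, V, O projections with bias
    + hidden * intermediate + intermediate  # MLP FC1
    + intermediate * hidden + hidden        # MLP FC2
)                                           # = 7,087,872

final_ln = 2 * hidden
projection = hidden * 768                   # text projection, no bias

text_total = (token_embed + pos_embed + n_layers * per_layer
              + final_ln + projection)
print(f"{text_total:,}")                    # 123,650,304
```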
Grand Total
| Component | Parameters |
|---|---|
| Vision encoder | 303,966,208 |
| Text encoder | 123,650,304 |
| logit_scale | 1 |
| Total | 427,616,513 |
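The grand total is just the sum of the three rows; the commented lines sketch how one could cross-check it against the released Hugging Face checkpoint (which downloads the full weights, so it is left as an optional step here):

```python
# vision + text + logit_scale
total = 303_966_208 + 123_650_304 + 1
print(f"{total:,}")  # 427,616,513

# Optional cross-check against the released weights:
# from transformers import CLIPModel
# model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
# assert sum(p.numel() for p in model.parameters()) == total
```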