Contrastive Language-Image Pre-training (CLIP) is a training method and a class of models in which pairs of images and natural-language captions are fed into a dual-encoder transformer architecture, teaching it to associate images with language. A trained CLIP model can recognize which image and caption belong together.

And by dual-encoder, we mean:

  • It has a vision transformer (ViT) as its image encoder
  • It has a transformer as its text encoder

Training

CLIP is trained on batches of image-text pairs. The procedure is:

  • You compute all image embeddings
  • You compute all text embeddings
  • You normalize those embeddings so they all have unit length
  • You compute a similarity matrix (cosine similarity)
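
The steps above can be sketched in a few lines of NumPy. The batch size and embedding width here are toy values (OpenAI trained with a batch of 32,768, and ViT-L/14 projects to 768 dimensions), and the random vectors stand in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8  # toy sizes for illustration only

# Stand-ins for the outputs of the image and text encoders
img = rng.normal(size=(batch, dim))
txt = rng.normal(size=(batch, dim))

# Normalize every embedding to unit length
img = img / np.linalg.norm(img, axis=1, keepdims=True)
txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

# For unit vectors the dot product is the cosine similarity, so the
# full batch-vs-batch similarity matrix is a single matrix multiply.
sim = img @ txt.T  # shape (batch, batch); sim[i, j] = cos(img_i, txt_j)
```

Entry sim[i, i] is the similarity of a matching pair; every off-diagonal entry is a negative.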

Cosine similarity has a tiny dynamic range of [−1, 1], so a single learned scaling factor (the logit scale) is applied. It is learned so that the model can calibrate the sharpness of the similarity distribution and confidently distinguish matching image-caption pairs from non-matching ones.
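
A minimal NumPy sketch of how the scale enters the loss: CLIP stores the factor as a log (initialized to log(1/0.07), with its exponential clamped at 100) and multiplies exp(logit_scale) into the similarity matrix before a symmetric cross-entropy over rows and columns.

```python
import numpy as np

def clip_loss(sim, logit_scale):
    """Symmetric contrastive loss over an (N, N) cosine-similarity matrix.

    sim[i, j] lies in [-1, 1]; matching pairs sit on the diagonal.
    logit_scale is stored as a log so exp() keeps the temperature positive.
    """
    logits = np.exp(logit_scale) * sim  # temperature-scaled logits

    def xent(l):  # row-wise softmax cross-entropy, targets on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With a perfect similarity matrix (the identity) and a large scale, the loss approaches zero; a flat matrix gives the chance-level loss log N.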

CLIP ViT-L/14

Looking at OpenAI’s original model config:1

ViT-L/14

The vision encoder (ViT-L/14) has the following learned parameters:

  • Patch embedding (Conv2d, 3 → 1024, kernel 14 × 14): 3 × 14 × 14 × 1024 + 1,024 = 603,136

  • Class token: 1,024

  • Position embeddings (16² + 1 = 257 positions): 257 × 1,024 = 263,168

  • Per transformer layer (hidden=1024, intermediate=4096):

    • LayerNorm × 2: 2 × 2 × 1,024 = 4,096
    • Attention (Q/K/V/O, each 1,024 × 1,024 + 1,024): 4 × 1,049,600 = 4,198,400
    • MLP FC1: 1,024 × 4,096 + 4,096 = 4,198,400
    • MLP FC2: 4,096 × 1,024 + 1,024 = 4,195,328
    • Per layer total: 12,596,224
  • 24 layers: 24 × 12,596,224 = 302,309,376

  • Final LayerNorm: 2 × 1,024 = 2,048

  • Visual projection (1024 → 768): 1,024 × 768 = 786,432

  • Vision subtotal: 303,965,184
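
The vision subtotal can be checked with a few lines of arithmetic. This sketch counts the patch-convolution bias, which is the assumption that makes the items sum to 303,965,184:

```python
hidden, inter, layers, patch, grid = 1024, 4096, 24, 14, 16  # 224 / 14 = 16

patch_embed = 3 * patch * patch * hidden + hidden  # conv weight + bias
cls_token = hidden
pos_embed = (grid * grid + 1) * hidden             # 257 positions
ln = 2 * hidden                                    # weight + bias
attn = 4 * (hidden * hidden + hidden)              # Q, K, V, O projections
fc1 = hidden * inter + inter
fc2 = inter * hidden + hidden
per_layer = 2 * ln + attn + fc1 + fc2              # two LayerNorms per layer

vision_total = (patch_embed + cls_token + pos_embed
                + layers * per_layer
                + 2 * hidden                       # final LayerNorm
                + hidden * 768)                    # visual projection, no bias
print(f"{vision_total:,}")  # 303,965,184
```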

Text encoder (no fancy name)

The text encoder has the following learned parameters:

  • Token embeddings (vocab 49,408): 49,408 × 768 = 37,945,344

  • Position embeddings (77 positions): 77 × 768 = 59,136

  • Per transformer layer (hidden=768, intermediate=3072):

    • LayerNorm × 2: 2 × 2 × 768 = 3,072
    • Attention (Q/K/V/O, each 768 × 768 + 768): 4 × 590,592 = 2,362,368
    • MLP FC1: 768 × 3,072 + 3,072 = 2,362,368
    • MLP FC2: 3,072 × 768 + 768 = 2,360,064
    • Per layer total: 7,087,872
  • 12 layers: 12 × 7,087,872 = 85,054,464

  • Final LayerNorm: 2 × 768 = 1,536

  • Text projection (768 → 768): 768 × 768 = 589,824

  • Text subtotal: 123,650,304
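
The same back-of-envelope check works for the text encoder, using the vocabulary size and context length from the config cited below; the items sum exactly to 123,650,304:

```python
hidden, inter, layers, vocab, ctx = 768, 3072, 12, 49408, 77

attn = 4 * (hidden * hidden + hidden)              # Q, K, V, O projections
fc1 = hidden * inter + inter
fc2 = inter * hidden + hidden
per_layer = 2 * (2 * hidden) + attn + fc1 + fc2    # two LayerNorms per layer

text_total = (vocab * hidden + ctx * hidden        # token + position embeddings
              + layers * per_layer
              + 2 * hidden                         # final LayerNorm
              + hidden * hidden)                   # text projection, no bias
print(f"{text_total:,}")  # 123,650,304
```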

Grand Total

  Component         Parameters
  Vision encoder   303,965,184
  Text encoder     123,650,304
  logit_scale                1
  Total            427,615,489

Footnotes

  1. config.json · openai/clip-vit-large-patch14 at main