Contrastive Language-Image Pre-training (CLIP) is a training method and a class of models in which pairs of images and natural-language captions are fed into a dual-encoder model to teach it to associate images with language. A trained CLIP model can recognize which image and caption belong together.
And by dual-encoder, we mean:
- It has a vision transformer for its visual encoder
- It has a transformer text encoder for its text encoder
Training
CLIP is trained on batches of image-text pairs. For each batch:
- You compute all image embeddings
- You compute all text embeddings
- You L2-normalize both sets of embeddings so they all have unit length
- You compute a similarity matrix (cosine similarity)
Cosine similarity has a tiny dynamic range of [−1, 1], so the similarities are multiplied by a single learned scaling factor (the exponentiated logit_scale, acting as a temperature). It is learned so that the model can calibrate the sharpness of the similarity distribution and confidently distinguish matching image-caption pairs from non-matching pairs.
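The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not OpenAI's code: the function name and the fixed `scale` argument are mine, and in the real model the scale is `exp(logit_scale)` and is trained along with everything else.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, scale):
    """Symmetric cross-entropy over a batch of paired embeddings."""
    # Normalize to unit length so the dot products below are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = scale * img @ txt.T

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target:
        # caption i is the correct match for image i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With perfectly matched embeddings and a large scale the loss approaches zero; unrelated random embeddings give a loss near log(batch size).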
CLIP ViT-L/14
Looking at OpenAI’s original model config:1
ViT-L/14
The vision encoder (ViT-L/14) has the following learned parameters:
- Patch embedding (Conv2d, 3 → 1024, kernel 14×14, no bias): 3 × 1024 × 14 × 14 = 602,112
- Class token: 1,024
- Position embeddings (256 patches + 1 class token = 257 positions): 257 × 1024 = 263,168
- Pre-encoder LayerNorm: 2 × 1024 = 2,048
- Per transformer layer (hidden=1024, intermediate=4096):
  - LayerNorm × 2: 2 × 2 × 1024 = 4,096
  - Attention (Q/K/V/O, each 1024 × 1024 + bias): 4 × (1024² + 1024) = 4,198,400
  - MLP FC1 (1024 → 4096): 1024 × 4096 + 4096 = 4,198,400
  - MLP FC2 (4096 → 1024): 4096 × 1024 + 1024 = 4,195,328
  - Per layer total: 12,596,224
- 24 layers: 24 × 12,596,224 = 302,309,376
- Final LayerNorm: 2 × 1024 = 2,048
- Visual projection (1024 → 768, no bias): 1024 × 768 = 786,432
- Vision subtotal: 303,966,208
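The arithmetic above can be re-derived mechanically. This is a quick sanity script; the variable names are mine, not from any CLIP codebase.

```python
# Re-derive the ViT-L/14 vision parameter count from the shapes above.
hidden, intermediate, n_layers = 1024, 4096, 24

patch_embed = 3 * hidden * 14 * 14          # Conv2d 3 -> 1024, 14x14 kernel, no bias
class_token = hidden
pos_embed = 257 * hidden                    # 256 patches + 1 class token
pre_ln = 2 * hidden                         # LayerNorm weight + bias

per_layer = (
    2 * 2 * hidden                          # two LayerNorms
    + 4 * (hidden * hidden + hidden)        # Q, K, V, O projections with bias
    + hidden * intermediate + intermediate  # MLP FC1
    + intermediate * hidden + hidden        # MLP FC2
)                                           # = 12,596,224

final_ln = 2 * hidden
projection = hidden * 768                   # visual projection, no bias

vision_total = (patch_embed + class_token + pos_embed + pre_ln
                + n_layers * per_layer + final_ln + projection)
print(f"{vision_total:,}")                  # 303,966,208
```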
Text encoder (no fancy name)
The text encoder has the following learned parameters:
- Token embeddings (vocab 49,408): 49,408 × 768 = 37,945,344
- Position embeddings (77 positions): 77 × 768 = 59,136
- Per transformer layer (hidden=768, intermediate=3072):
  - LayerNorm × 2: 2 × 2 × 768 = 3,072
  - Attention (Q/K/V/O, each 768 × 768 + bias): 4 × (768² + 768) = 2,362,368
  - MLP FC1 (768 → 3072): 768 × 3072 + 3072 = 2,362,368
  - MLP FC2 (3072 → 768): 3072 × 768 + 768 = 2,360,064
  - Per layer total: 7,087,872
- 12 layers: 12 × 7,087,872 = 85,054,464
- Final LayerNorm: 2 × 768 = 1,536
- Text projection (768 → 768, no bias): 768 × 768 = 589,824
- Text subtotal: 123,650,304
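The same kind of check works for the text side (again a sketch with my own variable names):

```python
# Re-derive the text encoder parameter count from the shapes above.
hidden, intermediate, n_layers = 768, 3072, 12
vocab, max_len = 49408, 77

token_embed = vocab * hidden                # 37,945,344
pos_embed = max_len * hidden                # 59,136

per_layer = (
    2 * 2 * hidden                          # two LayerNorms
    + 4 * (hidden * hidden + hidden)        # Q, K, V, O projections with bias
    + hidden * intermediate + intermediate  # MLP FC1
    + intermediate * hidden + hidden        # MLP FC2
)                                           # = 7,087,872

final_ln = 2 * hidden
projection = hidden * 768                   # text projection, no bias

text_total = (token_embed + pos_embed + n_layers * per_layer
              + final_ln + projection)
print(f"{text_total:,}")                    # 123,650,304
```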
Grand Total
| Component | Parameters |
|---|---|
| Vision encoder | 303,966,208 |
| Text encoder | 123,650,304 |
| logit_scale | 1 |
| Total | 427,616,513 |
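The grand total is just the sum of the three rows; the commented lines sketch how one could cross-check it against the released Hugging Face checkpoint (which downloads the full weights, so it is left as an optional step here):

```python
# vision + text + logit_scale
total = 303_966_208 + 123_650_304 + 1
print(f"{total:,}")  # 427,616,513

# Optional cross-check against the released weights:
# from transformers import CLIPModel
# model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
# assert sum(p.numel() for p in model.parameters()) == total
```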