A vision transformer (ViT) is an encoder-only transformer that receives images as inputs and generates embeddings as outputs. It does for images what an embedding lookup table (dictionary) does for text models: it maps raw input (images) into the multidimensional space that transformers operate in.

A ViT (or another visual encoder) is an essential component of multimodal models such as vision-language models (VLMs).
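The mapping from raw pixels into the transformer's embedding space starts with patch embedding: the image is cut into fixed-size patches, each patch is flattened, and a linear projection turns it into one token embedding. A minimal sketch, assuming typical ViT sizes (224x224 input, 16x16 patches, 768-dim embeddings) and using a random matrix in place of the learned projection:

```python
import numpy as np

def patch_embed(image, patch=16, d_model=768, rng=np.random.default_rng(0)):
    """Split an image into patches and project each into embedding space."""
    h, w, c = image.shape
    # Cut the image into (patch x patch) tiles and flatten each one.
    patches = (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch * patch * c)
    )
    # In a real ViT this projection is learned; random weights here
    # are only for illustration.
    W = rng.standard_normal((patch * patch * c, d_model))
    return patches @ W  # one embedding vector per patch

img = np.zeros((224, 224, 3))  # a 224x224 RGB image
tokens = patch_embed(img)
print(tokens.shape)            # (196, 768): 196 patch tokens of dim 768
```

These 196 patch tokens (plus position information and, in many variants, a class token) are what the transformer encoder layers then attend over.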

Examples

| ViT Name | Parameters | Creator | Used in |
| --- | --- | --- | --- |
| CLIP ViT-L/14 | 303M | OpenAI | LLaVA variants |
| EVA-CLIP ViT-bigG/14 | | BAAI | Early Qwen |