A vision transformer (ViT) is an encoder-only transformer that takes images as input and produces embeddings as output. It plays the same role for images that an embedding lookup table (dictionary) plays for text models: it maps raw input (pixels) into the high-dimensional space that transformers operate in.
A ViT (or some other visual encoder) is an essential component of multimodal models and vision-language models (VLMs).
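The mapping from pixels to embeddings can be sketched in a few lines: the image is split into fixed-size patches, each patch is flattened, and a learned linear projection maps it into the embedding space. Below is a minimal NumPy illustration; the sizes (224x224 image, 16x16 patches, 768-dim embeddings) are common choices but assumptions here, and the projection matrix is random rather than learned.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * c)       # (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # a 224x224 RGB image (illustrative)
patches = patchify(image, patch_size=16) # 14x14 = 196 patches, each 16*16*3 = 768 values

# In a real ViT this projection is learned; here it is random for illustration.
embed_dim = 768
projection = rng.standard_normal((patches.shape[1], embed_dim))
embeddings = patches @ projection        # (196, 768): one embedding per patch
print(embeddings.shape)
```

Each row of `embeddings` is one patch token; a real ViT would add position embeddings and then feed the sequence through transformer encoder layers.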