A vision transformer (ViT) is an encoder-only transformer that takes images as input and produces embeddings as output. It plays the same role for images that an embedding lookup table (dictionary) plays for text models: it maps raw input (pixels) into the high-dimensional space that transformers operate in.
A ViT (or some other visual encoder) is an essential component of multimodal models and vision-language models (VLMs).
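The mapping from pixels to embeddings can be sketched in a few lines: the image is split into fixed-size patches, each patch is flattened, and a learned linear projection maps it into the embedding space. Below is a minimal NumPy illustration; the sizes (224x224 image, 16x16 patches, 768-dim embeddings) are common choices but assumptions here, and the projection matrix is random rather than learned.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * c)       # (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # a 224x224 RGB image (illustrative)
patches = patchify(image, patch_size=16) # 14x14 = 196 patches, each 16*16*3 = 768 values

# In a real ViT this projection is learned; here it is random for illustration.
embed_dim = 768
projection = rng.standard_normal((patches.shape[1], embed_dim))
embeddings = patches @ projection        # (196, 768): one embedding per patch
print(embeddings.shape)
```

Each row of `embeddings` is one patch token; a real ViT would add position embeddings and then feed the sequence through transformer encoder layers.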