Vision Language Models

Vision language models (VLMs) are multimodal generative AI models capable of reasoning over text, image, and video prompts.

What Are Vision Language Models?

Vision language models are multimodal AI systems built by combining a large language model (LLM) with a vision encoder, giving the LLM the ability to “see.”

Similar to LLMs, VLMs can understand text input, provide advanced reasoning, and generate text responses—with the added ability to process image inputs supplied in the prompt.


Unlike traditional computer vision models, VLMs are not bound by a fixed set of classes or a specific task like classification or detection. The combination of a vision encoder and LLM, both pretrained on vast corpora of text and image-caption pairs, allows VLMs to be instructed in natural language and generalize to nearly any type of vision task.

Why Are Vision Language Models Important?

To understand the importance of VLMs, it’s helpful to know how past computer vision (CV) models work. Traditional convolutional neural network (CNN)-based CV models are trained for a specific task on a bounded set of classes. For example:

  • A classification model that identifies whether an image contains a cat or a dog
  • An optical character detection and recognition CV model that reads text in an image but doesn’t interpret the format or any visual data within a document

Previous CV models were trained for a specific purpose and did not have the ability to go beyond the task or set of classes they were developed for and trained on. If the use case changed at all or required a new class to be added to the model, a developer would have to collect and label a large number of images and retrain the model. This is an expensive, time-consuming process. Additionally, CV models don't have any natural language understanding.

VLMs bring a new class of capabilities by combining the power of foundation models, like CLIP, with LLMs, giving them both vision and language capabilities. Out of the box, VLMs have strong zero-shot performance on a variety of vision tasks, like visual question-answering, classification, and optical character recognition. They are also extremely flexible and can be used not just on a fixed set of classes but for nearly any use case by simply changing a text prompt.

Using a VLM is very similar to interacting with an LLM. The user supplies text prompts that can be interleaved with images. The inputs are then used to generate text output. The input prompts are open-ended, allowing the user to instruct the VLM to answer questions, summarize, explain the content, or reason with the image. Users can chat back and forth with the VLM, with the ability to add images into the context of the conversation. VLMs can also be integrated into visual agents to autonomously perform vision tasks.
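As a concrete illustration, here is a minimal sketch of prompting a VLM through an OpenAI-compatible chat API with text and an image interleaved in a single message, followed by a follow-up turn. The endpoint URL, model name ("example-vlm"), and image file are placeholders, not a specific product's values.

```python
import base64
from openai import OpenAI

# Hypothetical local endpoint serving a VLM behind an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local image so it can be interleaved with the text prompt.
with open("shelf.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }
]

reply = client.chat.completions.create(model="example-vlm", messages=messages)
print(reply.choices[0].message.content)

# Chat back and forth: the image stays in the conversation context.
messages.append({"role": "assistant", "content": reply.choices[0].message.content})
messages.append({"role": "user", "content": "Are any of the shelves empty?"})
reply = client.chat.completions.create(model="example-vlm", messages=messages)
print(reply.choices[0].message.content)
```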

How Do Vision Language Models Work?

Most VLMs follow an architecture with three parts:

  • A vision encoder
  • A projector
  • An LLM

The vision encoder is typically a CLIP-based model with a transformer architecture that has been trained on millions of image-text pairs, giving it the ability to associate images and text. The projector is a set of layers that translates the output of the vision encoder into a form the LLM can understand, often interpreted as image tokens. This projector can be a simple linear layer, as in LLaVA and VILA, or something more complex like the cross-attention layers used in Llama 3.2 Vision.

Any off-the-shelf LLM can be used to build a VLM. There are hundreds of VLM variants that combine various LLMs with vision encoders.
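To make the data flow concrete, here is a simplified sketch of the three-part architecture using the linear-projector (LLaVA/VILA-style) approach. The modules and dimensions are placeholders, not a specific released model; the LLM is assumed to accept precomputed embeddings, as Hugging Face decoder models do via inputs_embeds.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy illustration of the vision encoder -> projector -> LLM pipeline."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # CLIP-style ViT returning patch features
        self.projector = nn.Linear(vision_dim, llm_dim)   # LLaVA/VILA-style linear projector
        self.llm = llm                                    # decoder-only LLM accepting embeddings

    def forward(self, pixel_values, text_embeddings):
        # 1. Encode the image into a sequence of patch features: (batch, patches, vision_dim)
        patch_features = self.vision_encoder(pixel_values)
        # 2. Project them into the LLM embedding space ("image tokens"): (batch, patches, llm_dim)
        image_tokens = self.projector(patch_features)
        # 3. Prepend the image tokens to the text token embeddings.
        inputs = torch.cat([image_tokens, text_embeddings], dim=1)
        # 4. The LLM generates text conditioned on both modalities.
        return self.llm(inputs_embeds=inputs)
```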

Figure 2: A common three-part architecture for vision language models

How Are Vision Language Models Trained?

VLMs are trained in several stages: pretraining, followed by supervised fine-tuning. Optionally, parameter-efficient fine-tuning (PEFT) can be applied as a final stage to create a domain-specific VLM on custom data.

The pretraining stage aligns the vision encoder, projector, and LLM to essentially speak the same language when interpreting the text and image input. This is done using large corpora of text and images with image-caption pairs and interleaved image-text data. Once the three components have been aligned through pretraining, the VLM goes through a supervised fine-tuning stage to help it understand how to respond to user prompts.

The data used in this stage are a blend of example prompts with text and/or image input and the expected response of the model. For example, this data could be prompts telling the model to describe the image or to count all the objects in the frame with the expected correct response. After this round of training, the VLM will understand how to best interpret images and respond to user prompts.
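As an illustration, supervised fine-tuning data is often organized as records that pair an optional image with an instruction and the expected response. The field names and file paths below are made up for this sketch, not a specific dataset schema.

```python
# Illustrative supervised fine-tuning records for a VLM.
sft_examples = [
    {
        "image": "images/street_scene_001.jpg",
        "prompt": "Describe the image.",
        "response": "A busy intersection with three cars waiting at a red light.",
    },
    {
        "image": "images/shelf_004.jpg",
        "prompt": "Count all the objects in the frame.",
        "response": "There are 12 boxes on the shelf.",
    },
    {
        # Text-only examples are often mixed in to preserve language-only skills.
        "image": None,
        "prompt": "Summarize the following paragraph in one sentence: <paragraph>",
        "response": "<one-sentence summary>",
    },
]
```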

Figure 3: Training for VLMs is often done in several stages to target certain parts of the model

Once the VLM is trained, it can be used in the same way as an LLM by providing prompts that can also include images interleaved in text. The VLM will then generate a text response based on the inputs. VLMs are typically deployed with an OpenAI style REST API interface to make it easy to interact with the model.
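For a deployed model, that OpenAI-style REST interface can also be called directly with a plain HTTP request. The endpoint URL, model name, and image URL below are placeholders for whatever service you are running.

```python
import requests

# Placeholder endpoint and model name for a VLM served behind an
# OpenAI-style /v1/chat/completions route.
payload = {
    "model": "example-vlm",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What hazards are visible on this road?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/road.jpg"}},
            ],
        }
    ],
    "max_tokens": 256,
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```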

More advanced techniques are currently being researched to enhance vision capabilities:

  • Ensembling vision encoders to process image inputs
  • Breaking apart high-resolution image inputs into smaller tiles for processing
  • Increasing context length to improve long video understanding

All of these advancements are progressing the capabilities of VLMs from only understanding single-image input to being highly capable models that can compare and contrast images, accurately read text, understand long videos, and have strong spatial understanding.

How Are Vision Language Models Benchmarked?

Several common benchmarks, such as MMMU, Video-MME, MathVista, ChartQA, and DocVQA, exist to measure how well vision language models perform on a variety of tasks, such as:

  • Visual question-answering
  • Logic and reasoning
  • Document understanding
  • Multi-image comparisons
  • Video understanding

Most benchmarks consist of a set of images, each with several associated questions, often posed in multiple-choice format. The multiple-choice format is the easiest way to consistently benchmark and compare VLMs. These questions test the VLM's perception, knowledge, and reasoning capabilities. When running these benchmarks, the VLM is provided with the image, the question, and several multiple-choice answers it must choose from.

Figure 4: Example multiple-choice questions for VLMs used in the MMMU benchmark

Source: MMMU

The accuracy of the VLM is the number of correct choices over the set of multiple-choice questions. Some benchmarks also include numerical questions where the VLM must perform a specific calculation and be within a certain percentage of the answer to be considered correct. Often these questions and images come from academic sources, such as college-level textbooks.
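For reference, the multiple-choice accuracy described above reduces to a simple calculation; a minimal sketch with made-up predictions and answers:

```python
def multiple_choice_accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy example: 3 of 4 answers match, so accuracy is 0.75.
print(multiple_choice_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))
```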

How Are Vision Language Models Used?

VLMs are quickly becoming the go-to tool for all types of vision-related tasks due to their flexibility and natural language understanding. VLMs can be easily instructed to perform a wide variety of tasks through natural language:

  1. Visual question-answering
  2. Image and video summarization
  3. Parsing text and handwritten documents

Previous applications that would have required a large ensemble of specially trained models can now be accomplished with just a single VLM.

VLMs are especially good at summarizing the contents of images and can be prompted to perform specific tasks based on those contents. Take, for example, an education use case: a VLM could be given an image of a handwritten math problem, then use its optical character recognition and reasoning capabilities to interpret the problem and produce a step-by-step guide on how to solve it. VLMs not only understand the content of an image but can also reason about it and perform specific tasks.

Figure 5: Visual AI agents transform video and image data into real-world insights

With the vast amounts of video produced across industries every day, it's infeasible to manually review and extract insights from all of it. VLMs can be integrated into larger systems to build visual AI agents capable of detecting specific events when prompted. These systems could be used to detect malfunctioning robots in a warehouse or generate out-of-stock alerts when shelves are empty. Their general understanding goes beyond simple detection and could be used to generate automated reports. For example, an intelligent traffic system could detect, analyze, and produce reports of traffic hazards, such as fallen trees, stalled vehicles, or collisions.
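As a rough sketch of such an agent, the loop below periodically samples a frame from a camera feed and asks a VLM a yes/no question about it. The camera URL, the prompt, and the ask_vlm helper are hypothetical; ask_vlm would wrap whichever VLM API you use (for example, the chat-completion call shown earlier).

```python
import time
import cv2  # OpenCV, used here only to grab frames from a stream

def ask_vlm(frame, prompt):
    """Placeholder: send the frame and prompt to a VLM service and return its text reply."""
    raise NotImplementedError("wire this up to your VLM API of choice")

stream = cv2.VideoCapture("rtsp://camera.example/stream")  # hypothetical camera URL
while True:
    ok, frame = stream.read()
    if not ok:
        break
    answer = ask_vlm(frame, "Is any shelf in this image empty? Answer yes or no.")
    if answer.strip().lower().startswith("yes"):
        print("ALERT: possible out-of-stock shelf detected")
    time.sleep(30)  # sample the stream every 30 seconds
```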

VLMs can be used with technologies like graph databases to understand long videos. This helps them capture the complexity of objects and events in a video. Such systems could be used to summarize operations in a warehouse to find bottlenecks and inefficiencies or produce sports commentary for football, basketball, or soccer games.

What Are the Challenges of Vision Language Models?

Vision language models are maturing quickly, but they still have some limitations, particularly around spatial understanding and long-context video understanding.

Most VLMs use CLIP-based models as the vision encoder, which are limited to relatively small input sizes such as 224x224 or 336x336 pixels. This makes small objects and fine details hard to detect: a full HD 1920x1080 frame from a video, for example, must be downsized or cropped to a much smaller resolution, losing detail in the process. To address this, VLMs are starting to use tiling methods that break a large image into smaller pieces, each of which is fed into the model. There's also ongoing research into higher-resolution image encoders.
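A minimal sketch of the tiling idea: split a high-resolution image into fixed-size crops that roughly match the encoder's native input size (336x336 here). Real implementations differ in how they pad, overlap, or order the tiles; the file name below is a placeholder.

```python
from PIL import Image

def tile_image(path, tile_size=336):
    """Split an image into tile_size x tile_size crops (edge tiles may be smaller)."""
    image = Image.open(path)
    tiles = []
    for top in range(0, image.height, tile_size):
        for left in range(0, image.width, tile_size):
            box = (left, top,
                   min(left + tile_size, image.width),
                   min(top + tile_size, image.height))
            tiles.append(image.crop(box))
    return tiles

# A full HD frame produces a grid of tiles, each close to the encoder's native input size.
tiles = tile_image("frame_1080p.jpg")
print(len(tiles))
```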

VLMs also have difficulty providing precise locations for objects. The training data for CLIP-based vision encoders consists mostly of short text descriptions of images, like captions. These descriptions don't include detailed, fine-grained object locations, which limits CLIP's spatial understanding, and VLMs that use CLIP as their vision encoder inherit this limitation. New approaches are exploring ensembles of several vision encoders to address these limitations (arXiv:2408.15998).

Long video understanding is a challenge because answering questions about a video may require reasoning over visual information spread across hours of footage. Like LLMs, VLMs have a limited context length, meaning only a certain number of frames from a video can be included when answering questions. Approaches that increase context length and train VLMs on more video data are being researched, such as LongVILA (arXiv:2408.10188).
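One simple way to work within a limited context is to sample a fixed number of evenly spaced frames from the video. The sketch below shows only the index selection; how frames are decoded and encoded is left out, and the frame budget of 64 is an arbitrary illustration.

```python
def sample_frame_indices(total_frames, max_frames_in_context):
    """Pick evenly spaced frame indices so a long video fits the VLM's context window."""
    if total_frames <= max_frames_in_context:
        return list(range(total_frames))
    step = total_frames / max_frames_in_context
    return [int(i * step) for i in range(max_frames_in_context)]

# e.g., a 2-hour video at 30 fps (216,000 frames) reduced to 64 frames
print(sample_frame_indices(216_000, 64)[:5])  # [0, 3375, 6750, 10125, 13500]
```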

VLMs may not have seen enough data for very specific use cases, such as finding manufacturing defects in a particular product line. This limitation can be overcome by fine-tuning the VLM on domain-specific data or by using multi-image VLMs with in-context learning, where example images in the prompt teach the model new information without additional training. Training on domain-specific data with PEFT is another technique that can improve a VLM's accuracy on custom data.
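A minimal sketch of applying PEFT with LoRA via the Hugging Face peft library. The checkpoint name is hypothetical, and which modules to adapt (the LLM's attention projections, the projector, or both) varies by model, so the target_modules choice here is illustrative.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Hypothetical checkpoint name for a VLM with a Hugging Face-compatible interface.
base_model = AutoModelForVision2Seq.from_pretrained("example-org/example-vlm")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],   # a common, model-dependent choice
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```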

How Can You Get Started With Vision Language Models?

NVIDIA offers tools to ease the building and deployment of vision language models:

  • NVIDIA NIM™: NVIDIA NIM is a set of inference microservices that includes industry-standard APIs, domain-specific code, optimized inference engines, and enterprise runtime. Check out the VLM NIM microservices available today, along with the NIM reference workflows created to help you get started.
  • NVIDIA AI Blueprints: NVIDIA AI Blueprints are reference workflows for generative AI use cases, built with NVIDIA NIM microservices as part of the NVIDIA AI Enterprise Platform. The NVIDIA AI Blueprint for video search and summarization helps you build and customize interactive visual AI agents capable of understanding activity within massive volumes of live or archived video using vision VLMs, LLMs, and RAG.

Next Steps

Learn Visual AI Agents

A visual AI agent can combine both vision and language modalities to understand natural language prompts and perform visual question-answering.

Try NVIDIA AI Blueprint

Learn the technical details of the NVIDIA AI Blueprint for video search and summarization, integrating complex VLMs, LLMs, and RAG with supporting microservices.

Explore With NIM and Reference Workflows

Discover the technical details of NVIDIA VLM NIM microservices and reference workflows.
