Vision-Language Model (VLM) for Generation for NVIDIA RAG Blueprint

The Vision-Language Model (VLM) inference feature in the NVIDIA RAG Blueprint enhances the system's ability to understand and reason about visual content. Unlike traditional image upload systems, this feature operates on image citations that are internally discovered during the retrieval process.

:::{warning} B200 GPUs are not supported for VLM based inferencing in RAG. For this feature, use H100 or A100 GPUs instead. :::

Key Use Cases for VLM
- Documents with charts and graphs: Financial reports, scientific papers, business analytics.
- Technical diagrams: Engineering schematics, architectural plans, flowcharts
- Visual data representations: Infographics, tables with visual elements, dashboards
- Mixed content documents: PDFs containing both text and images
- Image-heavy content: Catalogs, product documentation, visual guides
Key Benefits of VLM
- Seamless Multimodal Experience – Users don't need to manually upload images; visual content is automatically discovered and analyzed from images embedded in documents.
- Improved Accuracy – Enhanced response quality for documents containing images, charts, diagrams, and visual data.
- Quality Assurance – Internal reasoning ensures only relevant visual insights are used.
- Contextual Understanding – Visual analysis is performed in the context of the user's specific question.
- Fallback Handling – System gracefully handles cases where images are insufficient or irrelevant.

:::{warning} Enabling VLM inference increases response latency from additional image processing and VLM model inference time. Consider this trade-off between accuracy and speed based on your requirements. :::

How VLM Works in the RAG Pipeline

The VLM feature follows this flow:

Automatic Image Discovery: When a user query is processed, the RAG system retrieves relevant documents from the vector database. If any of these documents contain images (charts, diagrams, photos, etc.), they are automatically identified.
Image Captioning at Ingestion: During ingestion, images are extracted and captioned so they can be indexed and later cited for question answering.
VLM Answer Generation: At query time, the RAG server sends the user question, conversation history, and cited images to a Vision-Language Model. The VLM directly generates the final answer for the user.

There is no separate LLM reasoning step that post-processes the VLM output—once VLM inference is enabled, the VLM is responsible for generating the response (with optional fallback behavior described below).

Prompt customization

The VLM feature uses predefined prompts that can be customized to suit your specific needs:

VLM Analysis Prompt: Located in src/nvidia_rag/rag_server/prompt.yaml under the vlm_template section.

To customize this prompt, follow the steps outlined in the prompt.yaml file for modifying prompt templates. The vlm_template controls how the question, textual context, and cited images are presented to the VLM.

VLM reasoning vs. non-reasoning mode

The VLM model supports two modes that are controlled entirely via the vlm_template:

Non-reasoning mode (default):
- Template path ends with /no_think.
- Default generation parameters:
  - APP_VLM_TEMPERATURE=0.1
  - APP_VLM_TOP_P=1.0
  - APP_VLM_MAX_TOKENS=8192
Reasoning mode (chain-of-thought style):
- Change the route in vlm_template from /no_think to /think.
- Recommended generation parameters:
  - APP_VLM_TEMPERATURE=0.3
  - APP_VLM_TOP_P=0.91
  - APP_VLM_MAX_TOKENS=8192

You can set these parameters via environment variables (for example in docker-compose-rag-server.yaml) or directly through your deployment configuration.

What Users Experience

Users interact with the system normally - they ask questions and receive responses. The VLM processing happens transparently in the background:

User asks a question about content that may have visual elements
System retrieves relevant documents including any images
VLM analyzes images and text context if present and relevant
User receives a single, coherent answer generated directly by the VLM

Accuracy Improvement Example

The following example that uses the Ragbattle dataset demonstrates the accuracy improvement from enabling VLM.

Using the Deloitte's Tax transformation trends survey from May 2021 and the following question:

What is the percentage of companies with NextGen ERP systems/Advanced that said the tax team was highly effective in advising the business on emerging compliance issues?

Before enabling VLM, the system answers 38% with an accuracy score of 0.0. After enabling VLM, the system answers 64% with an accuracy score of 1.0. The answer is found on page 21 of the PDF (page 20 of the document).

The following table shows some approximate accuracy improvements from enabling VLM.

Query	Correct Answer	Answer Without VLM (Score)	Answer With VLM (Score)	Reason for Improvement
Percentage for "…NextGen ERP system/Advanced" on "Effectiveness of the tax team…" graph.	"64%"	"38%" (0.0)	"64%" (1.0)	Precise reading of a charted percentage.
Are Business development companies more or less flexible than Mezzanine funds?	"less flexible"	"more flexible" (0.0)	"less flexible" (1.0)	Correct comparative interpretation from a structured source.
Estimated cost of capital range for business development companies.	"SOFR+600 to 1,000"	"12-16%" (0.25)	"SOFR+600 to 1,000" (1.0)	Extracted the correct range from a structured chart.

Start the VLM NIM Service (Local)

NVIDIA RAG uses the nemotron-nano-12b-v2-vl Vision-Language Model by default, provided as the vlm-ms service in deploy/compose/nims.yaml.

To start the local VLM NIM service and the other NIMs required for VLM-based generation, run:

USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation up -d

This will launch the vlm-ms container (serving the model on port 1977, internal port 8000) together with the embedding and reranker microservices used by the RAG server.

Customizing GPU Usage for VLM Service (Optional)

By default, the vlm-ms service uses GPU ID 5. You can customize which GPU to use by setting the VLM_MS_GPU_ID environment variable before starting the service:

export VLM_MS_GPU_ID=2  # Use GPU 2 instead of GPU 5
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation up -d

Alternatively, you can modify the nims.yaml file directly to change the GPU assignment:

# In deploy/compose/nims.yaml, locate the vlm-ms service and modify:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['${VLM_MS_GPU_ID:-5}']  # Change 5 to your desired GPU ID
          capabilities: [gpu]

:::{note} Ensure the specified GPU is available and has sufficient memory for the VLM model. :::

Enable image extraction and captioning for VLM

For VLM-based generation to work correctly, images must be extracted and captioned during ingestion:

In deploy/compose/docker-compose-ingestor-server.yaml, under the ingestor-server service, ensure:
- APP_NVINGEST_EXTRACTIMAGES is set to True so images are extracted and stored.
- Image captioning is enabled (by default, APP_NVINGEST_CAPTIONMODELNAME is set to nvidia/nemotron-nano-12b-v2-vl and APP_NVINGEST_CAPTIONENDPOINTURL points to the vlm-ms service).

When running with Docker Compose you can override these via environment variables, for example:

export APP_NVINGEST_EXTRACTIMAGES=True

docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d

This ensures that images are available as citations and can be sent to the VLM at query time.

Enable VLM Inference in RAG Server

Start only the required NIM services (VLM, Embedding, Reranker) using the vlm-generation profile defined in deploy/compose/nims.yaml:

USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation up -d

This profile starts the following services and skips nim-llm:

nemoretriever-embedding-ms
nemoretriever-ranking-ms
vlm-ms

Set the following environment variables in docker-compose-rag-server.yaml to enable VLM inference in RAG server:

export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"

# Apply by restarting rag-server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

ENABLE_VLM_INFERENCE: Enables VLM inference in the RAG server.
APP_VLM_MODELNAME: The name of the VLM model to use (default: nvidia/nemotron-nano-12b-v2-vl).
APP_VLM_SERVERURL: The URL of the VLM NIM server (local or remote).

Once ENABLE_VLM_INFERENCE is set, the RAG server uses the VLM to generate the final answer. The VLM_TO_LLM_FALLBACK flag controls what happens when no images are available, as described later.

Continue following the rest of the steps in Deploy with Docker (Self-Hosted Models) to deploy the ingestion-server and rag-server containers.

Using a Remote NVIDIA-Hosted NIM Endpoint (Optional)

To use a remote NVIDIA-hosted NIM for VLM inference:

Set the APP_VLM_SERVERURL environment variable to the remote endpoint provided by NVIDIA:

export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_VLM_SERVERURL="https://integrate.api.nvidia.com/v1/"

# Apply by restarting rag-server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

Continue following the rest of the steps in Deploy with Docker (NVIDIA-Hosted Models) to deploy the ingestion-server and rag-server containers.

Using Helm Chart Deployment

:::{note} On prem deployment of the VLM model requires an additional 1xH100 or 1xB200 GPU in default deployment configuration. If MIG slicing is enabled on the cluster, ensure to assign a dedicated slice to the VLM. Check mig-deployment.md and values-mig.yaml for more information. :::

To enable VLM inference in Helm-based deployments, follow these steps:

Set VLM environment variables in values.yaml

In your values.yaml file, under the envVars section, set the following environment variables:

ENABLE_VLM_INFERENCE: "true"
APP_VLM_MODELNAME: "nvidia/nemotron-nano-12b-v2-vl"
APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"  # Local VLM NIM endpoint

Also enable the nim-vlm helm chart

nim-vlm:
  enabled: true

Apply the updated Helm chart

Run the following command to upgrade or install your deployment:

helm upgrade --install rag -n <namespace> https://helm.ngc.nvidia.com/nvstaging/blueprint/charts/nvidia-blueprint-rag-v2.4.0-rc1.tgz \
  --username '$oauthtoken' \
  --password "${NGC_API_KEY}" \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  -f deploy/helm/nvidia-blueprint-rag/values.yaml

Check if the VLM pod has come up

A pod with the name rag-0 will start, this pod corresponds to the VLM model deployment.

```
  rag       rag-0       0/1     ContainerCreating   0          6m37s
```

:::{note} For local VLM inference, ensure the VLM NIM service is running and accessible at the configured APP_VLM_SERVERURL. For remote endpoints, the NGC_API_KEY is required for authentication. :::

When VLM Processing Occurs

VLM processing is triggered when:

ENABLE_VLM_INFERENCE is set to true
The VLM service is accessible and responding

Once VLM inference is enabled, the RAG server uses the VLM to generate the final answer. The VLM_TO_LLM_FALLBACK flag controls behavior only when no images are present in the query, messages, or retrieved context:

If VLM_TO_LLM_FALLBACK="false" (default): the pipeline still routes generation through the VLM, even for text-only queries with no images.
If VLM_TO_LLM_FALLBACK="true": text-only queries (with no images in the query, messages, or context) fall back to the regular LLM-based RAG flow instead of calling the VLM.

Troubleshooting

Ensure the VLM NIM is running and accessible at the configured APP_VLM_SERVERURL.
For remote endpoints, ensure your NGC_API_KEY is valid and has access to the requested model.
Check rag-server logs for errors related to VLM inference or API authentication.
Verify that images are properly ingested, captioned, and indexed in your knowledge base.

Configure VLM image limits

Control how many images are sent to the VLM per request:

APP_VLM_MAX_TOTAL_IMAGES (default: 5): Maximum total images (from the query, conversation history, and retrieved context) that are included in the VLM prompt. The pipeline will never exceed this.

Example (docker compose):

export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MAX_TOTAL_IMAGES="5"

# Apply by restarting rag-server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

Important Notes:

When this flag is enabled and images are provided as input (either from context or query), the VLM response will always be used as the final answer
This mode is useful when you want pure visual analysis without additional text interpretation or reasoning
The response will be based solely on what the VLM can extract from the images, without incorporating textual context from retrieved documents

Use VLM response as the final answer (Helm)

To enable final-answer mode with Helm (skip nim-llm and return the VLM output directly):

In your values.yaml for the chart at deploy/helm/nvidia-blueprint-rag/values.yaml, set the following under envVars:

ENABLE_VLM_INFERENCE: "true"

Enable the VLM NIM and disable the LLM NIM:

nim-vlm:
  enabled: true

nim-llm:
  enabled: false

(Optional, recommended) Ensure features that depend on the LLM remain disabled:

ENABLE_QUERYREWRITER: "False"
ENABLE_REFLECTION: "False"

Apply or upgrade the release:

helm upgrade --install rag -n <namespace> https://helm.ngc.nvidia.com/nvstaging/blueprint/charts/nvidia-blueprint-rag-v2.4.0-dev-rc1.tgz \
  --username '$oauthtoken' \
  --password "${NGC_API_KEY}" \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  -f deploy/helm/nvidia-blueprint-rag/values.yaml

:::{note} In this mode, the RAG server will use the VLM output as the final response. Keep the embedding and reranker services enabled as in the default chart configuration. If you use a local VLM, also set APP_VLM_SERVERURL (for example, http://nim-vlm:8000/v1) and enable the nim-vlm subchart as shown above. :::

Conversation history and context limitations

:::{warning} The VLM receives the current user query, a truncated conversation history, and a textual summary of retrieved documents, together with any cited images. The effective context window of the VLM is limited, so very long conversations or large document contexts may be truncated. :::

Mitigations:

Keep user questions as self-contained as possible, especially in long-running conversations.
Use retrieval and prompt tuning to focus the most relevant context for the VLM.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vision-Language Model (VLM) for Generation for NVIDIA RAG Blueprint

How VLM Works in the RAG Pipeline

Prompt customization

VLM reasoning vs. non-reasoning mode

What Users Experience

Accuracy Improvement Example

Start the VLM NIM Service (Local)

Customizing GPU Usage for VLM Service (Optional)

Enable image extraction and captioning for VLM

Enable VLM Inference in RAG Server

Using a Remote NVIDIA-Hosted NIM Endpoint (Optional)

Using Helm Chart Deployment

When VLM Processing Occurs

Troubleshooting

Configure VLM image limits

Use VLM response as the final answer (Helm)

Conversation history and context limitations

Related Topics

FilesExpand file tree

vlm.md

Latest commit

History

vlm.md

File metadata and controls

Vision-Language Model (VLM) for Generation for NVIDIA RAG Blueprint

How VLM Works in the RAG Pipeline

Prompt customization

VLM reasoning vs. non-reasoning mode

What Users Experience

Accuracy Improvement Example

Start the VLM NIM Service (Local)

Customizing GPU Usage for VLM Service (Optional)

Enable image extraction and captioning for VLM

Enable VLM Inference in RAG Server

Using a Remote NVIDIA-Hosted NIM Endpoint (Optional)

Using Helm Chart Deployment

When VLM Processing Occurs

Troubleshooting

Configure VLM image limits

Use VLM response as the final answer (Helm)

Conversation history and context limitations

Related Topics