The Vision-Language Model (VLM) inference feature in the NVIDIA RAG Blueprint enhances the system's ability to understand and reason about visual content. Unlike traditional image upload systems, this feature operates on image citations that are internally discovered during the retrieval process.
:::{warning}
B200 GPUs are not supported for VLM based inferencing in RAG. For this feature, use H100 or A100 GPUs instead.
:::
Key Use Cases for VLM
- Documents with charts and graphs: Financial reports, scientific papers, business analytics
- Technical diagrams: Engineering schematics, architectural plans, flowcharts
- Visual data representations: Infographics, tables with visual elements, dashboards
- Mixed content documents: PDFs containing both text and images
- Image-heavy content: Catalogs, product documentation, visual guides
Key Benefits of VLM
- Seamless Multimodal Experience – Users don't need to manually upload images; visual content is automatically discovered and analyzed from images embedded in documents.
- Improved Accuracy – Enhanced response quality for documents containing images, charts, diagrams, and visual data.
- Quality Assurance – Internal reasoning ensures only relevant visual insights are used.
- Contextual Understanding – Visual analysis is performed in the context of the user's specific question.
- Fallback Handling – System gracefully handles cases where images are insufficient or irrelevant.
:::{warning}
Enabling VLM inference increases response latency because of the additional image processing and VLM inference time. Weigh this trade-off between accuracy and speed against your requirements.
:::
The VLM feature follows this flow:
- Automatic Image Discovery: When a user query is processed, the RAG system retrieves relevant documents from the vector database. If any of these documents contain images (charts, diagrams, photos, etc.), they are automatically identified.
- Image Captioning at Ingestion: During ingestion, images are extracted and captioned so they can be indexed and later cited for question answering.
- VLM Answer Generation: At query time, the RAG server sends the user question, conversation history, and cited images to a Vision-Language Model. The VLM directly generates the final answer for the user.
There is no separate LLM reasoning step that post-processes the VLM output. Once VLM inference is enabled, the VLM is responsible for generating the response (with the optional fallback behavior described below).
The VLM feature uses predefined prompts that can be customized to suit your specific needs:
- VLM Analysis Prompt: Located in `src/nvidia_rag/rag_server/prompt.yaml` under the `vlm_template` section.
To customize this prompt, follow the steps outlined in the prompt.yaml file for modifying prompt templates. The vlm_template controls how the question, textual context, and cited images are presented to the VLM.
The VLM model supports two modes that are controlled entirely via the vlm_template:
- Non-reasoning mode (default):
  - Template path ends with `/no_think`.
  - Default generation parameters:
    - `APP_VLM_TEMPERATURE=0.1`
    - `APP_VLM_TOP_P=1.0`
    - `APP_VLM_MAX_TOKENS=8192`
- Reasoning mode (chain-of-thought style):
  - Change the route in `vlm_template` from `/no_think` to `/think`.
  - Recommended generation parameters:
    - `APP_VLM_TEMPERATURE=0.3`
    - `APP_VLM_TOP_P=0.91`
    - `APP_VLM_MAX_TOKENS=8192`
You can set these parameters via environment variables (for example in docker-compose-rag-server.yaml) or directly through your deployment configuration.
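As a minimal sketch of switching between the two modes, the following shell helper prints the generation parameters listed above for a given mode. The function name `vlm_mode_env` is hypothetical and not part of the blueprint, and the template route itself still has to be changed in `prompt.yaml`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the blueprint): prints the generation
# parameters documented above for the chosen VLM mode.
vlm_mode_env() {
  case "$1" in
    no_think)
      # Non-reasoning mode (default)
      printf '%s\n' 'APP_VLM_TEMPERATURE=0.1' 'APP_VLM_TOP_P=1.0' 'APP_VLM_MAX_TOKENS=8192'
      ;;
    think)
      # Reasoning mode (chain-of-thought style)
      printf '%s\n' 'APP_VLM_TEMPERATURE=0.3' 'APP_VLM_TOP_P=0.91' 'APP_VLM_MAX_TOKENS=8192'
      ;;
    *)
      echo "usage: vlm_mode_env think|no_think" >&2
      return 1
      ;;
  esac
}

# Print the settings for reasoning mode.
vlm_mode_env think
```

One possible way to use it is `export $(vlm_mode_env think)` in the shell that later restarts the rag-server, so that Docker Compose picks the values up from the environment.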
Users interact with the system normally - they ask questions and receive responses. The VLM processing happens transparently in the background:
- User asks a question about content that may have visual elements
- System retrieves relevant documents including any images
- VLM analyzes images and text context if present and relevant
- User receives a single, coherent answer generated directly by the VLM
The following example, which uses the Ragbattle dataset, demonstrates the accuracy improvement from enabling VLM.
Using Deloitte's Tax transformation trends survey from May 2021, consider the following question:
What is the percentage of companies with NextGen ERP systems/Advanced that said the tax team was highly effective in advising the business on emerging compliance issues?
Before enabling VLM, the system answers 38% with an accuracy score of 0.0. After enabling VLM, the system answers 64% with an accuracy score of 1.0. The answer is found on page 21 of the PDF (page 20 of the document).
The following table shows some approximate accuracy improvements from enabling VLM.
| Query | Correct Answer | Answer Without VLM (Score) | Answer With VLM (Score) | Reason for Improvement |
|---|---|---|---|---|
| Percentage for "…NextGen ERP system/Advanced" on "Effectiveness of the tax team…" graph. | "64%" | "38%" (0.0) | "64%" (1.0) | Precise reading of a charted percentage. |
| Are Business development companies more or less flexible than Mezzanine funds? | "less flexible" | "more flexible" (0.0) | "less flexible" (1.0) | Correct comparative interpretation from a structured source. |
| Estimated cost of capital range for business development companies. | "SOFR+600 to 1,000" | "12-16%" (0.25) | "SOFR+600 to 1,000" (1.0) | Extracted the correct range from a structured chart. |
NVIDIA RAG uses the nemotron-nano-12b-v2-vl Vision-Language Model by default, provided as the vlm-ms service in deploy/compose/nims.yaml.
To start the local VLM NIM service and the other NIMs required for VLM-based generation, run:
```bash
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation up -d
```

This will launch the `vlm-ms` container (serving the model on port 1977, internal port 8000) together with the embedding and reranker microservices used by the RAG server.
By default, the vlm-ms service uses GPU ID 5. You can customize which GPU to use by setting the VLM_MS_GPU_ID environment variable before starting the service:
```bash
export VLM_MS_GPU_ID=2  # Use GPU 2 instead of GPU 5
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation up -d
```

Alternatively, you can modify the `nims.yaml` file directly to change the GPU assignment:
```yaml
# In deploy/compose/nims.yaml, locate the vlm-ms service and modify:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['${VLM_MS_GPU_ID:-5}']  # Change 5 to your desired GPU ID
          capabilities: [gpu]
```

:::{note}
Ensure the specified GPU is available and has sufficient memory for the VLM model.
:::
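One way to choose a value for `VLM_MS_GPU_ID` is to look at free memory per GPU. The sketch below is an assumption-laden helper, not part of the blueprint: the function name `pick_gpu_with_most_free_memory` is hypothetical, and it only picks the GPU with the most free memory without checking that the amount is enough for the model. It parses the `index, free_MiB` lines that `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits` prints.

```shell
#!/bin/sh
# Hypothetical helper: reads "index, free_MiB" lines on stdin (the format of
#   nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
# ) and prints the index of the GPU with the most free memory.
# Exits non-zero when no usable line was read.
pick_gpu_with_most_free_memory() {
  awk -F', *' '
    NF >= 2 && ($2 + 0 > best || !seen) { best = $2 + 0; id = $1; seen = 1 }
    END { if (seen) print id; else exit 1 }
  '
}

# Example with canned input, so the sketch runs without a GPU present.
printf '%s\n' '0, 1024' '1, 40960' '2, 8192' | pick_gpu_with_most_free_memory
```

On a GPU host, the result could be exported before starting the service, for example `export VLM_MS_GPU_ID="$(nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits | pick_gpu_with_most_free_memory)"`.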
For VLM-based generation to work correctly, images must be extracted and captioned during ingestion:
- In `deploy/compose/docker-compose-ingestor-server.yaml`, under the `ingestor-server` service, ensure:
  - `APP_NVINGEST_EXTRACTIMAGES` is set to `True` so images are extracted and stored.
  - Image captioning is enabled (by default, `APP_NVINGEST_CAPTIONMODELNAME` is set to `nvidia/nemotron-nano-12b-v2-vl` and `APP_NVINGEST_CAPTIONENDPOINTURL` points to the `vlm-ms` service).
When running with Docker Compose you can override these via environment variables, for example:
```bash
export APP_NVINGEST_EXTRACTIMAGES=True
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
```

This ensures that images are available as citations and can be sent to the VLM at query time.
Start only the required NIM services (VLM, Embedding, Reranker) using the vlm-generation profile defined in deploy/compose/nims.yaml:
```bash
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation up -d
```

This profile starts the following services and skips `nim-llm`:
- nemoretriever-embedding-ms
- nemoretriever-ranking-ms
- vlm-ms
Set the following environment variables in docker-compose-rag-server.yaml to enable VLM inference in the RAG server:

```bash
export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
# Apply by restarting rag-server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
```

- `ENABLE_VLM_INFERENCE`: Enables VLM inference in the RAG server.
- `APP_VLM_MODELNAME`: The name of the VLM model to use (default: `nvidia/nemotron-nano-12b-v2-vl`).
- `APP_VLM_SERVERURL`: The URL of the VLM NIM server (local or remote).
Once ENABLE_VLM_INFERENCE is set, the RAG server uses the VLM to generate the final answer. The VLM_TO_LLM_FALLBACK flag controls what happens when no images are available, as described later.
Continue following the rest of the steps in Deploy with Docker (Self-Hosted Models) to deploy the ingestion-server and rag-server containers.
To use a remote NVIDIA-hosted NIM for VLM inference:
- Set the `APP_VLM_SERVERURL` environment variable to the remote endpoint provided by NVIDIA:

```bash
export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_VLM_SERVERURL="https://integrate.api.nvidia.com/v1/"
# Apply by restarting rag-server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
```

Continue following the rest of the steps in Deploy with Docker (NVIDIA-Hosted Models) to deploy the ingestion-server and rag-server containers.
:::{note}
On-prem deployment of the VLM model requires an additional 1xH100 or 1xB200 GPU in the default deployment configuration. If MIG slicing is enabled on the cluster, assign a dedicated slice to the VLM. Check mig-deployment.md and values-mig.yaml for more information.
:::
To enable VLM inference in Helm-based deployments, follow these steps:
- Set VLM environment variables in `values.yaml`

  In your `values.yaml` file, under the `envVars` section, set the following environment variables:

  ```yaml
  ENABLE_VLM_INFERENCE: "true"
  APP_VLM_MODELNAME: "nvidia/nemotron-nano-12b-v2-vl"
  APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"  # Local VLM NIM endpoint
  ```

  Also enable the `nim-vlm` Helm chart:

  ```yaml
  nim-vlm:
    enabled: true
  ```

- Apply the updated Helm chart

  Run the following command to upgrade or install your deployment:

  ```bash
  helm upgrade --install rag -n <namespace> https://helm.ngc.nvidia.com/nvstaging/blueprint/charts/nvidia-blueprint-rag-v2.4.0-rc1.tgz \
    --username '$oauthtoken' \
    --password "${NGC_API_KEY}" \
    --set imagePullSecret.password=$NGC_API_KEY \
    --set ngcApiSecret.password=$NGC_API_KEY \
    -f deploy/helm/nvidia-blueprint-rag/values.yaml
  ```

- Check if the VLM pod has come up

  A pod named `rag-0` will start. This pod corresponds to the VLM model deployment.
```
rag rag-0 0/1 ContainerCreating 0 6m37s
```
:::{note}
For local VLM inference, ensure the VLM NIM service is running and accessible at the configured APP_VLM_SERVERURL. For remote endpoints, the NGC_API_KEY is required for authentication.
:::
VLM processing is triggered when:
- `ENABLE_VLM_INFERENCE` is set to `true`
- The VLM service is accessible and responding
Once VLM inference is enabled, the RAG server uses the VLM to generate the final answer. The VLM_TO_LLM_FALLBACK flag controls behavior only when no images are present in the query, messages, or retrieved context:
- If `VLM_TO_LLM_FALLBACK="false"` (default): the pipeline still routes generation through the VLM, even for text-only queries with no images.
- If `VLM_TO_LLM_FALLBACK="true"`: text-only queries (with no images in the query, messages, or context) fall back to the regular LLM-based RAG flow instead of calling the VLM.
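The routing rules above can be sketched as a small decision function. This is an illustration of the documented behavior only, not the RAG server's implementation: the function name `route_generation` and its argument order are hypothetical, it assumes that requests go to the regular LLM flow when VLM inference is disabled, and it compares lowercase `true`/`false` strings only.

```shell
#!/bin/sh
# Illustrative only: mirrors the routing rules described above.
# Usage: route_generation ENABLE_VLM_INFERENCE VLM_TO_LLM_FALLBACK IMAGE_COUNT
# Prints "vlm" or "llm".
route_generation() {
  enable_vlm="$1"
  fallback="$2"
  image_count="$3"
  if [ "$enable_vlm" != "true" ]; then
    echo "llm"   # assumption: VLM inference disabled means the regular LLM flow
  elif [ "$image_count" -eq 0 ] && [ "$fallback" = "true" ]; then
    echo "llm"   # text-only request with fallback enabled
  else
    echo "vlm"   # images present, or fallback disabled (default)
  fi
}

route_generation true false 0   # prints: vlm
route_generation true true 0    # prints: llm
route_generation true true 3    # prints: vlm
```

Here `IMAGE_COUNT` stands for the number of images found across the query, the messages, and the retrieved context.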
- Ensure the VLM NIM is running and accessible at the configured `APP_VLM_SERVERURL`.
- For remote endpoints, ensure your `NGC_API_KEY` is valid and has access to the requested model.
- Check rag-server logs for errors related to VLM inference or API authentication.
- Verify that images are properly ingested, captioned, and indexed in your knowledge base.
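For the first two checks, a quick reachability test can help. The sketch below assumes the VLM endpoint is OpenAI-compatible and answers `GET <base URL>/models`, which is not stated in this document, and the helper names `vlm_models_url` and `check_vlm_endpoint` are hypothetical. It also assumes you run it from the host, where the local service is reached through the published port 1977 rather than the `vlm-ms` hostname used inside the Compose network.

```shell
#!/bin/sh
# Hypothetical helpers for a quick reachability check.

# Builds the model-listing URL from a base URL such as http://localhost:1977/v1,
# tolerating a trailing slash.
vlm_models_url() {
  printf '%s/models\n' "${1%/}"
}

# Queries the endpoint, sending NGC_API_KEY as a bearer token when it is set.
# Returns non-zero if the request fails or the server answers with an HTTP error.
check_vlm_endpoint() {
  url="$(vlm_models_url "$1")"
  if [ -n "${NGC_API_KEY:-}" ]; then
    curl -sS -f --max-time 10 -H "Authorization: Bearer ${NGC_API_KEY}" "$url"
  else
    curl -sS -f --max-time 10 "$url"
  fi
}

# Usage (run manually; requires network access to the endpoint):
#   check_vlm_endpoint "http://localhost:1977/v1"
#   check_vlm_endpoint "https://integrate.api.nvidia.com/v1/"
```

A non-zero exit status from `check_vlm_endpoint` suggests that the service is down, the URL is wrong, or the key was rejected; the rag-server logs remain the place to confirm which.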
Control how many images are sent to the VLM per request:
- `APP_VLM_MAX_TOTAL_IMAGES` (default: 5): Maximum total images (from the query, conversation history, and retrieved context) that are included in the VLM prompt. The pipeline will never exceed this.
Example (docker compose):
```bash
export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MAX_TOTAL_IMAGES="5"
# Apply by restarting rag-server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
```

Important Notes:
- When this flag is enabled and images are provided as input (either from context or query), the VLM response will always be used as the final answer
- This mode is useful when you want pure visual analysis without additional text interpretation or reasoning
- The response will be based solely on what the VLM can extract from the images, without incorporating textual context from retrieved documents
To enable final-answer mode with Helm (skip nim-llm and return the VLM output directly):
- In your `values.yaml` for the chart at `deploy/helm/nvidia-blueprint-rag/values.yaml`, set the following under `envVars`:

  ```yaml
  ENABLE_VLM_INFERENCE: "true"
  ```

- Enable the VLM NIM and disable the LLM NIM:

  ```yaml
  nim-vlm:
    enabled: true
  nim-llm:
    enabled: false
  ```

- (Optional, recommended) Ensure features that depend on the LLM remain disabled:

  ```yaml
  ENABLE_QUERYREWRITER: "False"
  ENABLE_REFLECTION: "False"
  ```

- Apply or upgrade the release:

  ```bash
  helm upgrade --install rag -n <namespace> https://helm.ngc.nvidia.com/nvstaging/blueprint/charts/nvidia-blueprint-rag-v2.4.0-dev-rc1.tgz \
    --username '$oauthtoken' \
    --password "${NGC_API_KEY}" \
    --set imagePullSecret.password=$NGC_API_KEY \
    --set ngcApiSecret.password=$NGC_API_KEY \
    -f deploy/helm/nvidia-blueprint-rag/values.yaml
  ```

:::{note}
In this mode, the RAG server will use the VLM output as the final response. Keep the embedding and reranker services enabled as in the default chart configuration. If you use a local VLM, also set APP_VLM_SERVERURL (for example, http://nim-vlm:8000/v1) and enable the nim-vlm subchart as shown above.
:::
:::{warning}
The VLM receives the current user query, a truncated conversation history, and a textual summary of retrieved documents, together with any cited images. The effective context window of the VLM is limited, so very long conversations or large document contexts may be truncated.
:::
Mitigations:
- Keep user questions as self-contained as possible, especially in long-running conversations.
- Use retrieval and prompt tuning to focus the most relevant context for the VLM.