The multimodal query feature in the NVIDIA RAG Blueprint enables you to query your knowledge base using both text and images. This is particularly useful for use cases where visual context enhances the query, such as:
- Product identification: "What is the price of this item?" + product image
- Document lookup: "Find documents related to this chart" + chart image
- Visual Q&A: "What material is this made of?" + product image
This feature combines:
- VLM Embeddings: `nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1` for creating multimodal embeddings that understand both text and images
- Vision-Language Model: `nvidia/nemotron-nano-12b-v2-vl` for generating intelligent responses based on visual and textual context
Before enabling multimodal query support, ensure you have:
- Obtained an API Key
- Deployed the NVIDIA RAG Blueprint
- An NVIDIA H100 or A100 GPU for on-prem deployments
Use this section to deploy multimodal query support with locally hosted NVIDIA NIMs.
Start the Milvus vector database service:
docker compose -f deploy/compose/vectordb.yaml up -d
Deploy the Vision-Language Model and multimodal embedding services:
# Create the model cache directory
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache
# Set your NGC API key
export NGC_API_KEY="nvapi-..."
# (Optional) Select a specific GPU for the VLM Microservice
# Use `nvidia-smi` to check available GPUs and set the desired GPU ID
export VLM_MS_GPU_ID=0 # Default is GPU 0; change to use a different GPU
# Deploy NIMs with VLM and VLM embedding profiles
USERID=$(id -u) docker compose --profile vlm-ingest --profile vlm-only -f deploy/compose/nims.yaml up -d
:::{warning}
The first deployment may take 10-20 minutes as models download (~10GB+). Subsequent deployments will be faster as models are cached.
:::
Monitor the deployment status:
watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'
Wait until the services show as healthy.
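Rather than watching the table by hand, the wait can be scripted. The following is a minimal sketch, not part of the blueprint: `wait_for_healthy` is a hypothetical helper that polls Docker's health status for one container until it reports `healthy` or a timeout expires. The 1200-second default mirrors the 20-minute upper bound mentioned in the warning; the container name you pass is an assumption that you need to take from your own `docker ps` output.

```shell
# Hypothetical helper (not shipped with the blueprint).
# Usage: wait_for_healthy CONTAINER [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# The body runs in a subshell, so its variables do not leak into your session.
wait_for_healthy() (
  container="$1"
  timeout="${2:-1200}"   # assumed default: 20 minutes
  interval="${3:-5}"
  waited=0
  while :; do
    # Empty if the container does not exist or defines no health check.
    state="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)"
    if [ "$state" = "healthy" ]; then
      echo "$container is healthy"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "$container not healthy after ${timeout}s (last status: ${state:-unknown})" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
)
```

For example, `wait_for_healthy <container-name> 1200 10` checks every 10 seconds. A container without a health check never reports `healthy`, so the helper would run until the timeout for such a container.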
Set the model names and service URLs for the RAG pipeline:
# VLM (Vision-Language Model) configuration
export APP_VLM_MODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
export APP_LLM_SERVERURL=""
# Multimodal embedding model configuration
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
export APP_EMBEDDINGS_SERVERURL="nemoretriever-vlm-embedding-ms:8000/v1"
Enable image extraction and storage during document ingestion:
# Configure image extraction
export APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY=""
export APP_NVINGEST_IMAGE_ELEMENTS_MODALITY="image"
export APP_NVINGEST_EXTRACTIMAGES="True"
# Disable reranker (not supported with multimodal queries)
export APP_RANKING_SERVERURL=""
Start the ingestor server:
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build
Verify that the ingestor server is healthy.
Start the RAG server:
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
Verify that the RAG server is healthy.
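If a server comes up without the expected multimodal settings, one possible cause is that the exports were made in a different shell from the one that ran `docker compose`. The sketch below is a hypothetical check, not part of the blueprint: `missing_env_vars` reports any named variable that is absent from the exported environment. It deliberately treats a variable exported as an empty string (such as `APP_LLM_SERVERURL=""`) as present, since this guide sets several variables to empty on purpose.

```shell
# Hypothetical helper (not shipped with the blueprint).
# Prints every named variable that is missing from the exported environment
# and returns non-zero if any is missing. Empty-but-exported counts as present.
# The body runs in a subshell, so its variables do not leak into your session.
missing_env_vars() (
  rc=0
  for name in "$@"; do
    # printenv succeeds when the variable is exported, even with an empty value.
    if ! printenv "$name" >/dev/null; then
      echo "not exported: $name"
      rc=1
    fi
  done
  return "$rc"
)
```

Run it in the shell that starts the servers, passing the variable names used in this guide, for example `missing_env_vars APP_VLM_MODELNAME APP_VLM_SERVERURL APP_LLM_SERVERURL APP_EMBEDDINGS_MODELNAME APP_EMBEDDINGS_SERVERURL APP_NVINGEST_EXTRACTIMAGES APP_RANKING_SERVERURL`. The check only confirms that a variable is exported, not that its value is correct.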
Check the status of all deployed containers:
docker ps --format "table {{.Names}}\t{{.Status}}"
Confirm that all containers are running and healthy.
Use this section to deploy multimodal query support using NVIDIA-hosted API endpoints.
:::{note} When using NVIDIA-hosted endpoints, you might encounter rate limiting with larger file ingestions (>10 files). For details, see Troubleshoot. :::
Start the Milvus vector database service:
docker compose -f deploy/compose/vectordb.yaml up -d
Open `deploy/compose/.env` and uncomment the section Endpoints for using cloud NIMs. Then set the environment variables by running the following code.
source deploy/compose/.env
# Set your NGC API key
export NGC_API_KEY="nvapi-..."
# VLM (Vision-Language Model) configuration - cloud hosted
export APP_VLM_MODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_VLM_SERVERURL="https://integrate.api.nvidia.com"
export APP_LLM_SERVERURL=""
# Multimodal embedding model configuration - cloud hosted
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
export APP_EMBEDDINGS_SERVERURL="https://integrate.api.nvidia.com/v1"
# Configure image extraction
export APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY=""
export APP_NVINGEST_IMAGE_ELEMENTS_MODALITY="image"
export APP_NVINGEST_EXTRACTIMAGES="True"
# Disable reranker (not supported with multimodal queries)
export APP_RANKING_SERVERURL=""
Start the ingestor server:
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build
Verify that the ingestor server is healthy.
Start the RAG server:
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
Verify that the RAG server is healthy.
Check the status of all deployed containers:
docker ps --format "table {{.Names}}\t{{.Status}}"
You should see output similar to the following:
NAMES STATUS
compose-nv-ingest-ms-runtime-1 Up 5 minutes (healthy)
ingestor-server Up 5 minutes
compose-redis-1 Up 5 minutes
rag-frontend Up 9 minutes
rag-server Up 9 minutes
milvus-standalone Up 36 minutes
milvus-minio Up 35 minutes (healthy)
milvus-etcd Up 35 minutes (healthy)
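As the sample output shows, not every container reports a health state, so scanning for `(healthy)` alone can mislead. The following sketch is a hypothetical filter, not part of the blueprint: `not_ready` reads tab-separated name and status pairs and prints only the containers that are not up, are still starting, or are marked unhealthy. It assumes the usual Docker status strings such as `Up 5 minutes (healthy)`, `Up 10 seconds (health: starting)`, and `Exited (1) 2 minutes ago`.

```shell
# Hypothetical helper (not shipped with the blueprint).
# Reads "NAME<TAB>STATUS" lines on stdin, prints containers that need attention,
# and returns non-zero if it found any.
# The body runs in a subshell, so its variables do not leak into your session.
not_ready() (
  tab="$(printf '\t')"
  rc=0
  while IFS="$tab" read -r name state; do
    [ -z "$name" ] && continue
    case "$state" in
      Up*"(unhealthy)"*|Up*"(health: starting)"*)
        echo "$name: $state"; rc=1 ;;
      Up*)
        ;;  # running; either healthy or no health check defined
      *)
        echo "$name: $state"; rc=1 ;;  # exited, restarting, created, ...
    esac
  done
  return "$rc"
)
```

A possible invocation is `docker ps -a --format '{{.Names}}\t{{.Status}}' | not_ready`, which drops the `table` keyword so that no header row is emitted and adds `-a` so that stopped containers are included.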
After deployment, you can start querying your knowledge base with both text and images.
- Web UI: Access the RAG frontend at http://localhost:8090 to experiment with multimodal queries through the user interface. For details, see User Interface for NVIDIA RAG Blueprint.
- Interactive Notebook: For a step-by-step guide with code examples covering collection creation, document ingestion, and querying with images, see the Multimodal Query Notebook.
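When sending an image programmatically, a common way to embed it in a JSON request is as a base64 data URL. The sketch below shows only that encoding step. `image_data_url` is a hypothetical helper, and whether the RAG server accepts images in this form is an assumption here; the Multimodal Query Notebook is the reference for the actual request format.

```shell
# Hypothetical helper (not shipped with the blueprint).
# Usage: image_data_url FILE [MIME_TYPE]
# Prints a data URL such as data:image/png;base64,....
# The body runs in a subshell, so its variables do not leak into your session.
image_data_url() (
  file="$1"
  mime="${2:-image/png}"   # assumed default; pass image/jpeg etc. as needed
  if [ ! -r "$file" ]; then
    echo "cannot read: $file" >&2
    return 1
  fi
  # Strip the line wrapping that some base64 implementations add.
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
)
```

For example, `image_data_url product.jpg image/jpeg` prints a single-line string that can be placed in a request body; `product.jpg` is a placeholder file name.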
- Reranker not supported: The reranker must be disabled (`enable_reranker: False`) for multimodal queries.
- Single-page retrieval for image queries: When an image is included in the query, the retrieval results are constrained to content from a single page per document. Multi-page context retrieval is not supported for image-based queries.