This guide outlines the steps for deploying and running the RAG QuickStart on an OpenShift cluster.
- Prerequisites
- Supported Models
- Installing the RAG QuickStart
- Using the RAG UI
- Environment Variables
- Adding a new model
- Uninstalling the RAG QuickStart
- OpenShift Cluster 4.16+ with OpenShift AI
- OpenShift Client CLI - oc
- Helm CLI - helm
- huggingface-cli (optional)
- 1 GPU/HPU with 24GB of VRAM for the LLM, refer to the chart below
- 1 GPU/HPU with 24GB of VRAM for the safety/shield model (optional)
- Hugging Face Token
- Access to Meta Llama model.
- Access to Meta Llama Guard model.
- Some of the example scripts use jq, a JSON parsing utility, which you can install via brew install jq
| Function | Model Name | Hardware | AWS |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 | CPU/GPU/HPU | |
| Generation | meta-llama/Llama-3.2-3B-Instruct | L4/HPU | g6.2xlarge |
| Generation | meta-llama/Llama-3.1-8B-Instruct | L4/HPU | g6.2xlarge |
| Generation | meta-llama/Meta-Llama-3-70B-Instruct | A100 x2/HPU | p4d.24xlarge |
| Safety | meta-llama/Llama-Guard-3-8B | L4/HPU | g6.2xlarge |
Note: the 70B model is NOT required for initial testing of this example. The safety/shield model Llama-Guard-3-8B is also optional.
Clone the repo so you have a working copy
git clone https://github.com/rh-ai-quickstart/RAG
Log in to your OpenShift cluster
oc login --server="<cluster-api-endpoint>" --token="sha256~XYZ"
Determine what hardware acceleration is available in your cluster and configure accordingly.
For NVIDIA GPU nodes: If GPU nodes are tainted, find the taint key. In the example below the key for the taint is nvidia.com/gpu
oc get nodes -l nvidia.com/gpu.present=true -o yaml | grep -A 3 taint
For Intel Gaudi HPU nodes: If HPU nodes are tainted, find the taint key. The taint key is typically habana.ai/gaudi
oc get nodes -l habana.ai/gaudi.present=true -o yaml | grep -A 3 taint
The output of either command may be something like below:
taints:
- effect: NoSchedule
  key: nvidia.com/gpu # or habana.ai/gaudi for HPU
  value: "true"
You can work with your OpenShift cluster admin team to determine which labels and taints identify GPU-enabled or HPU-enabled worker nodes. It is also possible that all your worker nodes have accelerators and therefore carry no distinguishing taint.
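As a rough sketch of how the taint lookup above could be condensed, the helper below reduces a node listing to its distinct taint keys. The function name unique_taint_keys is hypothetical and not part of the QuickStart. It assumes one input line per node in the form name, a tab, then space-separated taint keys, which the jsonpath query shown in the trailing comment should produce.

```shell
# Hypothetical helper, not part of the QuickStart.
# Input:  lines of "<node-name><TAB><space-separated taint keys>"
# Output: each distinct taint key once, sorted
unique_taint_keys() {
  awk -F '\t' '{
    n = split($2, keys, " ")
    for (i = 1; i <= n; i++) print keys[i]
  }' | sort -u
}

# Example usage against a cluster (assumes you are already logged in with oc):
# oc get nodes -l nvidia.com/gpu.present=true \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
#   | unique_taint_keys
```

Nodes without taints contribute an empty key list, so an empty result suggests no toleration is needed for the selected nodes.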
Navigate to Helm deploy directory
cd deploy/helm
List available models
make list-models
The above command will list the models to use in the next command
(Output)
model: llama-3-1-8b-instruct (meta-llama/Llama-3.1-8B-Instruct)
model: llama-3-2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
model: llama-3-2-1b-instruct-quantized (RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8)
model: llama-3-2-3b-instruct (meta-llama/Llama-3.2-3B-Instruct)
model: llama-3-3-70b-instruct (meta-llama/Llama-3.3-70B-Instruct)
model: llama-guard-3-1b (meta-llama/Llama-Guard-3-1B)
model: llama-guard-3-8b (meta-llama/Llama-Guard-3-8B)
The "guard" models can be used to test shields for profanity, hate speech, violence, etc.
You can configure your deployment in two ways:
- Using a configuration file (recommended for complex setups, multiple models, or persistent configuration)
- Using command-line parameters (quick deployments, see step 7 Option B)
To use the configuration file approach, initialize it. This will create a rag-values.yaml file from the example template:
make init-config
The system will display a configuration banner prompting you to edit the file. Open a new terminal window and edit the configuration:
# Edit with your preferred editor
nano rag-values.yaml
# or
vim rag-values.yaml
Important: You MUST configure at least:
- Enable at least ONE model in the global.models section by setting enabled: true
- Add your Hugging Face token (get it from https://huggingface.co/settings/tokens)
- (Optional) Add your TAVILY API key for web search functionality
- (Optional) Configure tolerations if your nodes are tainted (see step 3)
Example model configuration:
global:
  models:
    llama-3-2-3b-instruct:
      id: meta-llama/Llama-3.2-3B-Instruct
      enabled: true
      device: "gpu" # Options: "cpu", "gpu", "hpu"
      resources:
        limits:
          nvidia.com/gpu: "1"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: NoSchedule
Quick configuration commands:
# Interactively configure API keys
make configure-keys
# View current configuration
make show-config
# Validate configuration
make validate-config
There are two ways to deploy: using the configuration file or using command-line parameters.
After configuring the rag-values.yaml file in step 6, deploy using make:
make install NAMESPACE=llama-stack-rag
The system will:
- Validate your configuration
- Prompt for any missing API keys (Hugging Face token, TAVILY key)
- Deploy all configured services using the models and settings from rag-values.yaml
You can also deploy by passing configuration parameters directly via the command line. This approach is useful for quick deployments or CI/CD pipelines.
GPU Deployment Examples (Default):
To install only the RAG example, no shields:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu"
To install both the RAG example and the guard model for shields:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="nvidia.com/gpu"
Note: DEVICE=gpu is the default and can be omitted.
Intel Gaudi HPU Deployment Examples:
To install only the RAG example on Intel Gaudi HPU nodes:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="habana.ai/gaudi" DEVICE=hpu
To install both the RAG example and guard model on Intel Gaudi HPU nodes:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="habana.ai/gaudi" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="habana.ai/gaudi" DEVICE=hpu
CPU Deployment Example:
To install on CPU nodes only:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct DEVICE=cpu
Simplified Commands (No Tolerations Needed):
If you have no tainted nodes (all worker nodes have accelerators), you can use simplified commands:
# GPU deployment (default - DEVICE=gpu can be omitted)
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b
# HPU deployment
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b DEVICE=hpu
# CPU deployment
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b DEVICE=cpu
Note: When using command-line parameters, the rag-values.yaml file will still be created from the example template if it doesn't exist. Command-line parameters will override the model settings in the values file.
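For CI/CD use, the parameter combinations above could be assembled by a small wrapper. The sketch below is one possible way to do that; build_install_cmd is a hypothetical name, not something the QuickStart Makefile provides. It only prints the command, using the NAMESPACE, LLM, LLM_TOLERATION, and DEVICE parameters documented above.

```shell
# Hypothetical wrapper, not part of the QuickStart Makefile.
# Usage: build_install_cmd <namespace> <llm> [device] [taint-key]
build_install_cmd() {
  namespace=$1
  llm=$2
  device=${3:-gpu}     # gpu is the documented default
  toleration=${4:-}    # leave empty when nodes are not tainted

  cmd="make install NAMESPACE=$namespace LLM=$llm"
  if [ -n "$toleration" ]; then
    cmd="$cmd LLM_TOLERATION=$toleration"
  fi
  # DEVICE is only appended when it differs from the default
  if [ "$device" != "gpu" ]; then
    cmd="$cmd DEVICE=$device"
  fi
  printf '%s\n' "$cmd"
}
```

Printing the command instead of running it lets a pipeline log or review the exact invocation before executing it from deploy/helm.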
When prompted, enter your Hugging Face Token.
This process may take 10 to 30 minutes depending on the number and size of models to be downloaded.
Watch/Monitor
oc get pods -n llama-stack-rag
(Output)
NAME READY STATUS RESTARTS AGE
demo-rag-vector-db-v1-0-8mkf9 0/1 Completed 0 10m
ds-pipeline-dspa-7788689675-9489m 2/2 Running 0 10m
ds-pipeline-metadata-envoy-dspa-948676f89-8knw8 2/2 Running 0 10m
ds-pipeline-metadata-grpc-dspa-7b4bf6c977-cb72m 1/1 Running 0 10m
ds-pipeline-persistenceagent-dspa-ff9bdfc76-ngddb 1/1 Running 0 10m
ds-pipeline-scheduledworkflow-dspa-7b64d87fd8-58d87 1/1 Running 0 10m
ds-pipeline-workflow-controller-dspa-5799548b68-bxpdp 1/1 Running 0 10m
fetch-and-store-pipeline-tmxwj-system-container-driver-287597120 0/2 Completed 0 3m43s
fetch-and-store-pipeline-tmxwj-system-container-driver-922184592 0/2 Completed 0 2m54s
fetch-and-store-pipeline-tmxwj-system-container-impl-3210250134 0/2 Completed 0 4m33s
fetch-and-store-pipeline-tmxwj-system-container-impl-3248801382 0/2 Completed 0 3m32s
fetch-and-store-pipeline-tmxwj-system-dag-driver-3443954210 0/2 Completed 0 4m6s
llama-3-2-3b-instruct-predictor-00001-deployment-6bbf96f8674677 3/3 Running 0 10m
llamastack-6d5c5b999b-5lffb 1/1 Running 0 11m
mariadb-dspa-74744d65bd-fdxjd 1/1 Running 0 10m
minio-0 1/1 Running 0 10m
minio-dspa-7bb47d68b4-nvw7t 1/1 Running 0 10m
pgvector-0 1/1 Running 0 10m
rag-7fd7b47844-nlfvr 1/1 Running 0 11m
rag-mcp-weather-9cc97d574-nf5q8 1/1 Running 0 11m
rag-pipeline-notebook-0 2/2 Running 0 10m
upload-sample-docs-job-f5k5w 0/1 Completed 0 10m
Verify deployment:
oc get pods -n llama-stack-rag
oc get svc -n llama-stack-rag
oc get routes -n llama-stack-rag
Note: The key pods to watch include predictor in their name; those are the KServe model servers running vLLM.
oc get pods -l component=predictor
Look for 3/3 under the Ready column.
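To make the Ready-column check less manual, a filter along the following lines could list only the pods that are not fully ready. This is a sketch; not_ready is a hypothetical helper, and it assumes the default oc get pods column layout (NAME, READY, STATUS) shown in the output above.

```shell
# Hypothetical filter, not part of the QuickStart.
# Reads `oc get pods` output on stdin and prints the names of pods whose
# READY column is not N/N. Completed pods are skipped, since 0/1 or 0/2
# is expected for finished jobs.
not_ready() {
  awk '$1 != "NAME" && $3 != "Completed" {
    split($2, ready, "/")
    if (ready[1] != ready[2]) print $1
  }'
}

# Example usage (assumes the llama-stack-rag namespace used in this guide):
# oc get pods -n llama-stack-rag | not_ready
```

An empty result means every non-completed pod reports all of its containers ready.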
The inferenceservice CR describes the limits, requests, model name, serving-runtime, chat-template, etc.
oc get inferenceservice llama-3-2-3b-instruct \
-n llama-stack-rag \
-o jsonpath='{.spec.predictor.model}' | jq
Watch the llamastack pod as that one becomes available after all the model servers are up.
oc get pods -l app.kubernetes.io/name=llamastack
Navigate to OpenShift AI Dashboard and verify the deployment:
- Get the OpenShift AI Dashboard route:
oc get routes rhods-dashboard -n redhat-ods-applications
- Log in to the OpenShift AI Dashboard and find the llama-stack-rag project.
- You should see a running workbench with Jupyter Notebook.
If you want to use the pre-ingestion pipeline for batch document processing, configure Kubeflow Pipelines with object storage:
Get MinIO credentials:
MINIO_API="https://$(oc get route minio-api -o jsonpath='{.spec.host}')"
B64_USER=$(oc get secret minio -o jsonpath='{.data.username}')
MINIO_USER=$(echo $B64_USER | base64 --decode)
echo "user: $MINIO_USER"
B64_PASSWORD=$(oc get secret minio -o jsonpath='{.data.password}')
MINIO_PASSWORD=$(echo $B64_PASSWORD | base64 --decode)
echo "password: $MINIO_PASSWORD"
Configure Kubeflow Pipeline:
Navigate to Kubeflow Pipelines in OpenShift AI and configure with these values:
- Access Key: minio_rag_user (value of $MINIO_USER)
- Secret Key: minio_rag_password (value of $MINIO_PASSWORD)
- Endpoint: Value of $MINIO_API
- Region: us-east-1
- Bucket: documents
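The credential lookup above repeats the same fetch-and-decode pattern for each field. A possible refactoring is sketched below; decode_b64 and secret_field are hypothetical helper names, and the default namespace llama-stack-rag is an assumption taken from the install commands in this guide.

```shell
# Hypothetical helpers, not part of the QuickStart.

# Decode a base64 string passed as the first argument.
decode_b64() {
  printf '%s' "$1" | base64 --decode
}

# Read one field of a secret and decode it.
# Usage: secret_field <secret-name> <field> [namespace]
secret_field() {
  decode_b64 "$(oc get secret "$1" -n "${3:-llama-stack-rag}" -o jsonpath="{.data.$2}")"
}

# Example usage (assumes the minio secret queried above):
# MINIO_USER=$(secret_field minio username)
# MINIO_PASSWORD=$(secret_field minio password)
```

Quoting the arguments keeps values containing spaces or shell metacharacters intact, which the unquoted echo form does not guarantee.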
Using MinIO CLI (optional):
Install MinIO CLI:
brew install minio/stable/mc
Configure MinIO alias:
mc alias set minio $MINIO_API $MINIO_USER $MINIO_PASSWORD
Create bucket (if not present):
mc mb minio/documents
Upload documents:
mc cp ~/my-documents/my.pdf minio/documents
Access MinIO WebUI:
MINIO_WEB="https://$(oc get route minio-webui -o jsonpath='{.spec.host}')"
open $MINIO_WEB
Once configured, you can run the ingestion pipeline from the Jupyter notebook:
This will create pipelines and runs in Kubeflow:
To verify that documents have been successfully embedded and stored:
oc exec -it pgvector-0 -- psql -d rag_blueprint -U postgres
-- List tables
\dt
-- View vector store structure
\d+ vector_store_rag_vector_db
-- Count embedded documents
SELECT COUNT(*) FROM vector_store_rag_vector_db;
Example output:
List of relations
Schema | Name | Type | Owner
--------+----------------------------+-------+----------
public | metadata_store | table | postgres
public | vector_store_rag_vector_db | table | postgres
count
-------
154
- Get the route URL for the application and open it in your browser
URL=http://$(oc get routes -l app.kubernetes.io/name=rag -o jsonpath="{range .items[*]}{.status.ingress[0].host}{end}")
echo $URL
open $URL
- Click on Upload Documents
- Upload your PDF document
- Name and create a vector database
- Once you've received the message "Vector database created successfully!", navigate back to Chat and select the newly created vector DB.
- Ask a question pertaining to your document!
For batch document ingestion using Kubeflow Pipelines, refer to the Verify Installation section above.
The RAG application uses environment variables for configuration. These are managed through the Helm values file (deploy/helm/rag/values.yaml).
| Environment Variable | Description | Default Value | Configuration Location |
|---|---|---|---|
| LLAMA_STACK_ENDPOINT | The endpoint for the Llama Stack API server | http://llamastack:8321 | env: section in values.yaml |

| Environment Variable | Description | Default Value | Configuration Location |
|---|---|---|---|
| TAVILY_SEARCH_API_KEY | API key for Tavily search provider (optional) | Paste-your-key-here | llama-stack.secrets: in values.yaml |
| FIREWORKS_API_KEY | API key for Fireworks AI provider (optional) | (not set) | llama-stack.secrets: in values.yaml |
| TOGETHER_API_KEY | API key for Together AI provider (optional) | (not set) | llama-stack.secrets: in values.yaml |
| SAMBANOVA_API_KEY | API key for SambaNova provider (optional) | (not set) | llama-stack.secrets: in values.yaml |
| OPENAI_API_KEY | API key for OpenAI provider (optional) | (not set) | llama-stack.secrets: in values.yaml |
To set environment variables, edit deploy/helm/rag/values.yaml before installation:
For RAG UI variables:
env:
  - name: LLAMA_STACK_ENDPOINT
    value: 'http://llamastack:8321'
For Llama Stack secrets (API keys):
llama-stack:
  secrets:
    TAVILY_SEARCH_API_KEY: "your-actual-api-key-here"
    FIREWORKS_API_KEY: "your-fireworks-key"
    TOGETHER_API_KEY: "your-together-key"
Note: For the default deployment, only TAVILY_SEARCH_API_KEY may be needed if you want to enable web search capabilities. Other API keys are only required if you want to use external AI providers.
After modifying the values, redeploy using the same make install command.
To add another model follow these steps:
- Edit deploy/helm/rag-values.yaml (your configuration file) and update the global.models section
global:
  models:
    granite-vision-3-2-2b:
      id: ibm-granite/granite-vision-3.2-2b
      enabled: true
      resources:
        limits:
          nvidia.com/gpu: "1"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: NoSchedule
      args:
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "6144"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - granite
    llama-guard-3-8b:
      id: meta-llama/Llama-Guard-3-8B
      enabled: true
      registerShield: true
      tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: NoSchedule
      args:
      - --max-model-len
      - "14336"
Note: Make sure you have permission to download the models from Hugging Face and enough GPUs to support all the models you have requested. A larger max-model-len also uses additional VRAM, so scale that parameter to fit your hardware.
- Run the make command again to update the project
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu"
(Output)
NAME READY STATUS RESTARTS AGE
demo-rag-vector-db-v1-0-vz5mf 0/1 Completed 0 35m
ds-pipeline-dspa-6dcf8c7b8f-vkhw8 2/2 Running 1 (34m ago) 34m
ds-pipeline-metadata-envoy-dspa-7659ddc8d9-qvtct 2/2 Running 0 34m
ds-pipeline-metadata-grpc-dspa-8665cd5c6c-mfrj7 1/1 Running 0 34m
ds-pipeline-persistenceagent-dspa-56f888bc78-lzq9s 1/1 Running 0 34m
ds-pipeline-scheduledworkflow-dspa-c94d5c95d-rr8td 1/1 Running 0 34m
ds-pipeline-workflow-controller-dspa-5799548b68-z2lcl 1/1 Running 0 34m
fetch-and-store-pipeline-w7gxh-system-container-driver-1552269565 0/2 Completed 0 30m
fetch-and-store-pipeline-w7gxh-system-container-driver-2057025395 0/2 Completed 0 30m
fetch-and-store-pipeline-w7gxh-system-container-impl-1487941461 0/2 Completed 0 30m
fetch-and-store-pipeline-w7gxh-system-container-impl-883889707 0/2 Completed 0 29m
fetch-and-store-pipeline-w7gxh-system-dag-driver-190510417 0/2 Completed 0 30m
granite-vision-3-2-2b-predictor-00001-deployment-5dbcf6f454mrd6 3/3 Running 0 10m
granite-vision-3-2-2b-predictor-00001-deployment-5dbcf6f45xxk5x 0/3 ContainerStatusUnknown 3 13m
llama-3-2-3b-instruct-predictor-00001-deployment-6f845f65674ncq 3/3 Running 0 35m
llama-guard-3-8b-predictor-00001-deployment-6cbff4965c-gzx5v 3/3 Running 0 13m
llamastack-7989d974fc-w24fn 1/1 Running 0 13m
mariadb-dspa-74744d65bd-kb2dh 1/1 Running 0 35m
minio-0 1/1 Running 0 35m
minio-dspa-7bb47d68b4-kb722 1/1 Running 0 35m
pgvector-0 1/1 Running 0 35m
rag-7fd7b47844-jkqtf 1/1 Running 0 35m
rag-mcp-weather-9cc97d574-s8vpt 1/1 Running 0 35m
rag-pipeline-notebook-0 2/2 Running 0 35m
upload-sample-docs-job-952gj 0/1 Completed 0 35m
Return to the RAG UI and look into the Inspect tab to see the additional models and shields.
The newly added shield can be tested via the UI by selecting Agent-based and Chat
make uninstall NAMESPACE=llama-stack-rag
or
oc delete project llama-stack-rag












