OpenShift Deployment

This guide outlines the steps to deploy and run the RAG QuickStart on an OpenShift cluster.

Prerequisites
Supported Models
Installing the RAG QuickStart
Using the RAG UI
Environment Variables
Adding a new model
Uninstalling the RAG QuickStart

Prerequisites

OpenShift Cluster 4.16+ with OpenShift AI
OpenShift Client CLI - oc
Helm CLI - helm
huggingface-cli (optional)
1 GPU/HPU with 24GB of VRAM for the LLM (see the Supported Models table below)
1 GPU/HPU with 24GB of VRAM for the safety/shield model (optional)
Hugging Face Token
Access to the Meta Llama model
Access to the Meta Llama Guard model
jq, a JSON parsing utility used by some of the example scripts (with Homebrew: brew install jq)
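
A quick way to confirm the command-line prerequisites is to check that each tool resolves on your PATH. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of the QuickStart repository.

```shell
# Print the name of every listed tool that is not on PATH.
# Usage: missing_tools TOOL...
missing_tools() {
  for tool in "$@"; do
    # `command -v` is the POSIX way to test whether a command resolves.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: oc and helm are required by this guide; jq is used by some examples.
# missing=$(missing_tools oc helm jq)
# [ -z "$missing" ] || echo "Install before continuing: $missing"
```

This only checks that the tools exist; it does not check their versions or your cluster access.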

Supported Models

| Function | Model Name | Hardware | AWS Instance Type |
|---|---|---|---|
| Embedding | `all-MiniLM-L6-v2` | CPU/GPU/HPU | |
| Generation | `meta-llama/Llama-3.2-3B-Instruct` | L4/HPU | g6.2xlarge |
| Generation | `meta-llama/Llama-3.1-8B-Instruct` | L4/HPU | g6.2xlarge |
| Generation | `meta-llama/Meta-Llama-3-70B-Instruct` | A100 x2/HPU | p4d.24xlarge |
| Safety | `meta-llama/Llama-Guard-3-8B` | L4/HPU | g6.2xlarge |

Note: the 70B model is NOT required for initial testing of this example. The safety/shield model Llama-Guard-3-8B is also optional.

Installing the RAG QuickStart

1. Clone Repository

Clone the repository to get a working copy:

git clone https://github.com/rh-ai-quickstart/RAG

2. Login to OpenShift

oc login --server="<cluster-api-endpoint>" --token="sha256~XYZ"

3. Hardware Configuration

Determine what hardware acceleration is available in your cluster and configure accordingly.

For NVIDIA GPU nodes: If GPU nodes are tainted, find the taint key. In the example below, the taint key is nvidia.com/gpu.

oc get nodes -l nvidia.com/gpu.present=true -o yaml | grep -A 3 taint

For Intel Gaudi HPU nodes: If HPU nodes are tainted, find the taint key. The taint key is typically habana.ai/gaudi.

oc get nodes -l habana.ai/gaudi.present=true -o yaml | grep -A 3 taint

The output of either command should look similar to this:

taints:
  - effect: NoSchedule
    key: nvidia.com/gpu  # or habana.ai/gaudi for HPU
    value: "true"

Work with your OpenShift cluster admin team to determine which labels and taints identify GPU-enabled or HPU-enabled worker nodes. It is also possible that all your worker nodes have accelerators and therefore carry no distinguishing taint.
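
If you want just the taint keys rather than the surrounding YAML, the grep output can be filtered further. This is a minimal sketch that assumes the taint block has the shape shown above; taint_keys is a hypothetical helper name.

```shell
# Read YAML lines on stdin and print each distinct taint key once.
# Matches both "key: value" and "- key: value" list forms and strips quotes.
taint_keys() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "key:") {
        k = $(i + 1)
        gsub(/"/, "", k)
        print k
        break
      }
    }
  }' | sort -u
}

# Example (requires a logged-in oc session):
# oc get nodes -l nvidia.com/gpu.present=true -o yaml | grep -A 3 taint | taint_keys
```

The printed key is the value to use for tolerations in the later configuration and deployment steps.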

4. Navigate to Deployment Directory

Change to the Helm deployment directory:

cd deploy/helm

5. List Available Models

List the models available for deployment:

make list-models

The command lists the model keys that you can enable in step 6 or pass as LLM and SAFETY in step 7.

(Output)
model: llama-3-1-8b-instruct (meta-llama/Llama-3.1-8B-Instruct)
model: llama-3-2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
model: llama-3-2-1b-instruct-quantized (RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8)
model: llama-3-2-3b-instruct (meta-llama/Llama-3.2-3B-Instruct)
model: llama-3-3-70b-instruct (meta-llama/Llama-3.3-70B-Instruct)
model: llama-guard-3-1b (meta-llama/Llama-Guard-3-1B)
model: llama-guard-3-8b (meta-llama/Llama-Guard-3-8B)

The "guard" models can be used to test shields for profanity, hate speech, violence, etc.
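
When scripting, it can help to reduce the listing to the model keys alone. The sketch below assumes the output format shown above, where each line starts with "model:"; model_keys is a hypothetical helper name.

```shell
# Read `make list-models` output on stdin and print only the model keys,
# that is, the second field of every line whose first field is "model:".
model_keys() {
  awk '$1 == "model:" { print $2 }'
}

# Example: show only the guard models.
# make list-models | model_keys | grep guard
```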

6. Initialize Configuration (Recommended for Fine-Grained Control)

You can configure your deployment in two ways:

Using a configuration file (recommended for complex setups, multiple models, or persistent configuration)
Using command-line parameters (quick deployments, see step 7 Option B)

To use the configuration file approach, initialize it. This will create a rag-values.yaml file from the example template:

make init-config

The system will display a configuration banner prompting you to edit the file. Open a new terminal window and edit the configuration:

# Edit with your preferred editor
nano rag-values.yaml
# or
vim rag-values.yaml

Important: You MUST complete at least the first two items below. The items marked (Optional) depend on your setup:

Enable at least ONE model in the global.models section by setting enabled: true
Add your Hugging Face token (get it from https://huggingface.co/settings/tokens)
(Optional) Add your TAVILY API key for web search functionality
(Optional) Configure tolerations if your nodes are tainted (see step 3)

Example model configuration:

global:
  models:
    llama-3-2-3b-instruct:
      id: meta-llama/Llama-3.2-3B-Instruct
      enabled: true
      device: "gpu"  # Options: "cpu", "gpu", "hpu"
      resources:
        limits:
          nvidia.com/gpu: "1"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: NoSchedule
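
Before deploying, you can do a rough check that the values file enables at least one model. The sketch below is a plain text match, not a YAML parse, so treat it as a sanity check only; make validate-config remains the actual validation step. The function name count_enabled is hypothetical.

```shell
# Count lines of the form "enabled: true" in a values file.
# This is a text match, so it also counts any non-model key named "enabled".
# Usage: count_enabled FILE
count_enabled() {
  grep -cE '^[[:space:]]*enabled:[[:space:]]*true[[:space:]]*(#.*)?$' "$1"
}

# Example:
# if [ "$(count_enabled rag-values.yaml)" -eq 0 ]; then
#   echo "Enable at least one model in global.models"
# fi
```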

Quick configuration commands:

# Interactively configure API keys
make configure-keys

# View current configuration
make show-config

# Validate configuration
make validate-config

7. Deploy with Helm

There are two ways to deploy: using the configuration file or using command-line parameters.

Option A: Deploy with Configuration File (Recommended)

After configuring the rag-values.yaml file in step 6, deploy using make:

make install NAMESPACE=llama-stack-rag

The system will:

Validate your configuration
Prompt for any missing API keys (Hugging Face token, TAVILY key)
Deploy all configured services using the models and settings from rag-values.yaml

Option B: Deploy with Command-Line Parameters

You can also deploy by passing configuration parameters directly via the command line. This approach is useful for quick deployments or CI/CD pipelines.

GPU Deployment Examples (Default):

To install only the RAG example, no shields:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu"

To install both the RAG example and the guard model for shields:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="nvidia.com/gpu"

Note: DEVICE=gpu is the default and can be omitted.

Intel Gaudi HPU Deployment Examples:

To install only the RAG example on Intel Gaudi HPU nodes:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="habana.ai/gaudi" DEVICE=hpu

To install both the RAG example and guard model on Intel Gaudi HPU nodes:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="habana.ai/gaudi" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="habana.ai/gaudi" DEVICE=hpu

CPU Deployment Example:

To install on CPU nodes only:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct DEVICE=cpu

Simplified Commands (No Tolerations Needed):

If you have no tainted nodes (all worker nodes have accelerators), you can use simplified commands:

# GPU deployment (default - DEVICE=gpu can be omitted)
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b

# HPU deployment  
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b DEVICE=hpu

# CPU deployment
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b DEVICE=cpu

Note: When using command-line parameters, the rag-values.yaml file will still be created from the example template if it doesn't exist. Command-line parameters will override the model settings in the values file.

When prompted, enter your Hugging Face Token.

This process may take 10 to 30 minutes depending on the number and size of models to be downloaded.
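
For scripted or CI/CD use, the Option B variants above differ only in the device and the toleration key, so the arguments can be assembled by a small helper. This is a sketch under that assumption; install_args is a hypothetical name, and it covers a single LLM without a safety model.

```shell
# Print the `make install` arguments for one LLM.
# Usage: install_args NAMESPACE LLM [DEVICE] [TAINT_KEY]
# DEVICE defaults to gpu; LLM_TOLERATION is added only when TAINT_KEY is given.
install_args() {
  ns=$1
  llm=$2
  device=${3:-gpu}
  taint=${4:-}
  printf 'NAMESPACE=%s LLM=%s DEVICE=%s' "$ns" "$llm" "$device"
  if [ -n "$taint" ]; then
    printf ' LLM_TOLERATION=%s' "$taint"
  fi
  printf '\n'
}

# Example (word splitting of the command substitution is intentional):
# make install $(install_args llama-stack-rag llama-3-2-3b-instruct hpu habana.ai/gaudi)
```

Passing DEVICE=gpu explicitly is equivalent to omitting it, since gpu is the default.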

8. Monitor Deployment

Watch the pods until they reach Running or Completed:

oc get pods -n llama-stack-rag

(Output)
NAME                                                               READY   STATUS      RESTARTS   AGE
demo-rag-vector-db-v1-0-8mkf9                                      0/1     Completed   0          10m
ds-pipeline-dspa-7788689675-9489m                                  2/2     Running     0          10m
ds-pipeline-metadata-envoy-dspa-948676f89-8knw8                    2/2     Running     0          10m
ds-pipeline-metadata-grpc-dspa-7b4bf6c977-cb72m                    1/1     Running     0          10m
ds-pipeline-persistenceagent-dspa-ff9bdfc76-ngddb                  1/1     Running     0          10m
ds-pipeline-scheduledworkflow-dspa-7b64d87fd8-58d87                1/1     Running     0          10m
ds-pipeline-workflow-controller-dspa-5799548b68-bxpdp              1/1     Running     0          10m
fetch-and-store-pipeline-tmxwj-system-container-driver-287597120   0/2     Completed   0          3m43s
fetch-and-store-pipeline-tmxwj-system-container-driver-922184592   0/2     Completed   0          2m54s
fetch-and-store-pipeline-tmxwj-system-container-impl-3210250134    0/2     Completed   0          4m33s
fetch-and-store-pipeline-tmxwj-system-container-impl-3248801382    0/2     Completed   0          3m32s
fetch-and-store-pipeline-tmxwj-system-dag-driver-3443954210        0/2     Completed   0          4m6s
llama-3-2-3b-instruct-predictor-00001-deployment-6bbf96f8674677    3/3     Running     0          10m
llamastack-6d5c5b999b-5lffb                                        1/1     Running     0          11m
mariadb-dspa-74744d65bd-fdxjd                                      1/1     Running     0          10m
minio-0                                                            1/1     Running     0          10m
minio-dspa-7bb47d68b4-nvw7t                                        1/1     Running     0          10m
pgvector-0                                                         1/1     Running     0          10m
rag-7fd7b47844-nlfvr                                               1/1     Running     0          11m
rag-mcp-weather-9cc97d574-nf5q8                                    1/1     Running     0          11m
rag-pipeline-notebook-0                                            2/2     Running     0          10m
upload-sample-docs-job-f5k5w                                       0/1     Completed   0          10m

Verify deployment:

oc get pods -n llama-stack-rag
oc get svc -n llama-stack-rag
oc get routes -n llama-stack-rag

Note: The key pods to watch have predictor in their name. These are the KServe model servers running vLLM.

oc get pods -l component=predictor

Look for 3/3 under the READY column.
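
The READY check can also be automated by comparing the two numbers in that column. The following sketch assumes the default oc get pods column layout shown above (NAME, READY, STATUS); not_ready_pods is a hypothetical helper name.

```shell
# Read `oc get pods` output on stdin and print the name of every pod whose
# STATUS is Running but whose READY column (n/m) has n different from m.
not_ready_pods() {
  awk 'NR > 1 && $3 == "Running" {
    split($2, ready, "/")
    if (ready[1] != ready[2]) print $1
  }'
}

# Example (requires a logged-in oc session):
# oc get pods -n llama-stack-rag -l component=predictor | not_ready_pods
```

Empty output means every running pod in the listing has all of its containers ready.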

The InferenceService custom resource (CR) describes the limits, requests, model name, serving runtime, chat template, etc.

oc get inferenceservice llama-3-2-3b-instruct \
  -n llama-stack-rag \
  -o jsonpath='{.spec.predictor.model}' | jq

9. Verify Installation

Watch the llamastack pod; it becomes available only after all the model servers are up.

oc get pods -l app.kubernetes.io/name=llamastack

Verify OpenShift AI Dashboard

Navigate to OpenShift AI Dashboard and verify the deployment:

Get the OpenShift AI Dashboard route:

oc get routes rhods-dashboard -n redhat-ods-applications

You should see a running workbench with Jupyter Notebook.

Configure Kubeflow Pipelines (Optional - for Batch Document Ingestion)

If you want to use the pre-ingestion pipeline for batch document processing, configure Kubeflow Pipelines with object storage:

Reference Documentation

Get MinIO credentials:

MINIO_API="https://$(oc get route minio-api -o jsonpath='{.spec.host}')"

# Secret values are base64-encoded; decode them before use
B64_USER=$(oc get secret minio -o jsonpath='{.data.username}')
MINIO_USER=$(echo "$B64_USER" | base64 --decode)
echo "user: $MINIO_USER"

B64_PASSWORD=$(oc get secret minio -o jsonpath='{.data.password}')
MINIO_PASSWORD=$(echo "$B64_PASSWORD" | base64 --decode)
echo "password: $MINIO_PASSWORD"

Configure Kubeflow Pipeline:

Navigate to Kubeflow Pipelines in OpenShift AI and configure with these values:

Access Key: minio_rag_user (value of $MINIO_USER)
Secret Key: minio_rag_password (value of $MINIO_PASSWORD)
Endpoint: Value of $MINIO_API
Region: us-east-1
Bucket: documents

Using MinIO CLI (optional):

Install MinIO CLI:

brew install minio/stable/mc

Configure MinIO alias:

mc alias set minio "$MINIO_API" "$MINIO_USER" "$MINIO_PASSWORD"

Create bucket (if not present):

mc mb minio/documents

Upload documents:

mc cp ~/my-documents/my.pdf minio/documents

Access MinIO WebUI:

MINIO_WEB="https://$(oc get route minio-webui -o jsonpath='{.spec.host}')"
# `open` is the macOS launcher; on Linux, xdg-open is the usual equivalent
open "$MINIO_WEB"

Once configured, you can run the ingestion pipeline from the Jupyter notebook.

Running it creates pipelines and runs in Kubeflow.

Verify Embeddings in PGVector (Optional)

To verify that documents have been successfully embedded and stored:

oc exec -it pgvector-0 -- psql -d rag_blueprint -U postgres

-- List tables
\dt

-- View vector store structure
\d+ vector_store_rag_vector_db

-- Count embedded documents
SELECT COUNT(*) FROM vector_store_rag_vector_db;

Example output:

                   List of relations
 Schema |            Name            | Type  |  Owner   
--------+----------------------------+-------+----------
 public | metadata_store             | table | postgres
 public | vector_store_rag_vector_db | table | postgres

 count
-------
   154
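
The same count can be fetched non-interactively and checked in a script. This is a sketch, assuming the pod, database, and table names shown above; is_positive_int is a hypothetical helper name. The psql flags -t (tuples only), -A (unaligned output), and -c (run one command) reduce the output to the bare number.

```shell
# Succeed only when the argument is a positive integer (for example "154").
is_positive_int() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}

# Example (requires a logged-in oc session and the pgvector-0 pod):
# count=$(oc exec pgvector-0 -- psql -d rag_blueprint -U postgres -t -A \
#   -c 'SELECT COUNT(*) FROM vector_store_rag_vector_db;')
# is_positive_int "$count" && echo "embeddings present: $count"
```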

Using the RAG UI

Get the route URL for the application and open it in your browser:

URL=http://$(oc get routes -l app.kubernetes.io/name=rag -o jsonpath="{range .items[*]}{.status.ingress[0].host}{end}")
echo "$URL"
open "$URL"
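
The command above assumes the route is served over plain HTTP. If your cluster exposes the route with TLS, the scheme can be chosen from the route's TLS termination setting. This is a sketch; route_url is a hypothetical helper name, and whether the rag route has TLS enabled depends on your deployment.

```shell
# Print the URL for a route host. A non-empty TLS termination value
# (edge, passthrough, or reencrypt) means the route is served over HTTPS.
# Usage: route_url HOST [TLS_TERMINATION]
route_url() {
  host=$1
  termination=${2:-}
  if [ -n "$termination" ]; then
    printf 'https://%s\n' "$host"
  else
    printf 'http://%s\n' "$host"
  fi
}

# Example (requires a logged-in oc session):
# host=$(oc get routes -l app.kubernetes.io/name=rag -o jsonpath='{.items[0].spec.host}')
# tls=$(oc get routes -l app.kubernetes.io/name=rag -o jsonpath='{.items[0].spec.tls.termination}')
# route_url "$host" "$tls"
```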

Click on Upload Documents
Upload your PDF document
Name and Create a Vector Database

Once you have received the message "Vector database created successfully!", navigate back to Chat and select the newly created vector database.

Ask a question pertaining to your document!

For batch document ingestion using Kubeflow Pipelines, refer to Configure Kubeflow Pipelines in the Verify Installation section above.

Environment Variables

The RAG application uses environment variables for configuration. These are managed through the Helm values file (deploy/helm/rag/values.yaml).

RAG UI Environment Variables

| Environment Variable | Description | Default Value | Configuration Location |
|---|---|---|---|
| `LLAMA_STACK_ENDPOINT` | The endpoint for the Llama Stack API server | `http://llamastack:8321` | `env:` section in values.yaml |

Llama Stack Environment Variables

| Environment Variable | Description | Default Value | Configuration Location |
|---|---|---|---|
| `TAVILY_SEARCH_API_KEY` | API key for Tavily search provider (optional) | `Paste-your-key-here` | `llama-stack.secrets:` in values.yaml |
| `FIREWORKS_API_KEY` | API key for Fireworks AI provider (optional) | (not set) | `llama-stack.secrets:` in values.yaml |
| `TOGETHER_API_KEY` | API key for Together AI provider (optional) | (not set) | `llama-stack.secrets:` in values.yaml |
| `SAMBANOVA_API_KEY` | API key for SambaNova provider (optional) | (not set) | `llama-stack.secrets:` in values.yaml |
| `OPENAI_API_KEY` | API key for OpenAI provider (optional) | (not set) | `llama-stack.secrets:` in values.yaml |

Configuring Environment Variables

To set environment variables, edit deploy/helm/rag/values.yaml before installation:

For RAG UI variables:

env:
  - name: LLAMA_STACK_ENDPOINT
    value: 'http://llamastack:8321'

For Llama Stack secrets (API keys):

llama-stack:
  secrets:
    TAVILY_SEARCH_API_KEY: "your-actual-api-key-here"
    FIREWORKS_API_KEY: "your-fireworks-key"
    TOGETHER_API_KEY: "your-together-key"

Note: For the default deployment, only TAVILY_SEARCH_API_KEY may be needed if you want to enable web search capabilities. Other API keys are only required if you want to use external AI providers.

After modifying the values, redeploy using the same make install command.
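
Because the default value of TAVILY_SEARCH_API_KEY is the placeholder Paste-your-key-here, a deployment script may want to detect a key that was never filled in. The sketch below shows one way to do that; is_placeholder_key is a hypothetical helper name, and reading the key from a shell variable is an assumption about how your script holds it.

```shell
# Succeed when a key value is empty or still the chart's placeholder.
# Usage: is_placeholder_key VALUE
is_placeholder_key() {
  [ -z "${1:-}" ] || [ "$1" = "Paste-your-key-here" ]
}

# Example:
# if is_placeholder_key "$TAVILY_SEARCH_API_KEY"; then
#   echo "Web search is disabled until a real Tavily key is set"
# fi
```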

Adding a new model

To add another model, follow these steps:

Edit deploy/helm/rag-values.yaml (your configuration file).

Update the global.models section:

global:
  models:
    granite-vision-3-2-2b:
      id: ibm-granite/granite-vision-3.2-2b
      enabled: true      
      resources:
        limits:
          nvidia.com/gpu: "1"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: NoSchedule
      args:
      - --tensor-parallel-size
      - "1"
      - --max-model-len
      - "6144"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - granite
    llama-guard-3-8b:
      id: meta-llama/Llama-Guard-3-8B
      enabled: true
      registerShield: true
      tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: NoSchedule
      args:
      - --max-model-len
      - "14336"

Note: Make sure you have permission to download the models from Hugging Face and enough GPUs to support all the models you have requested. A larger max-model-len also requires more VRAM, so scale that parameter to fit your hardware.
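
To get a feel for how max-model-len relates to VRAM, you can estimate the key/value cache needed for one full-length sequence. The sketch below uses the standard per-token formula; kv_cache_mib is a hypothetical helper name, and the model geometry in the example (32 layers, 8 KV heads, head size 128, 16-bit cache) is an assumption based on the Llama 3.1 8B architecture, not a value read from this repository.

```shell
# Estimate KV-cache size in MiB for one sequence of TOKENS tokens:
#   2 (keys and values) * layers * kv_heads * head_dim * bytes_per_value * tokens
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE TOKENS
kv_cache_mib() {
  layers=$1
  kv_heads=$2
  head_dim=$3
  bytes=$4
  tokens=$5
  echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens / 1048576 ))
}

# Example with the assumed geometry and --max-model-len 14336:
# kv_cache_mib 32 8 128 2 14336    # prints 1792
```

This is only the cache for a single sequence, on top of the model weights. The serving runtime manages memory in its own way, so treat the result as a rough lower bound rather than a sizing guarantee.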

Run the make command again to update the project:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu"

(Output)
NAME                                                                READY   STATUS                   RESTARTS      AGE
demo-rag-vector-db-v1-0-vz5mf                                       0/1     Completed                0             35m
ds-pipeline-dspa-6dcf8c7b8f-vkhw8                                   2/2     Running                  1 (34m ago)   34m
ds-pipeline-metadata-envoy-dspa-7659ddc8d9-qvtct                    2/2     Running                  0             34m
ds-pipeline-metadata-grpc-dspa-8665cd5c6c-mfrj7                     1/1     Running                  0             34m
ds-pipeline-persistenceagent-dspa-56f888bc78-lzq9s                  1/1     Running                  0             34m
ds-pipeline-scheduledworkflow-dspa-c94d5c95d-rr8td                  1/1     Running                  0             34m
ds-pipeline-workflow-controller-dspa-5799548b68-z2lcl               1/1     Running                  0             34m
fetch-and-store-pipeline-w7gxh-system-container-driver-1552269565   0/2     Completed                0             30m
fetch-and-store-pipeline-w7gxh-system-container-driver-2057025395   0/2     Completed                0             30m
fetch-and-store-pipeline-w7gxh-system-container-impl-1487941461     0/2     Completed                0             30m
fetch-and-store-pipeline-w7gxh-system-container-impl-883889707      0/2     Completed                0             29m
fetch-and-store-pipeline-w7gxh-system-dag-driver-190510417          0/2     Completed                0             30m
granite-vision-3-2-2b-predictor-00001-deployment-5dbcf6f454mrd6     3/3     Running                  0             10m
granite-vision-3-2-2b-predictor-00001-deployment-5dbcf6f45xxk5x     0/3     ContainerStatusUnknown   3             13m
llama-3-2-3b-instruct-predictor-00001-deployment-6f845f65674ncq     3/3     Running                  0             35m
llama-guard-3-8b-predictor-00001-deployment-6cbff4965c-gzx5v        3/3     Running                  0             13m
llamastack-7989d974fc-w24fn                                         1/1     Running                  0             13m
mariadb-dspa-74744d65bd-kb2dh                                       1/1     Running                  0             35m
minio-0                                                             1/1     Running                  0             35m
minio-dspa-7bb47d68b4-kb722                                         1/1     Running                  0             35m
pgvector-0                                                          1/1     Running                  0             35m
rag-7fd7b47844-jkqtf                                                1/1     Running                  0             35m
rag-mcp-weather-9cc97d574-s8vpt                                     1/1     Running                  0             35m
rag-pipeline-notebook-0                                             2/2     Running                  0             35m
upload-sample-docs-job-952gj                                        0/1     Completed                0             35m

Return to the RAG UI and look into the Inspect tab to see the additional models and shields.

The newly added shield can be tested via the UI by selecting Agent-based and Chat.

Uninstalling the RAG QuickStart

make uninstall NAMESPACE=llama-stack-rag

oc delete project llama-stack-rag
