Add Intel Xeon support (SPR+) for RAG [dev] by tpawlows · Pull Request #127 · rh-ai-quickstart/RAG

tpawlows · 2026-01-12T07:50:20Z

Same as #126, but for the dev branch.

~~Please merge it after rh-ai-quickstart/ai-architecture-charts#137 is merged.~~ - merged ✅

This PR extends the quickstart with an option to deploy RAG on Intel Xeon for balanced price/performance.
- It is similar to the CPU deployment but uses a container image with a vLLM build optimized for Xeon (opea/vllm-cpu-ubi:v0.12.0-ubi9), which leverages the AVX-512 and AMX instruction set extensions for improved inference performance.
- By default, it requires an OpenShift cluster with at least one worker node running an SPR (Sapphire Rapids) or newer generation CPU with more than 16 vCPUs and 64 GiB of memory, e.g. m7i.8xlarge (32 vCPU, 128 GiB, SPR) or m8i.8xlarge (32 vCPU, 128 GiB, GNR).
- Validated models:
  - meta-llama/Llama-3.2-3B-Instruct
  - meta-llama/Llama-3.1-8B-Instruct
- Added an example Xeon configuration to deploy/helm/rag/values.yaml
- Updated README.md
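
For reviewers who want to confirm that a worker node meets the SPR-or-newer requirement, a minimal pre-flight sketch follows. It is not part of this PR. The function name `check_cpu_flags` and the choice of required flags (`avx512f`, `amx_tile`, `amx_bf16`) are assumptions made for illustration; the flag names are those the Linux kernel reports in /proc/cpuinfo, and the exact set the vLLM image relies on may differ.

```shell
#!/bin/sh
# Flags assumed to indicate AVX-512 and AMX support, using the names the
# Linux kernel reports in /proc/cpuinfo. The exact set the vLLM image
# relies on is an assumption made for this sketch.
REQUIRED_FLAGS="avx512f amx_tile amx_bf16"

# check_cpu_flags "<space-separated flags>"
# Prints one "missing: <flag>" line per absent flag and returns 1 if any
# required flag is absent; prints nothing and returns 0 otherwise.
check_cpu_flags() {
  have=" $1 "            # pad with spaces so each flag matches as a whole word
  status=0
  for flag in $REQUIRED_FLAGS; do
    case "$have" in
      *" $flag "*) ;;    # flag present
      *) echo "missing: $flag"; status=1 ;;
    esac
  done
  return "$status"
}

# Usage on a worker node (for example from an `oc debug node/<name>` shell):
#   check_cpu_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

The function takes the flags as an argument rather than reading /proc/cpuinfo itself, so it can be tried against a saved flags line from any node before scheduling the Xeon deployment there.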

To deploy, use the flag DEVICE=xeon:

# Xeon deployment
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b DEVICE=xeon

Successfully deployed and tested with rh-ai-quickstart/ai-architecture-charts#137 (Add Xeon deployment support to llm-service Helm chart).

tpawlows added 5 commits January 12, 2026 08:45

Add example config of llama-3-2-3b-instruct for Xeon deployment

f33f90b

Add Xeon section to README.md and update values min requirements

4b4e041

Minor README update

a13a76c

add llama-3-1-8b-instruct example for xeon

23ceb8e

Split HW section in supported models table, add N/A for HPU

6029868

yuvalturg merged commit 16e8639 into rh-ai-quickstart:dev Jan 12, 2026
