The following will deploy the operator from our source on quay.io:
$ oc login <path to the cluster>
$ NAMESPACE=<your-namespace> RELEASE_VERSION=1.0.20 GPU_COUNT=1 make deploy
... or omit GPU_COUNT to install in demo mode.
To clean up the cluster afterwards:
$ NAMESPACE=<your-namespace> make clean-cluster
- Find a region where your desired machine is available. For example:
    az vm list-skus --resource-type virtualMachines --output table
  will return a table of virtual machine types available in various Azure regions.
- Create a GPU-enabled machineset. Instructions are here
- Add a GPU count label to the new machineset:
    oc label machineset <machineset-name> gpus=<num-gpus>
- Add an availability zone label to the new machineset:
    oc label machineset <machineset-name> az=<availability zone>
- Scale up one GPU-enabled node:
    oc scale machineset --namespace <machineset-namespace> <machineset-name> --replicas=1
  ... wait for it to start and for a node to be assigned to the machine
- Install the Node Feature Discovery (NFD) operator to the cluster. Instructions are here
- Check that NFD is identifying NVIDIA GPUs. When NFD is working, the following will return rows:
    oc describe nodes | grep "pci-10de.present=true"
- Install Red Hat's Special Resource Operator (SRO) to the cluster. It's not in the OperatorHub yet, so install from source:
  - Get source:
      git clone https://github.com/openshift-psap/special-resource-operator
      cd special-resource-operator
      git checkout master
  - Edit assets/0000-state-driver-buildconfig.yaml to change spec.strategy.dockerStrategy.buildArgs to be:
      - name: "DRIVER_VERSION"
        value: "410.129-diagnostic"
      - name: "SHORT_DRIVER_VERSION"
        value: "410.129"
  - Install it:
      make deploy
  - Wait for it to mark the nodes with their GPU counts. The following should return rows:
      oc describe nodes | grep nvidia.com/gpu
    Note that it'll take a while since it needs to build drivers from source.
- Scale all machinesets that carry a gpus label down to zero replicas:
    oc scale machinesets --namespace <machineset-namespace> --selector 'gpus' --replicas=0
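
The two grep checks used in the steps above (the NFD label and the nvidia.com/gpu resource) can be combined into a single status report. The following is a minimal sketch, not part of the operator's tooling: the function names count_node_matches and gpu_setup_status are hypothetical, and it assumes oc is already logged in to the cluster.

```shell
# Hypothetical helpers (not part of the operator): summarize the NFD and SRO checks.

# Print how many lines of `oc describe nodes` contain the fixed string $1.
count_node_matches() {
  # grep -c exits non-zero when the count is 0, so mask that with `|| true`.
  oc describe nodes | grep -c -F -- "$1" || true
}

# Report both checks; succeed only if each one returns at least one row.
gpu_setup_status() {
  nfd_rows=$(count_node_matches "pci-10de.present=true")
  gpu_rows=$(count_node_matches "nvidia.com/gpu")
  echo "NFD NVIDIA label rows: $nfd_rows"
  echo "nvidia.com/gpu rows: $gpu_rows"
  [ "$nfd_rows" -gt 0 ] && [ "$gpu_rows" -gt 0 ]
}
```

Running gpu_setup_status prints both counts and returns a non-zero status while either check still finds no rows, which makes it usable as a condition in a script.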
To scale GPU machinesets back up, select them by their gpus and az labels. For example, to start a 1-GPU machineset in us-east-2a:
oc scale machinesets.machine.openshift.io -n openshift-machine-api --selector="gpus=1,az=us-east-2a" --replicas=1
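
Because the selector is just the gpus and az labels joined with a comma, it can be built from parameters. This is a small sketch under that assumption; gpu_selector and scale_gpu_machinesets are hypothetical names, and the namespace is the openshift-machine-api one used in the command above.

```shell
# Hypothetical helpers: build the label selector and scale matching machinesets.

# gpu_selector GPUS [AZ] prints a selector such as "gpus=1,az=us-east-2a".
gpu_selector() {
  if [ -n "${2:-}" ]; then
    printf 'gpus=%s,az=%s\n' "$1" "$2"
  else
    printf 'gpus=%s\n' "$1"
  fi
}

# scale_gpu_machinesets REPLICAS GPUS [AZ]
scale_gpu_machinesets() {
  oc scale machinesets.machine.openshift.io -n openshift-machine-api \
    --selector="$(gpu_selector "$2" "${3:-}")" --replicas="$1"
}
```

With these definitions, scale_gpu_machinesets 1 1 us-east-2a issues the same oc scale command as the example above, and leaving out the zone selects on the gpus label alone.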
Wait for the SRO to recognize the GPUs:
oc describe nodes | grep nvidia.com/gpu
Note that it'll take a while since it needs to build drivers from source.
If it takes more than 15 minutes, you can try restarting the SRO:
oc scale deployment -n openshift-sro special-resource-operator --replicas=0
oc scale deployment -n openshift-sro special-resource-operator --replicas=1
... and then watch for it to recognize the GPUs
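
The wait, restart, and watch sequence above can be scripted. The following is a hedged sketch rather than anything shipped with the operator: wait_for, gpus_visible, and restart_sro are hypothetical names, and the 900-second timeout in the usage note simply mirrors the 15 minutes mentioned above.

```shell
# Hypothetical helpers: poll for GPUs and restart the SRO if they never appear.

# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the timeout has passed.
wait_for() {
  wf_timeout=$1
  wf_interval=$2
  shift 2
  wf_deadline=$(( $(date +%s) + wf_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$wf_deadline" ]; then
      return 1
    fi
    sleep "$wf_interval"
  done
  return 0
}

# Succeeds once `oc describe nodes` mentions the nvidia.com/gpu resource.
gpus_visible() {
  oc describe nodes | grep -q -F "nvidia.com/gpu"
}

# Restart the SRO by scaling its deployment to zero and back to one.
restart_sro() {
  oc scale deployment -n openshift-sro special-resource-operator --replicas=0
  oc scale deployment -n openshift-sro special-resource-operator --replicas=1
}
```

One possible use: wait_for 900 30 gpus_visible || { restart_sro; wait_for 900 30 gpus_visible; } polls every 30 seconds, restarts the SRO once after 15 minutes without GPUs, and then polls again.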