Background:
We are using gpu-operator on GKE (COS), which already comes with the toolkit and device plugin installed.
I have gone through this page: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html
When we let the GPU operator install and manage the toolkit, containerd becomes unstable (frequent restarts) and we see image pull errors. That is a separate issue of its own; it may be because COS is restrictive and does not tolerate its configuration being changed.
Hence we set toolkit.enabled and devicePlugin.enabled to false, to use whatever GKE already provides.
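For reference, a minimal sketch of the relevant part of our Helm values. Only the two keys named above are shown; the full file is linked at the end of this issue and contains many other settings.

```yaml
# Fragment of the gpu-operator Helm values (sketch, not the full file).
# Both components are left to GKE, so the operator should not deploy them.
toolkit:
  enabled: false
devicePlugin:
  enabled: false
```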
The only problem is that toolkit-validation in operator-validator errors out.
Root cause:
toolkit-validation runs with securityContext: privileged and asks for GPUs using NVIDIA_VISIBLE_DEVICES: all, without requesting any nvidia.com/gpu resource.
Normally the toolkit and device plugin honor this and inject nvidia-smi and the GPUs into the container.
But GKE's toolkit does not inject nvidia-smi unless the container explicitly requests nvidia.com/gpu > 0.
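To illustrate the difference, here is a rough sketch of a pod that explicitly requests a GPU, which is the form GKE's toolkit responds to in our setup. The pod name and image are placeholders chosen for illustration and are not taken from the operator; whether nvidia-smi is on the PATH may also depend on the image and node configuration.

```yaml
# Hypothetical smoke-test pod (name and image are assumptions, not operator defaults).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # assumed image, any CUDA base image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # explicit GPU request, unlike toolkit-validation
```

The toolkit-validation container, by contrast, sets only NVIDIA_VISIBLE_DEVICES: all and has no such resource limit, so on GKE nothing gets injected.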
We would like an option to skip toolkit validation, or any individual validation component.
Full set of values here: https://github.com/truefoundry/infra-charts/blob/2c3b765c57acb36fd6cf0a4b7c4f8da47d0ed15b/charts/tfy-gpu-operator/values.yaml#L516-L888