
feat(sandbox): switch device plugin to CDI injection mode#503

Open
elezar wants to merge 4 commits into main from feat/device-plugin-cdi-injection

Conversation

elezar (Member) commented Mar 20, 2026

Summary

Configure the NVIDIA device plugin to use deviceListStrategy: cdi-cri so GPU devices are injected via direct CDI device requests in the CRI. Sandbox pods now only need nvidia.com/gpu: 1 in their resource limits — runtimeClassName is no longer set on GPU pods.
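Under CDI injection mode, a GPU sandbox pod only carries the device resource limit. A minimal sketch of what such a pod spec might look like (the pod name and image are illustrative, not taken from the repository):

```yaml
# Illustrative sandbox pod under CDI injection mode.
# Note: no runtimeClassName field; the device plugin's cdi-cri
# strategy injects the GPU via direct CDI device requests in the CRI.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sandbox              # hypothetical name
spec:
  containers:
    - name: sandbox
      image: example/sandbox:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```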

Related Issue

Related to #398

Changes

  • deploy/docker/Dockerfile.images: remove unneeded NVIDIA Container Toolkit components from gateway image.
  • deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml: add deviceListStrategy: cdi-cri, cdi.nvidiaHookPath, and nvidiaDriverRoot: "/" to Helm values
  • crates/openshell-server/src/sandbox/mod.rs: remove runtimeClassName insertion for GPU pods in both sandbox_template_to_k8s() and inject_pod_template(); add unit test asserting CDI path sets no runtimeClassName
  • architecture/gateway-single-node.md: update GPU Enablement section to document CDI injection mode
  • .agents/skills/debug-openshell-cluster/SKILL.md: add Step 8 with CDI-specific diagnostics (nvidia-ctk cdi list, device plugin logs, CDI spec files)
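The Helm values change listed above might look like the following sketch. The key names come from the PR description, but the hook path value shown here is an assumption, not confirmed by the source:

```yaml
# Sketch of the NVIDIA device plugin Helm values described in this PR.
deviceListStrategy: cdi-cri
nvidiaDriverRoot: "/"
cdi:
  nvidiaHookPath: /usr/bin/nvidia-cdi-hook   # assumed path, verify for your install
```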

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (not applicable — CDI-specific assertion not observable from inside the sandbox)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated

elezar self-assigned this Mar 20, 2026
elezar force-pushed the feat/device-plugin-cdi-injection branch from 6840a21 to 6682d7d on March 20, 2026 17:00
elezar force-pushed the feat/device-plugin-cdi-injection branch from 6682d7d to 9c39785 on March 20, 2026 20:47
elezar marked this pull request as ready for review on March 20, 2026 20:47
elezar requested a review from a team as a code owner on March 20, 2026 20:47
```diff
-            serde_json::json!(GPU_RUNTIME_CLASS_NAME),
-        );
-    } else if !template.runtime_class_name.is_empty() {
+    if !template.runtime_class_name.is_empty() {
```
Collaborator commented:

I think this validation needs to be updated as well, since the runtime class is no longer going to be set:

```rust
let runtime_classes: Api<DynamicObject> = Api::all_with(
    self.client.clone(),
    &ApiResource::from_gvk(&GroupVersionKind::gvk("node.k8s.io", "v1", "RuntimeClass")),
);
let runtime_class_exists = runtime_classes
    .get_opt(GPU_RUNTIME_CLASS_NAME)
    .await
    .map_err(|err| {
        tonic::Status::internal(format!("check GPU runtime class failed: {err}"))
    })?
    .is_some();
if !runtime_class_exists {
    return Err(tonic::Status::failed_precondition(
        "GPU sandbox requested, but the active gateway is not GPU-enabled. To start a gateway with GPU support run: `openshell gateway start --gpu`",
    ));
}
```

elezar (Member, Author) replied:

That's a good point. The tests pass in this case because the nvidia runtime class still exists on a GPU-enabled k3s cluster, but this is not a requirement for running GPU-enabled sandboxes.

(Note that we do currently still rely on the runtimeClass to deploy the NVIDIA Device Plugin with GPU access).

elezar (Member, Author) replied:

I have removed the runtime class check in validate_gpu_support.

elezar force-pushed the feat/device-plugin-cdi-injection branch 2 times, most recently from da63c43 to 590e002 on March 25, 2026 11:09
elezar added 4 commits March 25, 2026 15:41
Configure the NVIDIA device plugin to use deviceListStrategy=cdi-cri so
that GPU devices are injected via direct CDI device requests in the CRI.
Sandbox pods now only require the nvidia.com/gpu resource request —
runtimeClassName is no longer set on GPU pods.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Using index-based device IDs improves compatibility across platforms
including Jetson/Tegra-based and WSL2-based systems.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
For newer NVIDIA Container Toolkit versions, the components installed
through the nvidia-container-toolkit, libnvidia-container-tools, and
libnvidia-container1 packages are considered legacy. In CDI mode -- or
when native CDI is used -- only the nvidia-container-toolkit-base
package is required with the notable components being:

* nvidia-ctk - The general purpose NVIDIA Container Toolkit CLI. It
  includes functionality such as nvidia-ctk cdi generate to generate
  CDI specifications and nvidia-ctk cdi list to show available CDI
  devices.
* nvidia-cdi-hook - Implements specific container lifecycle hooks used
  to ensure that a container is set up correctly to allow GPU access
  after device nodes and driver files are injected using CDI.
  This CLI is aliased by the `nvidia-ctk hook` subcommand.
* nvidia-container-runtime - A wrapper for runc that adds GPU support
  in environments where direct CDI device requests are not possible.
  This includes k3s, where the nvidia RuntimeClass is added automatically
  if the nvidia-container-runtime is detected, and is used to ensure the
  injection of device nodes and libraries for the k8s-device-plugin
  containers.

This change also renames the Docker build stage to nvidia-container-toolkit
explicitly for clarity.
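
The slimmed-down install step described above might look like the following Dockerfile sketch. The base image, package manager, and stage layout are assumptions for illustration; only the stage name nvidia-container-toolkit and the nvidia-container-toolkit-base package come from the commit message:

```dockerfile
# Hypothetical sketch: in CDI mode only the base package is needed,
# which provides nvidia-ctk and nvidia-cdi-hook without the legacy
# nvidia-container-toolkit / libnvidia-container packages.
FROM ubuntu:24.04 AS nvidia-container-toolkit
RUN apt-get update && \
    apt-get install -y --no-install-recommends nvidia-container-toolkit-base && \
    rm -rf /var/lib/apt/lists/*
```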

Signed-off-by: Evan Lezar <elezar@nvidia.com>
elezar force-pushed the feat/device-plugin-cdi-injection branch from a063051 to 9555515 on March 25, 2026 14:42