
feat(sandbox): switch device plugin to CDI injection mode#503

Open
elezar wants to merge 4 commits into main from feat/device-plugin-cdi-injection

Conversation

elezar (Member) commented Mar 20, 2026

Summary

Configure the NVIDIA device plugin to use deviceListStrategy: cdi-cri so GPU devices are injected via direct CDI device requests in the CRI. Sandbox pods now only need nvidia.com/gpu: 1 in their resource limits — runtimeClassName is no longer set on GPU pods.
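Under CDI injection mode, a GPU sandbox pod only carries the device resource limit. A minimal sketch of what such a pod spec might look like (the pod name and image are illustrative, not taken from the repository):

```yaml
# Illustrative sandbox pod under CDI injection mode.
# Note: no runtimeClassName field; the device plugin's cdi-cri
# strategy injects the GPU via direct CDI device requests in the CRI.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sandbox              # hypothetical name
spec:
  containers:
    - name: sandbox
      image: example/sandbox:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```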

Related Issue

Related to #398

Changes

  • deploy/docker/Dockerfile.images: remove unneeded NVIDIA Container Toolkit components from gateway image.
  • deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml: add deviceListStrategy: cdi-cri, cdi.nvidiaHookPath, and nvidiaDriverRoot: "/" to Helm values
  • crates/openshell-server/src/sandbox/mod.rs: remove runtimeClassName insertion for GPU pods in both sandbox_template_to_k8s() and inject_pod_template(); add unit test asserting CDI path sets no runtimeClassName
  • architecture/gateway-single-node.md: update GPU Enablement section to document CDI injection mode
  • .agents/skills/debug-openshell-cluster/SKILL.md: add Step 8 with CDI-specific diagnostics (nvidia-ctk cdi list, device plugin logs, CDI spec files)
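The Helm values change listed above might look like the following sketch. The key names come from the PR description, but the hook path value shown here is an assumption, not confirmed by the source:

```yaml
# Sketch of the NVIDIA device plugin Helm values described in this PR.
deviceListStrategy: cdi-cri
nvidiaDriverRoot: "/"
cdi:
  nvidiaHookPath: /usr/bin/nvidia-cdi-hook   # assumed path, verify for your install
```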

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (not applicable — CDI-specific assertion not observable from inside the sandbox)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated

elezar self-assigned this Mar 20, 2026
elezar force-pushed the feat/device-plugin-cdi-injection branch from 6840a21 to 6682d7d on March 20, 2026 17:00
elezar force-pushed the feat/device-plugin-cdi-injection branch from 6682d7d to 9c39785 on March 20, 2026 20:47
elezar marked this pull request as ready for review on March 20, 2026 20:47
elezar requested a review from a team as a code owner on March 20, 2026 20:47
```diff
-            serde_json::json!(GPU_RUNTIME_CLASS_NAME),
-        );
-    } else if !template.runtime_class_name.is_empty() {
+    if !template.runtime_class_name.is_empty() {
```
Collaborator commented:

I think this validation needs to be updated as well, since the runtime class is no longer going to be set:

```rust
let runtime_classes: Api<DynamicObject> = Api::all_with(
    self.client.clone(),
    &ApiResource::from_gvk(&GroupVersionKind::gvk("node.k8s.io", "v1", "RuntimeClass")),
);
let runtime_class_exists = runtime_classes
    .get_opt(GPU_RUNTIME_CLASS_NAME)
    .await
    .map_err(|err| {
        tonic::Status::internal(format!("check GPU runtime class failed: {err}"))
    })?
    .is_some();
if !runtime_class_exists {
    return Err(tonic::Status::failed_precondition(
        "GPU sandbox requested, but the active gateway is not GPU-enabled. To start a gateway with GPU support run: `openshell gateway start --gpu`",
    ));
}
```

elezar (Member, Author) replied:

That's a good point. The tests pass in this case because the nvidia runtime class still exists on a GPU-enabled k3s cluster, but this is not a requirement for running GPU-enabled sandboxes.

(Note that we do currently still rely on the runtimeClass to deploy the NVIDIA Device Plugin with GPU access).

elezar (Member, Author) replied:

I have removed the runtime class check in validate_gpu_support.

elezar force-pushed the feat/device-plugin-cdi-injection branch 2 times, most recently from da63c43 to 590e002 on March 25, 2026 11:09
elezar added 4 commits March 25, 2026 15:41
Configure the NVIDIA device plugin to use deviceListStrategy=cdi-cri so
that GPU devices are injected via direct CDI device requests in the CRI.
Sandbox pods now only require the nvidia.com/gpu resource request —
runtimeClassName is no longer set on GPU pods.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Using index-based device IDs improves compatibility across platforms
including Jetson/Tegra-based and WSL2-based systems.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
For newer NVIDIA Container Toolkit versions, the components installed
through the nvidia-container-toolkit, libnvidia-container-tools, and
libnvidia-container1 packages are considered legacy. In CDI mode -- or
when native CDI is used -- only the nvidia-container-toolkit-base
package is required with the notable components being:

* nvidia-ctk - The general purpose NVIDIA Container Toolkit CLI. It
  includes functionality such as nvidia-ctk cdi generate to generate
  CDI specifications and nvidia-ctk cdi list to show available CDI
  devices.
* nvidia-cdi-hook - Implements specific container lifecycle hooks used
  to ensure that a container is set up correctly to allow GPU access
  after device nodes and driver files are injected using CDI.
  This CLI is aliased by the `nvidia-ctk hook` subcommand.
* nvidia-container-runtime - A wrapper for runc that adds GPU support
  in environments where direct CDI device requests are not possible.
  This includes k3s, where the nvidia RuntimeClass is added automatically
  if the nvidia-container-runtime is detected, and is used to ensure the
  injection of device nodes and libraries for the k8s-device-plugin
  containers.

This change also renames the Docker build stage to nvidia-container-toolkit
explicitly for clarity.
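
The slimmed-down install step described above might look like the following Dockerfile sketch. The base image, package manager, and stage layout are assumptions for illustration; only the stage name nvidia-container-toolkit and the nvidia-container-toolkit-base package come from the commit message:

```dockerfile
# Hypothetical sketch: in CDI mode only the base package is needed,
# which provides nvidia-ctk and nvidia-cdi-hook without the legacy
# nvidia-container-toolkit / libnvidia-container packages.
FROM ubuntu:24.04 AS nvidia-container-toolkit
RUN apt-get update && \
    apt-get install -y --no-install-recommends nvidia-container-toolkit-base && \
    rm -rf /var/lib/apt/lists/*
```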

Signed-off-by: Evan Lezar <elezar@nvidia.com>
elezar force-pushed the feat/device-plugin-cdi-injection branch from a063051 to 9555515 on March 25, 2026 14:42