29 changes: 28 additions & 1 deletion architecture/gateway-single-node.md
@@ -272,6 +272,28 @@ Writes `/etc/rancher/k3s/registries.yaml` from `REGISTRY_HOST`, `REGISTRY_ENDPOI

Copies bundled manifests from `/opt/openshell/manifests/` to `/var/lib/rancher/k3s/server/manifests/`. This is needed because the volume mount on `/var/lib/rancher/k3s` overwrites any files baked into that path at image build time.
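The copy step amounts to roughly the following (a sketch inferred from the text above; the real entrypoint may filter or rename files). Stand-in temp directories keep the sketch runnable outside the container:

```shell
# Sketch of the bundled-manifest copy (assumption: a plain cp of *.yaml files).
BUNDLED=$(mktemp -d)        # stands in for /opt/openshell/manifests
K3S_MANIFESTS=$(mktemp -d)  # stands in for /var/lib/rancher/k3s/server/manifests
echo "kind: HelmChart" > "$BUNDLED/openshell-helmchart.yaml"

# Re-seed the manifests dir after the volume mount has wiped it.
mkdir -p "$K3S_MANIFESTS"
cp "$BUNDLED"/*.yaml "$K3S_MANIFESTS"/
ls "$K3S_MANIFESTS"
```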

### WSL2 CDI spec watcher

On WSL2 hosts with GPU support, the NVIDIA device plugin generates CDI (Container Device Interface) specs that k3s/containerd cannot consume directly. Two incompatibilities exist: the `cdiVersion` field uses a version that containerd rejects, and the device is named `"all"` instead of the numeric index `"0"` that containerd expects. The entrypoint solves this with a background watcher that transforms the spec in real time.

Three shell variables define the file paths:

| Variable | Value |
|---|---|
| `CDI_SPEC_DIR` | `/var/run/cdi` |
| `CDI_WSL_INPUT` | `/var/run/cdi/k8s.device-plugin.nvidia.com-gpu.json` (device plugin output) |
| `CDI_WSL_OUTPUT` | `/var/run/cdi/openshell-wsl.json` (transformed spec for containerd) |

`transform_wsl_cdi_spec()` uses `jq` to rewrite the input spec: it sets `cdiVersion` to `"0.5.0"` and renames `devices[0].name` from `"all"` to `"0"`. The write is atomic -- `jq` outputs to a PID-suffixed temp file, then `mv` replaces the output path.
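The transform can be exercised in isolation. The input below is an abridged, hypothetical spec (the real device plugin output carries more fields, and the input `cdiVersion` value is an assumption), written to `/tmp` rather than `/var/run/cdi`:

```shell
# Hypothetical device plugin output (abridged; only the two rewritten
# fields are taken from the doc, the rest is illustrative).
cat > /tmp/k8s.device-plugin.nvidia.com-gpu.json <<'EOF'
{
  "cdiVersion": "0.6.0",
  "kind": "k8s.device-plugin.nvidia.com/gpu",
  "devices": [
    { "name": "all",
      "containerEdits": { "deviceNodes": [ { "path": "/dev/dxg" } ] } }
  ]
}
EOF

# The same jq filter the entrypoint applies, with the same atomic-write
# pattern: write to a PID-suffixed temp file, then mv into place.
tmp="/tmp/openshell-wsl.json.tmp.$$"
jq '.cdiVersion = "0.5.0" | .devices[0].name = "0"' \
  /tmp/k8s.device-plugin.nvidia.com-gpu.json > "$tmp"
mv "$tmp" /tmp/openshell-wsl.json
```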

`watch_cdi_specs()` runs as a background process:

1. Creates `CDI_SPEC_DIR` if missing.
2. Checks for a spec already present at startup (handles gateway container restarts). If found and it references `/dev/dxg`, transforms it immediately.
3. Enters a persistent `inotifywait` loop watching for `close_write` or `moved_to` events on the CDI spec directory. When the device plugin writes or moves a new spec matching the expected filename, and the spec references `/dev/dxg` (confirming WSL2 context), it triggers `transform_wsl_cdi_spec()`.

The watcher only starts when both `GPU_ENABLED=true` and `/dev/dxg` exists (a character device present only on WSL2 hosts). It runs in the background (`watch_cdi_specs &`) before `exec k3s`.

### Image configuration overrides

When environment variables are set, the entrypoint modifies the HelmChart manifest at `/var/lib/rancher/k3s/server/manifests/openshell-helmchart.yaml`:
@@ -299,7 +321,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa
- `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
- When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
- `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
- `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory. On WSL2 hosts (detected by `/dev/dxg`), the entrypoint also starts a background CDI spec watcher that transforms device plugin specs for k3s/containerd compatibility (see [WSL2 CDI spec watcher](#wsl2-cdi-spec-watcher) under Entrypoint Script).
- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels.
- k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
- The OpenShell Helm chart grants the gateway service account cluster-scoped read access to `node.k8s.io/runtimeclasses` and core `nodes` so GPU sandbox admission can verify both the `nvidia` `RuntimeClass` and allocatable GPU capacity before creating a sandbox.
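For reference, `docker run --gpus all` corresponds to roughly this fragment of the Docker Engine container-create request body (the `DeviceRequest` shape is from the Engine API; treat it as a sketch rather than the exact payload OpenShell builds):

```json
{
  "HostConfig": {
    "DeviceRequests": [
      {
        "Driver": "",
        "Count": -1,
        "Capabilities": [["gpu"]]
      }
    ]
  }
}
```

`Count: -1` is the API's encoding of "all GPUs", matching the `--gpus all` CLI form.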
@@ -316,6 +338,11 @@ Host GPU drivers & NVIDIA Container Toolkit

The expected smoke test is a plain pod requesting `nvidia.com/gpu: 1` with `runtimeClassName: nvidia` and running `nvidia-smi`.
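A minimal manifest for that smoke test might look like the following (the pod name and image tag are placeholders, not taken from the repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

`kubectl logs gpu-smoke-test` should then show the `nvidia-smi` table if the `nvidia` RuntimeClass and the device plugin are wired up correctly.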

### WSL2 GPU specifics

On WSL2 hosts, the GPU is exposed through `/dev/dxg` rather than native NVIDIA device nodes. When the NVIDIA device plugin is configured with a CDI-based device-list strategy, the generated CDI spec (`/var/run/cdi/k8s.device-plugin.nvidia.com-gpu.json`) must be transformed to list a device named `"0"` instead of `"all"`. The cluster entrypoint runs a background `inotifywait`-based watcher that detects these specs and writes a corrected version to `/var/run/cdi/openshell-wsl.json`. See [WSL2 CDI spec watcher](#wsl2-cdi-spec-watcher) for implementation details.


## Remote Image Transfer

```mermaid
2 changes: 2 additions & 0 deletions deploy/docker/Dockerfile.images
@@ -229,6 +229,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
    iptables \
    mount \
    dnsutils \
    inotify-tools \
    jq \
    && rm -rf /var/lib/apt/lists/*

COPY --from=k3s /bin/ /bin/
44 changes: 44 additions & 0 deletions deploy/docker/cluster-entrypoint.sh
@@ -317,6 +317,45 @@ fi
# the k3s manifests directory so the Helm controller installs it automatically.
# The nvidia-container-runtime binary is already on PATH (baked into the image)
# so k3s registers the "nvidia" RuntimeClass at startup.
CDI_SPEC_DIR="/var/run/cdi"
CDI_WSL_INPUT="${CDI_SPEC_DIR}/k8s.device-plugin.nvidia.com-gpu.json"
CDI_WSL_OUTPUT="${CDI_SPEC_DIR}/openshell-wsl.json"

transform_wsl_cdi_spec() {
    local tmp="${CDI_WSL_OUTPUT}.tmp.$$"
    if jq '.cdiVersion = "0.5.0" | .devices[0].name = "0"' \
        "$CDI_WSL_INPUT" > "$tmp" 2>/dev/null; then
        mv "$tmp" "$CDI_WSL_OUTPUT"
        echo "CDI: transformed WSL spec -> $CDI_WSL_OUTPUT"
    else
        rm -f "$tmp"
        echo "CDI: failed to transform WSL spec (jq error)"
    fi
}

watch_cdi_specs() {
    if ! command -v inotifywait > /dev/null 2>&1; then
        echo "CDI: inotifywait not found, skipping spec watcher"
        return 1
    fi

    mkdir -p "$CDI_SPEC_DIR"

    # Process spec already present at startup (e.g. gateway restart)
    if [ -f "$CDI_WSL_INPUT" ] && grep -q '/dev/dxg' "$CDI_WSL_INPUT" 2>/dev/null; then
        transform_wsl_cdi_spec
    fi

    # Watch for the spec to appear or be updated
    inotifywait -m -e close_write,moved_to --format '%f' "$CDI_SPEC_DIR" 2>/dev/null \
        | while IFS= read -r filename; do
            if [ "$filename" = "k8s.device-plugin.nvidia.com-gpu.json" ] \
                && grep -q '/dev/dxg' "$CDI_WSL_INPUT" 2>/dev/null; then
                transform_wsl_cdi_spec
            fi
        done
}

if [ "${GPU_ENABLED:-}" = "true" ]; then
    echo "GPU support enabled — deploying NVIDIA device plugin"

@@ -327,6 +366,11 @@ if [ "${GPU_ENABLED:-}" = "true" ]; then
cp "$manifest" "$K3S_MANIFESTS/"
done
fi

    if [ -c /dev/dxg ]; then
        echo "WSL2 GPU detected (/dev/dxg present) — starting CDI spec watcher"
        watch_cdi_specs &
    fi
fi

# ---------------------------------------------------------------------------
@@ -12,6 +12,11 @@
# (which requires nvidia.com/gpu.present=true) is overridden to empty
# so it schedules on any node without requiring NFD/GFD labels.
#
# The device plugin is set to deviceIDStrategy=index so that device names are
# numeric indices (e.g. "0"). This simplifies the conversion of CDI specs on WSL
# systems, where we need to rename the *.nvidia.com/gpu=all device that is
# generated by the device plugin to *.nvidia.com/gpu=0.
#
# k3s auto-detects nvidia-container-runtime on PATH and registers the "nvidia"
# RuntimeClass automatically, so no manual RuntimeClass manifest is needed.

@@ -28,6 +33,7 @@ spec:
  createNamespace: true
  valuesContent: |-
    runtimeClassName: nvidia
    deviceIDStrategy: index
    gfd:
      enabled: false
    nfd: