Closed
22 commits
40abdad
ci: harden H100 GPU qualification
yuanchen8911 Apr 25, 2026
ae0b717
ci: split aicr build artifacts
yuanchen8911 Apr 26, 2026
33417c3
ci: harden gpu cluster setup
yuanchen8911 Apr 26, 2026
b7ffe6b
ci: tune H100 workflow reliability
yuanchen8911 Apr 26, 2026
0834778
ci: address H100 review feedback
yuanchen8911 Apr 26, 2026
dfd2685
ci: clarify control-plane recovery handling
yuanchen8911 Apr 26, 2026
87d84a0
ci: address remaining review comments
yuanchen8911 Apr 26, 2026
d14f9c5
ci: bound H100 retry budgets
yuanchen8911 Apr 26, 2026
7a3505c
ci: install ko for Karpenter KWOK
yuanchen8911 Apr 26, 2026
c1ccd86
ci: retry KWOK Helm bootstrap
yuanchen8911 Apr 26, 2026
34f115e
ci: share control plane stability window
yuanchen8911 Apr 26, 2026
73dfd1e
ci: harden H100 runtime diagnostics
yuanchen8911 Apr 26, 2026
2887460
ci: harden H100 control plane and Dynamo retries
yuanchen8911 Apr 26, 2026
23b3d5f
ci: harden H100 control plane and Dynamo retries
yuanchen8911 Apr 26, 2026
0f308d6
Stabilize H100 GPU CI checks
yuanchen8911 Apr 26, 2026
22304c9
Address H100 CI review feedback
yuanchen8911 Apr 26, 2026
553051c
Avoid pull request events for GPU runners
yuanchen8911 Apr 26, 2026
ff99d6a
Address GPU CI review feedback
yuanchen8911 Apr 26, 2026
1d55329
ci: address GPU workflow review hardening
yuanchen8911 Apr 26, 2026
e1e931a
ci: harden H100 kind runtime workflows
yuanchen8911 Apr 27, 2026
727f1b0
ci: address follow-up GPU review feedback
yuanchen8911 Apr 27, 2026
4219435
Merge branch 'main' into ci-gpu-kind-fail-fast
mchmarny Apr 27, 2026
10 changes: 9 additions & 1 deletion .github/actions/README.md
@@ -4,6 +4,13 @@ This directory contains a modular, reusable GitHub Actions architecture optimize

## Composite Actions

### Script Conventions

Composite action helper scripts in this directory are intentionally portable
across checkout modes: keep them mode `0644` and invoke them as
`bash path/to/script.sh` from workflows or `action.yml` files. Do not rely on
executable bits or `./script.sh` invocation.
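
As a rough illustration of the convention above, the following sketch checks that a helper script carries no executable bit and then runs it through `bash` explicitly. The `is_portable_script` helper and the throwaway `helper.sh` file are hypothetical names used only for this example; they are not part of the repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if the path is a regular file
# with no executable bit set (i.e. something like mode 0644).
is_portable_script() {
  [[ -f "$1" && ! -x "$1" ]]
}

# Create a throwaway script to stand in for an action helper.
demo_dir="$(mktemp -d)"
printf '#!/usr/bin/env bash\necho "hello from helper"\n' > "${demo_dir}/helper.sh"
chmod 0644 "${demo_dir}/helper.sh"

# The script is not executable, so `./helper.sh` would fail,
# but invoking it through bash works regardless of the mode bits.
is_portable_script "${demo_dir}/helper.sh" && echo "mode ok"
demo_output="$(bash "${demo_dir}/helper.sh")"
echo "${demo_output}"
```

Invoking through `bash` means the script behaves the same whether or not the checkout preserved file modes, which is the portability the convention is after.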

### Core CI/CD Actions

#### `security-scan/`
@@ -50,7 +57,8 @@ This action runs `tools/setup-tools --skip-go --skip-docker` in auto mode, which
**When to use**: When you need version values in workflow steps
**Outputs**:
- `go`, `goreleaser`, `ko`, `crane`, `golangci_lint`, `yamllint`, `addlicense`
- `grype`, `kubectl`, `kind`, `ctlptl`, `tilt`, `helm`
- `grype`, `kubectl`, `kind`, `nvkind`, `ctlptl`, `tilt`, `helm`
- `kind_node_image`, `h100_kind_node_image`

**Example**:
```yaml
91 changes: 20 additions & 71 deletions .github/actions/aicr-build/action.yml
@@ -13,9 +13,17 @@
# limitations under the License.

name: 'AICR Build'
description: 'Builds the aicr validator image (via Dockerfile) and CLI binary, and loads the image into kind.'
description: 'Builds the aicr CLI and optional snapshot/validator images, and loads requested images into kind.'

inputs:
build_cli:
description: 'Build and stage the aicr CLI binary at the repository root'
required: false
default: 'true'
build_snapshot_agent:
description: 'Build the CUDA-based snapshot agent image and load it into kind'
required: false
default: 'true'
build_validators:
description: 'Deprecated: use validator_phases instead. Ignored when validator_phases is set.'
required: false
@@ -28,86 +36,27 @@ inputs:
runs:
using: 'composite'
steps:

- name: Install ko
shell: bash
run: |
KO_VERSION=$(yq eval '.build_tools.ko' .settings.yaml)
GOFLAGS= go install "github.com/google/ko@${KO_VERSION}"

- name: Build snapshot agent image and load into kind
- name: Build aicr CLI binary
if: inputs.build_cli == 'true' || inputs.build_snapshot_agent == 'true'
shell: bash
env:
GOFLAGS: -mod=vendor
run: |
# Build snapshot agent image with CUDA base (provides nvidia-smi for GPU detection).
# Uses cuda:base (~250MB) instead of cuda:runtime (~1.8GB) — only nvidia-smi is needed.
# GPU test workflows use --image=ko.local:smoke-test for aicr snapshot.
CGO_ENABLED=0 go build -trimpath -o dist/aicr ./cmd/aicr
docker build -t ko.local:smoke-test -f - . <<'DOCKERFILE'
FROM nvcr.io/nvidia/cuda:13.1.0-base-ubuntu24.04
COPY dist/aicr /usr/local/bin/aicr
ENTRYPOINT ["/usr/local/bin/aicr"]
DOCKERFILE
run: bash "${{ github.action_path }}/build-cli.sh"

# Load onto all nodes. The snapshot agent requests nvidia.com/gpu but
# does not set a node selector, so it can land on any GPU-capable node
# including the control-plane (e.g., T4 smoke test).
#
# Timeout is intentionally generous (900s per attempt). H100 self-hosted
# runners transfer images over a shared Docker-in-Docker bridge; large
# CUDA base images (~250MB compressed) combined with I/O contention from
# parallel GPU operator pods regularly exceed the previous 600s limit.
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local:smoke-test, retrying..."
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
}
- name: Build snapshot agent image and load into kind
if: inputs.build_snapshot_agent == 'true'
shell: bash
run: bash "${{ github.action_path }}/build-snapshot-agent.sh"

- name: Build validator images and load into kind
if: "!(inputs.validator_phases == 'none' || (inputs.validator_phases == '' && inputs.build_validators == 'false'))"
shell: bash
env:
GOFLAGS: -mod=vendor
run: |
# Determine which validator phases to build.
# validator_phases takes precedence; build_validators is a deprecated fallback.
if [[ -n "${{ inputs.validator_phases }}" ]]; then
if [[ "${{ inputs.validator_phases }}" == "none" ]]; then
echo "Skipping validator builds (validator_phases=none)"
exit 0
fi
PHASES="${{ inputs.validator_phases }}"
else
# Default: build all phases (backwards compatible)
PHASES="deployment,performance,conformance"
fi

# Compile only the requested validator binaries.
mkdir -p dist/validator
for phase in ${PHASES//,/ }; do
echo "Building validator binary: ${phase}"
CGO_ENABLED=0 go build -trimpath -o "dist/validator/${phase}" "./validators/${phase}"
done

for phase in ${PHASES//,/ }; do
mkdir -p "validators/${phase}/testdata"
docker build -t "ko.local/aicr-validators/${phase}:latest" -f - . <<DOCKERFILE
FROM gcr.io/distroless/static-debian12:nonroot
COPY dist/validator/${phase} /${phase}
COPY validators/${phase}/testdata /app/testdata
WORKDIR /app
USER nonroot
ENTRYPOINT ["/${phase}"]
DOCKERFILE
# Validator images are small (~30MB distroless), but share the same
# Docker-in-Docker bridge as the smoke-test load above. 600s per
# attempt accommodates I/O queuing behind concurrent image pulls.
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local/aicr-validators/${phase}:latest, retrying..."
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}"
}
done
VALIDATOR_PHASES: ${{ inputs.validator_phases }}
run: bash "${{ github.action_path }}/build-validator-images.sh"

- name: Stage aicr binary at repo root
if: inputs.build_cli == 'true'
shell: bash
run: cp dist/aicr ./aicr
run: bash "${{ github.action_path }}/stage-cli.sh"
19 changes: 19 additions & 0 deletions .github/actions/aicr-build/build-cli.sh
@@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

mkdir -p dist
CGO_ENABLED=0 go build -trimpath -o dist/aicr ./cmd/aicr
32 changes: 32 additions & 0 deletions .github/actions/aicr-build/build-snapshot-agent.sh
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

# Build snapshot agent image with CUDA base (provides nvidia-smi for GPU detection).
# Uses cuda:base (~250MB) instead of cuda:runtime (~1.8GB) because only nvidia-smi is needed.
timeout 900s docker build -t ko.local:smoke-test -f - . <<'DOCKERFILE'
FROM nvcr.io/nvidia/cuda:13.1.0-base-ubuntu24.04
COPY dist/aicr /usr/local/bin/aicr
ENTRYPOINT ["/usr/local/bin/aicr"]
DOCKERFILE

# Load onto all nodes. The snapshot agent requests nvidia.com/gpu but does not
# set a node selector, so it can land on any GPU-capable node including the
# control-plane in the L40G smoke test.
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local:smoke-test, retrying..."
timeout 900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
}
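
The load step above follows a "bounded attempt, retry once" pattern. A minimal sketch of how that pattern could be factored into a reusable function is shown below; `retry_once` is a hypothetical helper, not something this PR defines, and the demo command is a stand-in that fails on its first run and succeeds on its second.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run a command under `timeout`, retrying once on failure.
# Usage: retry_once <timeout-seconds> <label> <command> [args...]
retry_once() {
  local limit="$1" label="$2"
  shift 2
  timeout "${limit}" "$@" || {
    echo "::warning::attempt 1 failed for ${label}, retrying..."
    timeout "${limit}" "$@"
  }
}

# Stand-in for a flaky command: the first run writes a marker and fails,
# the second run sees the marker and succeeds.
state_file="$(mktemp)"
retry_output="$(retry_once 5 demo bash -c '
  if [[ -s "$1" ]]; then
    echo "loaded"
  else
    echo "seen" > "$1"
    exit 1
  fi
' _ "${state_file}")"
echo "${retry_output}"
```

With such a helper, the load in the script would read roughly as `retry_once 900 ko.local:smoke-test kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"`. Because `timeout` executes external commands, the wrapped command cannot be a shell function.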
59 changes: 59 additions & 0 deletions .github/actions/aicr-build/build-validator-images.sh
@@ -0,0 +1,59 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

VALIDATOR_PHASES="${VALIDATOR_PHASES:-}"
if [[ -n "${VALIDATOR_PHASES}" ]]; then
if [[ "${VALIDATOR_PHASES}" == "none" ]]; then
echo "Skipping validator builds (validator_phases=none)"
exit 0
fi
PHASES="${VALIDATOR_PHASES}"
else
# Default: build all phases (backwards compatible).
PHASES="deployment,performance,conformance"
fi

: "${KIND_CLUSTER_NAME:?KIND_CLUSTER_NAME must be set}"

mkdir -p dist/validator
for phase in ${PHASES//,/ }; do
if ! [[ "${phase}" =~ ^[a-z][a-z0-9_-]*$ ]]; then
echo "::error::invalid validator phase '${phase}'; expected ^[a-z][a-z0-9_-]*$"
exit 1
fi
echo "Building validator binary: ${phase}"
CGO_ENABLED=0 go build -trimpath -o "dist/validator/${phase}" "./validators/${phase}"
done

for phase in ${PHASES//,/ }; do
if [[ ! -d "validators/${phase}/testdata" ]]; then
echo "::warning::validators/${phase}/testdata is missing; creating empty testdata directory"
mkdir -p "validators/${phase}/testdata"
fi
docker build -t "ko.local/aicr-validators/${phase}:latest" -f - . <<DOCKERFILE
FROM gcr.io/distroless/static-debian12:nonroot
COPY dist/validator/${phase} /${phase}
COPY validators/${phase}/testdata /app/testdata
WORKDIR /app
USER nonroot
ENTRYPOINT ["/${phase}"]
DOCKERFILE
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}" || {
echo "::warning::kind load attempt 1 failed for ko.local/aicr-validators/${phase}:latest, retrying..."
timeout 600 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}"
}
done
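
The phase handling in this script (default list, comma splitting, and name validation) can be exercised in isolation. The sketch below reuses the script's default list and regular expression, but the `parse_phases` function itself is a hypothetical wrapper written for illustration, not code from the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: print one validated phase per line.
# An empty argument falls back to the script's default phase list.
parse_phases() {
  local raw="${1:-deployment,performance,conformance}" phase
  # Replace commas with spaces and rely on word splitting, as the script does.
  for phase in ${raw//,/ }; do
    if ! [[ "${phase}" =~ ^[a-z][a-z0-9_-]*$ ]]; then
      echo "invalid validator phase '${phase}'" >&2
      return 1
    fi
    echo "${phase}"
  done
}

parse_phases "deployment,conformance"
```

Validating each name before use matters here because the phase is interpolated into a `go build` path, an image tag, and an unquoted heredoc Dockerfile, so a value containing `/` or `..` would otherwise reach all three.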
18 changes: 18 additions & 0 deletions .github/actions/aicr-build/stage-cli.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -euo pipefail

cp dist/aicr ./aicr
90 changes: 90 additions & 0 deletions .github/actions/check-control-plane-health/action.yml
@@ -0,0 +1,90 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: 'Check Control Plane Health'
description: 'Fails if Kind control-plane static pods are missing, unready, or unstable.'

inputs:
cluster_name:
description: 'Kind cluster name'
required: true
namespace:
description: 'Namespace that contains the control-plane pods'
required: false
default: kube-system
components:
description: 'Space-separated component label values to check'
required: false
default: kube-apiserver kube-controller-manager kube-scheduler etcd
wait_timeout:
description: 'Timeout for each component readiness wait'
required: false
default: 60s
max_restarts:
description: 'Deprecated compatibility input; historical restart counts are reported but not capped'
required: false
default: '1'
stability_window:
description: 'Optional duration to watch for new control-plane restarts after pods are Ready'
required: false
default: '0s'
stability_probe_interval:
description: 'Interval for active API server probes during the stability window'
required: false
default: '10s'
stability_probe_failure_threshold:
description: 'Consecutive active stability probe failures allowed before failing'
required: false
default: '2'
lease_components:
description: 'Space-separated leader election lease names to check for freshness'
required: false
default: kube-controller-manager kube-scheduler
lease_stale_timeout:
description: 'Maximum allowed leader election lease age at the end of a stability window'
required: false
default: '120s'
recover_unhealthy:
description: 'Restart eligible Kind control-plane static pod containers when they are currently unhealthy'
required: false
default: 'false'
recovery_components:
description: 'Space-separated component label values eligible for recovery'
required: false
default: kube-controller-manager kube-scheduler kube-apiserver
max_recovery_attempts:
description: 'Maximum recovery attempts for each eligible component'
required: false
default: '1'

runs:
using: 'composite'
steps:
- name: Check control-plane pods
shell: bash
env:
KIND_CLUSTER_NAME: ${{ inputs.cluster_name }}
NAMESPACE: ${{ inputs.namespace }}
COMPONENTS: ${{ inputs.components }}
WAIT_TIMEOUT: ${{ inputs.wait_timeout }}
MAX_RESTARTS: ${{ inputs.max_restarts }}
STABILITY_WINDOW: ${{ inputs.stability_window }}
STABILITY_PROBE_INTERVAL: ${{ inputs.stability_probe_interval }}
STABILITY_PROBE_FAILURE_THRESHOLD: ${{ inputs.stability_probe_failure_threshold }}
LEASE_COMPONENTS: ${{ inputs.lease_components }}
LEASE_STALE_TIMEOUT: ${{ inputs.lease_stale_timeout }}
RECOVER_UNHEALTHY: ${{ inputs.recover_unhealthy }}
RECOVERY_COMPONENTS: ${{ inputs.recovery_components }}
MAX_RECOVERY_ATTEMPTS: ${{ inputs.max_recovery_attempts }}
run: bash "${{ github.action_path }}/check-control-plane-health.sh"
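
The `check-control-plane-health.sh` script is not shown in this part of the diff, so the following is only a guess at how a consecutive-failure threshold such as `stability_probe_failure_threshold` might behave. It assumes that failures up to the threshold are tolerated and that one more in a row fails the check; the real script may count differently. `watch_probes` and `fake_probe` are hypothetical names, and the probe is a stand-in rather than a real API server request.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch: fail once more than <threshold> probes fail in a row.
# Usage: watch_probes <threshold> <probe-count> <probe-command> [args...]
# The probe command receives the attempt number as its last argument.
watch_probes() {
  local threshold="$1" count="$2" consecutive=0 i
  shift 2
  for ((i = 1; i <= count; i++)); do
    if "$@" "${i}"; then
      consecutive=0   # any success resets the streak
    else
      consecutive=$((consecutive + 1))
      if ((consecutive > threshold)); then
        echo "probe failed ${consecutive} times in a row" >&2
        return 1
      fi
    fi
    # A real check would sleep for the probe interval here.
  done
  return 0
}

# Stand-in probe: fails on the attempt numbers listed in FAIL_ON.
fake_probe() {
  [[ " ${FAIL_ON} " != *" $1 "* ]]
}

FAIL_ON="2 3"
watch_probes 2 5 fake_probe && echo "stable"
```

Counting consecutive failures rather than total failures lets a single slow response during the stability window pass while still catching an API server that stays down.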