Problem

docs/engines/k8s.md currently ends with "Validation and Testing: TBD". Topograph has Go unit tests and integration fixtures (tests/integration/payloads/*.json), but no end-to-end, cluster-level suite that asserts: given a deployed chart, do the right labels and annotations actually land on nodes? This gap is felt most acutely when changing the Helm chart, the Node Observer, or engine output, since there is no quick way to verify the full deployment-to-labels flow without manual cluster testing.
Proposed approach
Adopt Chainsaw (Kyverno team, Apache 2.0) — the declarative Kubernetes E2E test framework used widely across Kubernetes OSS. Tests are YAML that run apply → wait → assert → cleanup against a real cluster (kind in CI). Assertions use JMESPath-style expressions to verify cluster state, which fits Topograph's output model precisely: node labels, annotations, and ConfigMaps are exactly what Chainsaw is designed to assert.
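To make the fit concrete, a minimal label-assertion test could look like the sketch below. The trigger fixture, node name, and label key/value are illustrative assumptions, not values from the repo; Chainsaw's `assert` performs a partial match against live cluster state.

```yaml
# Sketch of a Chainsaw test for the label-application flow.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: label-application
spec:
  steps:
    - try:
        # Trigger topology generation (mechanism is an assumption,
        # e.g. a Job that POSTs to /v1/generate).
        - apply:
            file: trigger-generate.yaml
        # Declaratively assert the expected node state.
        - assert:
            resource:
              apiVersion: v1
              kind: Node
              metadata:
                name: kind-worker
                labels:
                  network.topology.nvidia.com/block: leaf-1
```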
Why Chainsaw specifically:

- Topograph's Kubernetes engine writes cluster state; Chainsaw is purpose-built for asserting cluster state.
- The test provider + toposim models give deterministic inputs (no real hardware, no cloud credentials), so tests stay reproducible.
- Node Observer reactivity is testable: Chainsaw can add/remove nodes and assert that re-labeling occurs within the aggregation delay.
- Strong CNCF ecosystem alignment: Chainsaw is a common pattern for declarative Kubernetes conformance testing.
- Precedent in NVIDIA's own OSS: NVIDIA/aicr uses Chainsaw for AI Conformance testing (tests/chainsaw/ai-conformance/common/assert-kai-scheduler.yaml etc.), which gives us a known-good structural template.
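The Node Observer scenario in the list above could be expressed as a create-then-assert step whose timeout covers the aggregation delay. The fixture name, node name, label key, and timeout value below are assumptions for illustration:

```yaml
# Sketch: add a node, then assert it gets labeled within the delay.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: node-observer
spec:
  steps:
    - try:
        - create:
            file: extra-node.yaml   # a new Node manifest (assumed fixture)
        - assert:
            timeout: 2m             # generous bound on the aggregation delay
            resource:
              apiVersion: v1
              kind: Node
              metadata:
                name: extra-node
                labels:
                  network.topology.nvidia.com/block: leaf-2
```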
Self-contained local development
Chainsaw + the test provider + toposim together enable a self-contained local development loop — contributors can clone the repo, spin up kind, install the chart, and run the full E2E suite on their own workstation without NVIDIA hardware, cloud provider credentials, or access to a shared cluster. This covers the majority of engine-facing code paths:
| Code path | Self-contained with Chainsaw? |
| --- | --- |
| API server, aggregation, validation | ✅ |
| k8s engine output (labels) | ✅ |
| Slinky engine output (ConfigMap) | ✅ |
| Node Observer reactivity to node add/remove events | ✅ |
| Node Data Broker annotation lifecycle | ✅ |
| FNV-64a label truncation, canonical tree construction | ✅ |
| DRA provider (with nvidia.com/gpu.clique pre-populated on kind nodes) | ✅ |
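Concretely, the loop described above might run like this on a contributor workstation. The chart path, release name, and values-file name are assumptions:

```shell
# Self-contained local E2E loop; no NVIDIA hardware or cloud credentials needed.
kind create cluster --name topograph-e2e

# Install the chart with the test provider + toposim model wired in.
helm install topograph ./charts/topograph \
  -f tests/chainsaw/fixtures/helm-values-test.yaml

# Run the whole suite, then tear everything down.
chainsaw test tests/chainsaw/suites/
kind delete cluster --name topograph-e2e
```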
Why this matters:

- Lowers the contribution bar. A new contributor can verify their change without privileged access to anything.
- Strong OSS maturity signal. "Can a new contributor reproduce tests locally without privileged access?" is a well-established OSS health check; Chainsaw + kind + the test provider is the canonical answer.
- Pairs with the planned how-to tutorial. The tutorial scenario (kind + toposim + Helm install + demo workload) is the same setup; one investment yields the tutorial, a demo, and E2E test fixtures.
Proposed layout
Mirror AICR's tests/chainsaw/ structure:
```
tests/chainsaw/
├── chainsaw-config.yaml           # Global timeouts, namespace cleanup strategy
├── README.md                      # How to run locally
├── fixtures/
│   ├── toposim-2leaf-1spine.yaml  # 4-node fabric: 2 leaves under 1 spine, 1 NVLink domain
│   └── helm-values-test.yaml      # provider=test + engine=k8s + toposim model reference
└── suites/
    ├── label-application/         # Deploy → POST /v1/generate → assert labels on nodes
    ├── node-observer/             # Add node → assert regeneration + new node labeled
    ├── data-broker-annotations/   # DaemonSet runs → assert topograph.nvidia.com/* annotations
    ├── fnv-truncation/            # Long switch ID → assert x-prefixed hex value
    ├── slinky-configmap/          # engine=slinky → assert ConfigMap contents + annotations
    └── topology-change/           # Swap toposim model → assert labels update, no stale leak
```
Deliverables
- tests/chainsaw/ directory + the suites above
- Makefile: a make e2e target that runs chainsaw test against an assumed-running kind cluster; an optional make e2e-local that wraps kind create cluster + make e2e + kind delete cluster
- .github/workflows/e2e.yml: triggered on PRs touching charts/**, pkg/engines/k8s/**, pkg/node_observer/**, pkg/server/**, or tests/chainsaw/**; uses helm/kind-action + kyverno/action-install-chainsaw
- AGENTS.md + .claude/CLAUDE.md updates (same PR):
  - Repository map adds tests/chainsaw/
  - Commands section adds make e2e and make e2e-local
  - Testing and Deployment Workflows section describes the suite and the self-contained local run instructions
  - Pre-push checklist mentions make e2e when a change touches the k8s engine, chart, server, or observer paths
  - PR guidelines note that chart/engine/observer changes should extend Chainsaw coverage
- docs/engines/k8s.md: replace "Validation and Testing: TBD" with a pointer to the Chainsaw suite and the local-run instructions
- tests/chainsaw/README.md: local run instructions with kind cluster bring-up, Chainsaw install, and a make e2e-local quickstart
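The CI deliverable could be sketched as follows, using the two actions named above. Action version pins, the job layout, and the helm install step are assumptions:

```yaml
# Hypothetical sketch of .github/workflows/e2e.yml
name: e2e
on:
  pull_request:
    paths:
      - "charts/**"
      - "pkg/engines/k8s/**"
      - "pkg/node_observer/**"
      - "pkg/server/**"
      - "tests/chainsaw/**"
jobs:
  chainsaw:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: helm/kind-action@v1   # brings up a kind cluster
      - uses: kyverno/action-install-chainsaw@v0.2.0
      - run: |
          helm install topograph ./charts/topograph \
            -f tests/chainsaw/fixtures/helm-values-test.yaml
      - run: make e2e
```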
Out of scope for v1
- InfiniBand provider tests: require mocking ibnetdiscover output across exec calls; defer to a follow-up
- NetQ provider tests: require a mock NetQ API server; defer
- Multi-region tests: toposim supports multiple regions, but the chart's DaemonSet + Observer wiring assumes a single region today
- Performance / scale tests: a separate concern; Chainsaw is not the right tool for benchmarking
Dependencies
- Shares fixtures with the upcoming "how to" tutorial: a tutorial reader follows the same toposim model the Chainsaw test uses. "Tutorial works on your laptop" and "E2E test passes in CI" become the same claim.

Reference

~/dev/aicr/tests/chainsaw/ai-conformance/: cleanest JMESPath examples at common/assert-kai-scheduler.yaml

Priority

Medium. Not blocking any currently open PR. A natural companion to the how-to-tutorial work and a strong OSS maturity signal. Best landed as a single PR covering the first two suites (label-application + node-observer) plus the Makefile/CI workflow and AGENTS.md updates, with additional suites added incrementally.