Add Chainsaw conformance tests for the Kubernetes engine #263

@resker

Problem

docs/engines/k8s.md currently ends with:

Validation and Testing

TBD

Topograph has Go unit tests and integration fixtures (tests/integration/payloads/*.json), but no end-to-end cluster-level suite that asserts "given a deployed chart, do the right labels and annotations actually land on nodes?" This gap is felt most acutely when changing the Helm chart, Node Observer, or engine output — there's no quick way to verify the full deployment-to-labels flow without manual cluster testing.

Proposed approach

Adopt Chainsaw (Kyverno team, Apache 2.0), a declarative Kubernetes E2E test framework widely used in Kubernetes OSS projects. Tests are YAML files that run apply → wait → assert → cleanup against a real cluster (kind in CI). Assertions are partial resource manifests, optionally containing JMESPath expressions, that Chainsaw matches against live cluster state. That fits Topograph's output model: node labels, annotations, and ConfigMaps are exactly the kind of state Chainsaw is designed to assert.
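As a rough illustration of that apply → wait → assert → cleanup flow, a label-application test might look like the sketch below. Everything specific in it is an assumption made for illustration, not something taken from the Topograph repo: the `topograph` namespace and Deployment name, the `localhost:49021` address (which presumes a port-forward is already running), the request body, the node name `kind-worker`, and the `network.topology.nvidia.com/*` label keys and values.

```yaml
# Sketch only: names, address, payload, and label keys/values are hypothetical.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: label-application
spec:
  steps:
    - name: wait-for-topograph
      try:
        # An assert is retried until it matches or the assert timeout
        # expires, so this step doubles as the "wait" phase.
        - assert:
            resource:
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: topograph        # hypothetical release name
                namespace: topograph   # hypothetical namespace
              status:
                # Parenthesised keys are JMESPath expressions.
                (availableReplicas > `0`): true
    - name: generate-topology
      try:
        - script:
            timeout: 30s
            content: |
              # Hypothetical request body and address.
              curl -fsS -X POST \
                -H 'Content-Type: application/json' \
                -d '{"provider":{"name":"test"},"engine":{"name":"k8s"}}' \
                http://localhost:49021/v1/generate
    - name: assert-node-labels
      try:
        - assert:
            resource:
              apiVersion: v1
              kind: Node
              metadata:
                name: kind-worker
                labels:
                  # Expected values would come from the toposim fixture.
                  network.topology.nvidia.com/leaf: leaf-1
                  network.topology.nvidia.com/spine: spine-1
```

Because the test provider and toposim model are deterministic, the final step can assert exact label values instead of only checking that a label exists.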

Why Chainsaw specifically:

  1. Topograph's Kubernetes engine writes cluster state — Chainsaw is purpose-built for asserting cluster state
  2. The test provider + toposim models give deterministic inputs (no real hardware, no cloud credentials) — tests stay reproducible
  3. Node Observer reactivity is testable — Chainsaw can add/remove nodes and assert the re-labeling occurs within the aggregation delay
  4. Strong CNCF ecosystem alignment — Chainsaw is the common pattern for declarative Kubernetes conformance testing
  5. Precedent in NVIDIA's own OSS — NVIDIA/aicr uses Chainsaw for AI Conformance testing (tests/chainsaw/ai-conformance/common/assert-kai-scheduler.yaml etc.), which gives us a known-good structural template
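Point 3 above could be exercised along the following lines. This is a hedged sketch, not a tested suite: the node name `sim-node-5`, the label key, and the 2m timeout standing in for the aggregation delay are all placeholders, and it assumes the toposim model already knows about the added node.

```yaml
# Sketch only: node name, label key, and timeout are hypothetical.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: node-observer-relabel
spec:
  timeouts:
    # Should comfortably exceed the Node Observer aggregation delay.
    assert: 2m
  steps:
    - name: add-node
      try:
        # Registers a bare Node object; Chainsaw deletes resources it
        # created during the cleanup phase.
        - apply:
            resource:
              apiVersion: v1
              kind: Node
              metadata:
                name: sim-node-5
    - name: assert-new-node-labeled
      try:
        - assert:
            resource:
              apiVersion: v1
              kind: Node
              metadata:
                name: sim-node-5
                labels:
                  # JMESPath: the label key must be present, any value.
                  (contains(keys(@), 'network.topology.nvidia.com/leaf')): true
```

The assert is retried until it matches or the timeout expires, which is what makes "re-labeling occurs within the aggregation delay" expressible without explicit sleeps.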

Self-contained local development

Chainsaw + the test provider + toposim together enable a self-contained local development loop — contributors can clone the repo, spin up kind, install the chart, and run the full E2E suite on their own workstation without NVIDIA hardware, cloud provider credentials, or access to a shared cluster. This covers the majority of engine-facing code paths:

| Code path | Self-contained with Chainsaw? |
| --- | --- |
| API server, aggregation, validation | ✅ |
| k8s engine output (labels) | ✅ |
| Slinky engine output (ConfigMap) | ✅ |
| Node Observer reactivity to node add/remove events | ✅ |
| Node Data Broker annotation lifecycle | ✅ |
| FNV-64a label truncation, canonical tree construction | ✅ |
| DRA provider (with nvidia.com/gpu.clique pre-populated on kind nodes) | ✅ |
| IB provider (requires mocking ibnetdiscover output) | ❌ (separate concern) |
| NetQ provider (requires mock NetQ API server) | ❌ (separate concern) |
| Cloud CSP providers (AWS / GCP / OCI / Nebius / Lambda AI / CW) | ❌ (covered by *Sim loaders elsewhere) |

Why this matters:

  • Lowers the contribution bar. A new contributor can verify their change without privileged access to anything.
  • Strong OSS maturity signal. "Can a new contributor reproduce tests locally without privileged access?" is a well-established OSS health check; Chainsaw + kind + test provider is the canonical answer.
  • Pairs with the planned how-to tutorial. The tutorial scenario (kind + toposim + Helm install + demo workload) is the same setup; one investment yields tutorial, demo, and E2E test fixtures.

Proposed layout

Mirror AICR's tests/chainsaw/ structure:

tests/chainsaw/
  chainsaw-config.yaml           # Global timeouts, namespace cleanup strategy
  README.md                      # How to run locally
  fixtures/
    toposim-2leaf-1spine.yaml    # 4-node fabric: 2 leaves under 1 spine, 1 NVLink domain
    helm-values-test.yaml        # provider=test + engine=k8s + toposim model reference
  suites/
    label-application/           # Deploy → POST /v1/generate → assert labels on nodes
    node-observer/               # Add node → assert regeneration + new node labeled
    data-broker-annotations/     # DaemonSet runs → assert topograph.nvidia.com/* annotations
    fnv-truncation/              # Long switch ID → assert x-prefixed hex value
    slinky-configmap/            # engine=slinky → assert ConfigMap contents + annotations
    topology-change/             # Swap toposim model → assert labels update, no stale leak

Deliverables

  • tests/chainsaw/ directory + suites above
  • Makefile: make e2e target that runs chainsaw test against an assumed-running kind cluster; optional make e2e-local that wraps kind create cluster + e2e + kind delete cluster
  • .github/workflows/e2e.yml — triggered on PRs touching charts/**, pkg/engines/k8s/**, pkg/node_observer/**, pkg/server/**, or tests/chainsaw/**. Uses helm/kind-action + kyverno/action-install-chainsaw
  • AGENTS.md + .claude/CLAUDE.md updates (same PR):
    • Repository map adds tests/chainsaw/
    • Commands section adds make e2e and make e2e-local
    • Testing and Deployment Workflows section describes the suite and self-contained local run instructions
    • Pre-push checklist mentions make e2e when a change touches the k8s engine, chart, server, or observer paths
    • PR guidelines note that chart/engine/observer changes should extend Chainsaw coverage
  • Update docs/engines/k8s.md — replace "Validation and Testing: TBD" with a pointer to the Chainsaw suite and the local-run instructions
  • tests/chainsaw/README.md — local run instructions with kind cluster bring-up, Chainsaw install, make e2e-local quickstart
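A minimal shape for .github/workflows/e2e.yml, assuming `make e2e` handles the chart install and invokes chainsaw test itself. The action version tags shown are placeholders; each should be pinned to a release that has been verified.

```yaml
# Sketch only: action versions are placeholders.
name: e2e
on:
  pull_request:
    paths:
      - "charts/**"
      - "pkg/engines/k8s/**"
      - "pkg/node_observer/**"
      - "pkg/server/**"
      - "tests/chainsaw/**"
jobs:
  chainsaw:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Creates a kind cluster and points kubectl at it.
      - uses: helm/kind-action@v1
      # Installs the chainsaw CLI on the runner.
      - uses: kyverno/action-install-chainsaw@v0.1.0
      - name: Run Chainsaw suites
        run: make e2e
```

Using `make e2e` here instead of `make e2e-local` matches the split described above: CI already provides the cluster through helm/kind-action, so only the "assumed-running cluster" target is needed.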

Out of scope for v1

  • InfiniBand provider tests — require mocking ibnetdiscover output across exec calls; defer to a follow-up
  • NetQ provider tests — require a mock NetQ API server; defer
  • Multi-region tests: toposim supports multiple regions, but the chart's DaemonSet + Observer wiring assumes a single region today
  • Performance / scale tests — separate concern; Chainsaw is not the right tool for benchmarking

Dependencies

  • Shares fixtures with the upcoming "how to" tutorial — a tutorial reader follows the same toposim model the Chainsaw test uses. "Tutorial works on your laptop" and "E2E test passes in CI" become the same claim.
  • Pairs with #256 (build: add make qualify pre-push aggregator), which adds the make qualify target. Extending make qualify to include e2e later would follow the AICR pattern.

Reference

  • AICR's Chainsaw setup: tests/chainsaw/ai-conformance/ in NVIDIA/aicr; the cleanest JMESPath examples are in common/assert-kai-scheduler.yaml
  • Chainsaw docs: https://kyverno.github.io/chainsaw/

Priority

Medium. Not blocking any currently open PR. Natural companion to the how-to-tutorial work and a strong OSS maturity signal. Best landed as a single PR covering the first two suites (label-application + node-observer) + the Makefile/CI workflow + AGENTS.md updates, with additional suites added incrementally.
