feat(crud-machine-ci): wire create-machine/scale-machine into ADO pipeline #1194

Closed
karenychen wants to merge 3 commits into main from pr/crud-machine-ci-wiring

@karenychen commented May 26, 2026

Summary

Wires the AKS Machine API CLI (create-machine and scale-machine, shipped in PR #1187) into Telescope CI end-to-end: a new scenario, topology, engine step, and pipeline file.

Builds directly on current main — no Python or Terraform module changes.

Design (Option 2 hybrid)

Intentional divergence from the internal ado-telescope mirror to improve pipeline observability:

  • Reuse the existing crud engine instead of the raw engine. This mirrors the separate create-machine / scale-machine subcommands from PR #1187 (feat(crud-machine): AKS Machine API client + CRUD layer + CLI), so create and scale are two discrete, observable pipeline steps rather than one Python process branching on --resource_type.
  • New scenario scenarios/perf-eval/k8s-cluster-crud-machine/ with default_node_pool = null. The Machines-mode pool is born via the ARM PUT performed inside create-machine; Terraform doesn't pretend to own it.
  • System pool injected at job time via the existing SYSTEM_NODE_POOL plumbing in steps/terraform/set-input-variables-azure.yml — no Terraform module changes.
  • New topology folder steps/topology/k8s-crud-machine/ mirroring the k8s-crud-gpu shape (validate-resources / execute-crud / collect-crud).
  • New sibling engine step steps/engine/crud/k8s/execute-machine.yml. Does not modify the existing execute.yml.
  • Teardown via terraform destroy only (no CLI delete subcommand exists today).
  • Region: canadacentral.
  • Schedule: daily at 12:00 UTC.
  • use_batch_api gated at bash level (matching the ado-telescope idiom) because the matrix value is a runtime variable and ADO template conditionals run at compile time.
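
The bash-level gate in the last bullet can be sketched roughly as follows. This is an illustration only, not the contents of execute-machine.yml: the helper name `build_batch_args` is hypothetical, and only the `USE_BATCH_API` variable and the `--use-batch-api` flag come from this PR. The PR's own step lowercases the value with `${USE_BATCH_API,,}`, which needs bash 4 or later; the sketch swaps in a `case` pattern as a portable alternative.

```shell
#!/usr/bin/env sh
# Hypothetical helper: prints --use-batch-api only when the value passed in
# is "true" (any letter case); prints nothing for any other value.
build_batch_args() {
  case "${1:-false}" in
    [Tt][Rr][Uu][Ee]) echo "--use-batch-api" ;;
  esac
}

# The matrix value only exists at runtime (here read from the environment),
# so the decision is made in the script rather than in a compile-time
# template conditional.
flag="$(build_batch_args "${USE_BATCH_API:-false}")"

# ${flag:+"$flag"} expands to nothing when flag is empty, so no empty
# argument would be passed to the command that receives it.
echo "optional flags:" ${flag:+"$flag"}
```

Running the sketch with `USE_BATCH_API=True` in the environment prints the flag; with the variable unset or set to anything else, no flag is emitted.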

Files

Added (7 files):

scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars
scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json
steps/engine/crud/k8s/execute-machine.yml
steps/topology/k8s-crud-machine/validate-resources.yml
steps/topology/k8s-crud-machine/execute-crud.yml
steps/topology/k8s-crud-machine/collect-crud.yml
pipelines/perf-eval/AKS Machine API Benchmark/k8s-cluster-crud-machine.yml

Not modified: modules/python/crud/, modules/terraform/azure/aks-cli/, jobs/competitive-test.yml, steps/engine/crud/k8s/execute.yml, steps/engine/crud/k8s/collect.yml, steps/topology/k8s-crud-gpu/.

Pipeline matrix

Three entries (small/medium/large) varying pool size (1/50/100 machines), step timeout (600/600/900s), and use_batch_api (false/false/true). Daily cron at 12:00 UTC in canadacentral. max_parallel: 1.
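
For readers who want the matrix laid out entry by entry, here is a small sketch. It assumes the slash-separated values above pair up by position (small with 1/600/false, and so on); the function names and column layout are hypothetical and do not reflect the actual pipeline YAML. Daily 12:00 UTC is written as the five-field cron string `0 12 * * *`, which is how ADO scheduled triggers express it.

```shell
#!/usr/bin/env sh
# Matrix entries as read from the PR text, one per line:
# name, machine count, step timeout (seconds), use_batch_api.
matrix_entries() {
  cat <<'EOF'
small 1 600 false
medium 50 600 false
large 100 900 true
EOF
}

# Daily at 12:00 UTC as a five-field cron expression.
DAILY_CRON="0 12 * * *"

# Print one human-readable line per matrix entry.
describe_matrix() {
  matrix_entries | while read -r name machines step_timeout batch; do
    echo "$name: machines=$machines step_timeout=${step_timeout}s use_batch_api=$batch"
  done
}

describe_matrix
echo "schedule: $DAILY_CRON"
```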

Verification

  • yamllint -c .yamllint . --no-warnings
  • terraform fmt --check --diff scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars
  • python3 YAML parse for all new YAML files and JSON parse for terraform-test-inputs/azure.json
  • git diff --check

CI on this PR will validate the same.
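
The checks above could be bundled into one local script along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the PR does not say how the "python3 YAML parse" was done (PyYAML's `yaml.safe_load` is assumed here, with `python3 -m json.tool` for the JSON file), and the script is meant to run from the repository root with yamllint, terraform, python3, and git installed.

```shell
#!/usr/bin/env sh
SCENARIO="scenarios/perf-eval/k8s-cluster-crud-machine"

# New YAML files added by this PR, one per line (one path contains spaces).
new_yaml_files() {
  cat <<'EOF'
steps/engine/crud/k8s/execute-machine.yml
steps/topology/k8s-crud-machine/validate-resources.yml
steps/topology/k8s-crud-machine/execute-crud.yml
steps/topology/k8s-crud-machine/collect-crud.yml
pipelines/perf-eval/AKS Machine API Benchmark/k8s-cluster-crud-machine.yml
EOF
}

# Hypothetical wrapper: runs each check in order and stops at the first failure.
run_checks() {
  yamllint -c .yamllint . --no-warnings || return 1
  terraform fmt --check --diff "$SCENARIO/terraform-inputs/azure.tfvars" || return 1
  # Assumption: PyYAML is installed; safe_load raises on malformed YAML.
  new_yaml_files | while IFS= read -r f; do
    python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' "$f" || exit 1
  done || return 1
  # json.tool exits non-zero on invalid JSON; its formatted output is discarded.
  python3 -m json.tool "$SCENARIO/terraform-test-inputs/azure.json" >/dev/null || return 1
  git diff --check || return 1
}
```

The script only defines the functions. After sourcing it, calling `run_checks && echo "all checks passed"` from the repository root would run the full sequence.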

Follow-ups

  • 5k-variant pipeline (mirrors ado-telescope's k8s-cluster-crud-machine-5k.yml)
  • Per-machine delete-machine CLI subcommand if the platform team wants explicit teardown observability


Copilot AI left a comment


Pull request overview

This PR wires the AKS Machine API CLI path (create-machine / scale-machine) into Telescope’s Azure DevOps perf-eval CI by adding a dedicated scenario, topology templates, a new CRUD engine step, and a scheduled pipeline definition.

Changes:

  • Adds a new perf-eval scenario (k8s-cluster-crud-machine) that provisions an AKS cluster shell via Terraform with default_node_pool = null, relying on runtime-injected system pool config.
  • Introduces a new topology (k8s-crud-machine) and a new CRUD engine step (execute-machine.yml) that runs create and scale as two discrete pipeline steps.
  • Adds a scheduled ADO pipeline with small/medium/large matrix entries and batch API gating.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars | Defines the new scenario’s Terraform inputs, including default_node_pool = null rationale. |
| scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json | Provides example JSON inputs for Terraform tests (system pool injection). |
| steps/engine/crud/k8s/execute-machine.yml | New engine step to run create-machine and scale-machine as separate observable steps. |
| steps/topology/k8s-crud-machine/validate-resources.yml | Topology validation step wiring kubeconfig for the scenario. |
| steps/topology/k8s-crud-machine/execute-crud.yml | Topology execution step that calls the new execute-machine engine template. |
| steps/topology/k8s-crud-machine/collect-crud.yml | Topology collection step wiring into the existing CRUD collect template. |
| pipelines/perf-eval/AKS Machine API Benchmark/k8s-cluster-crud-machine.yml | New scheduled pipeline and matrix for the Machine API CRUD benchmark. |

--machine-workers "$MACHINE_WORKERS" \
--readiness-wait-timeout "$READINESS_WAIT_TIMEOUT" \
--step-timeout "$STEP_TIME_OUT" \
$([ "${USE_BATCH_API,,}" = "true" ] && echo "--use-batch-api")
Comment thread scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json Outdated
Comment thread pipelines/perf-eval/AKS Machine API Benchmark/k8s-cluster-crud-machine.yml Outdated
Comment thread steps/engine/crud/k8s/execute-machine.yml
@karenychen force-pushed the pr/crud-machine-ci-wiring branch from b67130d to 8f5a5ae on May 27, 2026 18:42
…eline

Add scenario, topology, engine step, and pipeline file to exercise the
AKS Machine API CLI (`create-machine` and `scale-machine`, shipped
in PR #1187) end-to-end in the Telescope CI.

Design choices (Option 2 hybrid, intentional divergence from
ado-telescope):
  * Reuse the existing `crud` engine instead of introducing a new
    `raw` engine path; mirror PR #1187's separate subcommands so
    create and scale are two discrete observable pipeline steps.
  * New scenario `scenarios/perf-eval/k8s-cluster-crud-machine/`
    with `default_node_pool = null` (the Machines-mode pool is
    born via the ARM PUT performed by `create-machine`).
  * System pool is injected at job time via the existing
    `SYSTEM_NODE_POOL` plumbing in
    `steps/terraform/set-input-variables-azure.yml`.
  * New topology folder `steps/topology/k8s-crud-machine/` mirroring
    the `k8s-crud-gpu` shape (validate-resources / execute-crud /
    collect-crud).
  * New sibling engine step `steps/engine/crud/k8s/execute-machine.yml`
    that does NOT modify the existing `execute.yml`.
  * Teardown via terraform destroy only (no CLI delete step today).
  * Region: westus2.
  * `use_batch_api` is gated at the bash level (matching the
    ado-telescope idiom) because the matrix value is a runtime variable
    and ADO template conditionals run at compile time.
@karenychen force-pushed the pr/crud-machine-ci-wiring branch from 8f5a5ae to c148bbf on May 27, 2026 22:00
@karenychen

Closing this iteration so we can rework the pipeline approach and try again.

@karenychen closed this on May 27, 2026
@karenychen deleted the pr/crud-machine-ci-wiring branch on May 27, 2026 22:50