Add AKS Machine API CRUD benchmark pipeline by karenychen · Pull Request #1200 · Azure/telescope

karenychen · 2026-05-27T22:57:25Z

Summary

add the production k8s-cluster-crud-machine Azure DevOps pipeline scheduled daily at 12:00 UTC in canadacentral
add the scenario terraform inputs and machine engine/topology templates needed by competitive-test
align the engine command with main's scale-machine CLI by omitting the stale --region argument

Validation

/tmp/telescope-yamllint-venv/bin/yamllint -c .yamllint . --no-warnings
terraform fmt -check -diff scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars
jq . scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json
python YAML parse smoke for new pipeline/template files
PYTHONPATH=. python3 -m crud.main scale-machine --help
PYTHONPATH=. python3 -m pytest tests/test_aks_machine_client.py tests/test_crud_main.py tests/test_machine_crud.py

…rors

…nfig

…machine_agentpool

…ilures

…ch guard

Issue 1 (option 1): Delete the inner try/except in _create_batch_machines. The outer run_chunk in _scale_machine_batch already catches with logger.exception and excludes failed chunks from successful. Inner catch only logged the same traceback twice and its docstring lied about contract ('never raises'). Updated docstring to state truthfully that _make_batch_request may raise RuntimeError on exhausted 429 retries or non-2xx, and the caller is expected to catch. Issue 2: test_scale_machine_batch_partial_chunk_failure now pins the surviving chunk's names via sorted(...)[:2], matching the pattern in test_scale_machine_batch_worker_exception_isolated. Previous len()==2 assertion would have passed if both chunks partially succeeded. Minor 3: _scale_machine_batch return type changed from PEP 585 lowercase tuple to typing.Tuple for consistency with rest of file; Tuple added to typing import. Minor 5: _make_batch_request return annotation changed from requests.Response to None — caller discards the value, matches actual usage.

…r_name

…then soft tests

…ase-a]

The top-level /steps/{validate-resources,execute-tests,publish-results}.yml dispatchers pass regions(object) + engine/engine_input — not a singular region:string. Match the sibling k8s-crud-gpu convention by accepting regions:object and forwarding regions[0] to the engine sub-templates.

ADO resolves $(VAR) macro substitutions at runtime, after template parsing. type:number params reject these at parse time. Switch to type:string to match the use_batch_api pattern; the Python CLI parses them via int() / str2bool downstream.

Subscription quota for AKS Standalone test sub does not allow Standard_D8ads_v5 (the AKS default when default_node_pool=null). Pin default pool to Standard_D2_v3 which is in the allowed SKU list and matches sibling gh-telescope scenarios pattern of explicit default_node_pool config.

…to 200 H1: Port ado-telescope's _get_machine_name_prefix(scale_machine_count) verbatim so machine ARM resource names match across repos. Naming changes from 'tmach{i:04d}' to ado's '{prefix}-machine-{i+1}' (1-indexed) where prefix is 'scale{N}' or 'scale{N//1000}k' for multiples of 1000. Cross-repo Kusto dashboards keyed on machine name will now correlate. H2: Bump azure_eastus_large_machine_pool_staging scale_machine_count from 2 to 200 (the original intent — the previous value only exercised one batch chunk per worker) and step_time_out from 900 to 1800 to give 200 VMs the provisioning headroom they need. machine_workers=100 + use_batch_api=true unchanged, so this produces 2 batch chunks of 100 machines each.

…meout (default 20min)

…ase-a]

… baseline-aware readiness

… and node_readiness_time Mirrors the stdout logging that exists in ado-telescope so operators reading build logs see the same wall-clock numbers that already get serialized to the result blob. Two parity gaps: 1. node_readiness_time was never populated on the scale response (always 0 in uploaded JSON). Derive it from the max of the achieved percentile readiness times (closest analogue to ado's percentile_times[100]). 2. No stdout summary for percentile readiness or 'all nodes ready' wall-clock. Add three log lines matching ado azure.py:1497-1534 and a single 'Machine scaling completed in %.2f seconds' line matching ado machine_manager.py:198-201. Also add 'Agent pool creation completed' line for the create path matching ado machine_manager.py:158-161. No behavioral change to the API calls or readiness exit criteria.

…adiness envelope Extend node readiness measurement from {P50, P90, P99} to {P50, P70, P90, P99, P100} to match ado-telescope's percentile set and the existing production Kusto schema for k8s_cluster_crud_machine_main. Change percentile_node_readiness_times shape from flat float dict to the ado-parity nested envelope: {"P{p}": { "target_nodes": int, "elapsed_time_seconds": float | None, "percentage": int, "success": bool, }} This matches the shape ado-telescope writes to blob storage exactly (verified against perf-eval/k8s-cluster-crud-machine/main/155688977-a37198d8.json). node_readiness_time is now derived from P100.elapsed_time_seconds (the true wall-clock until all target nodes are Ready), replacing the max-of-achieved-percentiles approximation that was used when P100 was not tracked.

User m0739: 'github telescope fundamentally has a different structure. instead of trying to preserve the old code, think about how we can fix the equivalent logic into the same structure to achieve alignment with the github telescope code.' Replaces the ado-telescope-shaped `machine/` package with the gh-telescope CRUD pattern already used for node-pool operations: 1. clients/aks_machine_client.py - Now `class AKSMachineClient(AKSClient)`: subclass instead of composition. Inherits credential, ContainerServiceClient, KubernetesClient, get_cluster_name, get_cluster_data, and the existing node-pool CRUD methods (so callers can provision a baseline pool via inherited create_node_pool without a second client). - Public methods open an OperationContext exactly like AKSClient.create_node_pool / scale_node_pool: metadata is enriched with op.add_metadata along the way, success returns True, failures are logged and re-raised so OperationContext records them. - Module-level constants and the baseline-aware P50/P70/P90/P99/P100 readiness watcher kept intact. 2. crud/azure/machine_crud.py (new) - Thin try/except mirror of crud/azure/node_pool_crud.py. All telemetry happens in AKSMachineClient; this layer translates exceptions into False so the CLI dispatcher can map to an exit code. 3. crud/main.py - Adds create-machine / scale-machine subparsers and the matching handle_machine_operation dispatcher. - Adds _env_int_override / _env_bool_override helpers that ignore unresolved ADO $(VAR) substitutions, mirroring how the machine CLI used to layer env overrides on top of argparse defaults. 4. steps/engine/machine/k8s/{execute,collect}.yml - Invoke `python -m crud.main create-machine | scale-machine | collect` instead of the deleted machine.main entry point. 5. Deletes the entire modules/python/machine/ package, its tests, and the machine_collect_golden fixtures. crud/main.py::collect_benchmark_results already produces the gh-telescope row shape downstream Kusto ingestion expects. Verification - pylint repo-root on all changed files: 10.00/10. - pytest --cov=. --cov-fail-under=80 from modules/python: 713 passed, 1 skipped, 80.96% coverage. - 33 new unit tests cover AKSMachineClient, MachineCRUD, and the new crud/main.py helpers/dispatcher.

Repo .pylintrc already globally disables: too-many-locals, too-many-arguments, too-many-positional-arguments, too-many-instance-attributes, broad-exception-caught (and aliases like broad-except). The previous commit suppressed these inline in machine API code; remove those redundant annotations. Keep disables that are NOT in the global list: - unused-argument (on tags kwarg) - protected-access (in tests for _get_machine_name_prefix) - import-outside-toplevel (in tests for lazy patched imports) - invalid-name (pre-existing on NodePoolCRUD local alias) pylint 10.00/10. 33 machine-related tests pass.

…ger calls

…1185 PR Azure#1185 merged a per-worker batch contract that diverged from the local feat-branch copy. This commit takes origin/main's final versions of aks_machine_client.py and test_aks_machine_client.py verbatim so the feat branch stops carrying stale parallel copies of the same files.

Pipeline run 67972 logged 'urllib3.connectionpool - WARNING - Connection pool is full, discarding connection: management.azure.com' during the 50-worker individual scale path. urllib3's default HTTPAdapter caps pool_maxsize=10 per host; with machine_workers=50 the surplus connections are torn down and re-handshaked per request, defeating the Session reuse and adding TLS overhead under fan-out. Mount an HTTPAdapter with pool_connections=pool_maxsize=64 on self._session for both https:// and http:// so the pool holds one warm connection per worker thread (covers 50-worker individual + 4-worker batch paths with headroom; well below ARM per-tenant ceilings). Test: assert the mounted adapter advertises pool_connections == pool_maxsize == _HTTPS_POOL_SIZE >= 50. Verified: pytest test_aks_machine_client.py = 28 passed; pylint 10.00/10; full module suite --cov-fail-under=80 = 736 passed, 83.47% cov.

Run 67982 (200-machine batch test) reached P50=120.14s, P70=124.63s, P90=133.52s, P99=146.97s -- all green -- then timed out at P100=1200s because 2 of 200 nodes never reported Ready. All 200 PUTs succeeded (otherwise P99=198 would be unreachable); the tail was kubelet/ image-pull straggler noise, not a control-plane or API failure. Match ado-telescope behavior: hard-gate on P99 instead of P100. P100 is still computed, logged, and emitted to the 'percentile_node_readiness_times' metadata blob for Kusto trending -- it just no longer fails an otherwise-healthy run when a single node boots slowly on a large fan-out. Tests: - test_scale_machine_p99_met_p100_missed_does_not_raise: asserts that a straggler tail (P99 success, P100 miss) does not raise, and that the failed P100 entry is still present in metadata. - test_scale_machine_p99_missed_raises: asserts that a real readiness failure (P99 miss) still raises a RuntimeError naming P99.

…-machine # Conflicts: # modules/python/crud/azure/machine_crud.py # modules/python/crud/main.py

Copilot

Pull request overview

Adds a new scheduled AKS Machine API CRUD benchmark pipeline and the topology/engine/scenario inputs needed to run it through the existing competitive-test flow.

Changes:

Adds a daily Azure DevOps pipeline for k8s-cluster-crud-machine in canadacentral.
Introduces topology dispatch templates and Machine API execute/collect engine templates.
Adds Terraform scenario inputs and local test input JSON for the new benchmark.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`pipelines/perf-eval/Autoscale Benchmark/k8s-cluster-crud-machine.yml`	Defines the scheduled benchmark job and matrix values.
`steps/topology/k8s-cluster-crud-machine/validate-resources.yml`	Updates kubeconfig for the provisioned AKS cluster.
`steps/topology/k8s-cluster-crud-machine/execute-machine.yml`	Routes topology execution to the new Machine engine.
`steps/topology/k8s-cluster-crud-machine/collect-machine.yml`	Routes result collection to the new Machine engine.
`steps/engine/machine/k8s/execute.yml`	Runs `create-machine` and `scale-machine` benchmark commands.
`steps/engine/machine/k8s/collect.yml`	Collects Machine API benchmark result files.
`scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars`	Defines production AKS scenario provisioning inputs.
`scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json`	Adds Terraform test input data for Azure.

+  - name: region
+    type: string
+
+steps:


+          matrix:
+            small_machine_pool:
+              vm_size: Standard_D2_v3
+              pool_name: smallpool1


karenychen added 30 commits May 21, 2026 10:44

feat(crud-machine): add scenario terraform inputs

ac4f829

feat(crud-machine): add machine data classes

e74cb69

feat(crud-machine): add result handler with nested payload

33226c8

fix(crud-machine): delegate to save_info_to_file and isolate write er…

5ebed58

…rors

feat(crud-machine): add AKSMachineClient composing AKSClient

2fc85c0

fix(crud-machine): drop AKSMachineClient.cluster_name and load_kubeco…

f5c83a5

…nfig

feat(crud-machine): port create_machine_agentpool

5cd6956

fix(crud-machine): add docstrings and failure-branch tests to create_…

bac2e3b

…machine_agentpool

feat(crud-machine): port machine readiness percentile wait

5f0689c

fix(crud-machine): harden machine readiness wait against transient fa…

4fb3bfb

…ilures

feat(crud-machine): port scale_machine non-batch path

99df955

fix(crud-machine): bounded retry on missing provisioningState and bat…

da9ff1f

…ch guard

feat(crud-machine): port scale_machine batch path

350f74d

feat(crud-machine): add MachineManager orchestrator

db7a461

fix(crud-machine): early-return when create fails and validate cluste…

0e18207

…r_name

feat(crud-machine): add CLI entrypoint

0af4304

fix(crud-machine): harden CLI input validation and exception handling

0743f19

feat(crud-machine): aggregator with golden-file parity gate

5e51d53

fix(crud-machine): handle malformed payloads and drop unused cloud arg

4d15268

chore(crud-machine): satisfy pylint and coverage gates

c116367

fix(crud-machine): use k8s_client attribute name to match AKSClient

914370e

fix(crud-machine): drop dead SystemExit branch and unused arg, streng…

436f752

…then soft tests

feat(crud-machine): add machine engine step templates

69acbe4

feat(crud-machine): add topology step templates

d4dce4a

feat(crud-machine): add production pipeline (Autoscale Benchmark)

cb38e4d

test(crud-machine): stage in new-pipeline-test for E2E validation [ph…

b39bd35

…ase-a]

karenychen added 16 commits May 21, 2026 10:44

fix(crud-machine): G1 token cache + G2 BatchPutMachine header pattern

896a8c1

fix(crud-machine): I1 separate readiness-wait-timeout from request.ti…

95b4756

…meout (default 20min)

fix(crud-machine): J1 use ado verbatim BatchPutMachine envelope shape

742c234

test(crud-machine): large stage 4 workers x 50 machines per batch [ph…

8443a1f

…ase-a]

test(crud-machine): parameterize region in staging pipeline [phase-a]

5c3cc79

fix(crud-machine): drop per-machine ARM poll, single agentpool check,…

0091857

… baseline-aware readiness

chore(crud-machine): drop ENV_* override defense + adopt f-string log…

ab00a66

…ger calls

Merge remote-tracking branch 'origin/main' into feat/k8s-cluster-crud…

ebc745e

…-machine # Conflicts: # modules/python/crud/azure/machine_crud.py # modules/python/crud/main.py

Copilot AI review requested due to automatic review settings May 27, 2026 22:57

karenychen requested review from alyssa1303, anson627, liyu-ma, sumanthreddy29, vittoriasalim, wonderyl and xinWeiWei24 as code owners May 27, 2026 22:57

Copilot started reviewing on behalf of karenychen May 27, 2026 22:57 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread steps/engine/machine/k8s/collect.yml

- name: region

type: string

steps:

Comment thread pipelines/perf-eval/Autoscale Benchmark/k8s-cluster-crud-machine.yml

matrix:

small_machine_pool:

vm_size: Standard_D2_v3

pool_name: smallpool1

karenychen added 4 commits May 27, 2026 16:19

feat(crud-machine): use az cli for non-batch machine creates

4ba684e

ci: wire new pipeline test to machine crud scenario

4affdd3

ci: add larger machine crud test cases

d1af072

ci: align machine crud stage names

99523b7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AKS Machine API CRUD benchmark pipeline#1200

Add AKS Machine API CRUD benchmark pipeline#1200
karenychen wants to merge 54 commits into
Azure:mainfrom
karenychen:feat/k8s-cluster-crud-machine

karenychen commented May 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

karenychen commented May 27, 2026

Summary

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants