Skip to content

Add AKS Machine API CRUD benchmark pipeline#1200

Open
karenychen wants to merge 54 commits into
Azure:mainfrom
karenychen:feat/k8s-cluster-crud-machine
Open

Add AKS Machine API CRUD benchmark pipeline#1200
karenychen wants to merge 54 commits into
Azure:mainfrom
karenychen:feat/k8s-cluster-crud-machine

Conversation

@karenychen
Copy link
Copy Markdown
Contributor

Summary

  • add the production k8s-cluster-crud-machine Azure DevOps pipeline scheduled daily at 12:00 UTC in canadacentral
  • add the scenario terraform inputs and machine engine/topology templates needed by competitive-test
  • align the engine command with main's scale-machine CLI by omitting the stale --region argument

Validation

  • /tmp/telescope-yamllint-venv/bin/yamllint -c .yamllint . --no-warnings
  • terraform fmt -check -diff scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars
  • jq . scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json
  • python YAML parse smoke for new pipeline/template files
  • PYTHONPATH=. python3 -m crud.main scale-machine --help
  • PYTHONPATH=. python3 -m pytest tests/test_aks_machine_client.py tests/test_crud_main.py tests/test_machine_crud.py

karenychen added 30 commits May 21, 2026 10:44
Issue 1 (option 1): Delete the inner try/except in _create_batch_machines.
The outer run_chunk in _scale_machine_batch already catches with logger.exception
and excludes failed chunks from successful. Inner catch only logged the same
traceback twice and its docstring lied about contract ('never raises'). Updated
docstring to state truthfully that _make_batch_request may raise RuntimeError
on exhausted 429 retries or non-2xx, and the caller is expected to catch.

Issue 2: test_scale_machine_batch_partial_chunk_failure now pins the surviving
chunk's names via sorted(...)[:2], matching the pattern in
test_scale_machine_batch_worker_exception_isolated. Previous len()==2 assertion
would have passed if both chunks partially succeeded.

Minor 3: _scale_machine_batch return type changed from PEP 585 lowercase tuple
to typing.Tuple for consistency with rest of file; Tuple added to typing import.

Minor 5: _make_batch_request return annotation changed from requests.Response
to None — caller discards the value, matches actual usage.
The top-level /steps/{validate-resources,execute-tests,publish-results}.yml
dispatchers pass regions(object) + engine/engine_input — not a singular
region:string. Match the sibling k8s-crud-gpu convention by accepting
regions:object and forwarding regions[0] to the engine sub-templates.
ADO resolves $(VAR) macro substitutions at runtime, after template
parsing. type:number params reject these at parse time. Switch to
type:string to match the use_batch_api pattern; the Python CLI parses
them via int() / str2bool downstream.
Subscription quota for AKS Standalone test sub does not allow Standard_D8ads_v5 (the AKS default when default_node_pool=null). Pin default pool to Standard_D2_v3 which is in the allowed SKU list and matches sibling gh-telescope scenarios pattern of explicit default_node_pool config.
karenychen added 16 commits May 21, 2026 10:44
…to 200

H1: Port ado-telescope's _get_machine_name_prefix(scale_machine_count) verbatim
so machine ARM resource names match across repos. Naming changes from
'tmach{i:04d}' to ado's '{prefix}-machine-{i+1}' (1-indexed) where prefix is
'scale{N}' or 'scale{N//1000}k' for multiples of 1000. Cross-repo Kusto
dashboards keyed on machine name will now correlate.

H2: Bump azure_eastus_large_machine_pool_staging scale_machine_count from 2
to 200 (the original intent — the previous value only exercised one batch
chunk per worker) and step_time_out from 900 to 1800 to give 200 VMs the
provisioning headroom they need. machine_workers=100 + use_batch_api=true
unchanged, so this produces 2 batch chunks of 100 machines each.
… and node_readiness_time

Mirrors the stdout logging that exists in ado-telescope so operators reading
build logs see the same wall-clock numbers that already get serialized to the
result blob. Two parity gaps:

1. node_readiness_time was never populated on the scale response (always 0 in
   uploaded JSON). Derive it from the max of the achieved percentile readiness
   times (closest analogue to ado's percentile_times[100]).

2. No stdout summary for percentile readiness or 'all nodes ready' wall-clock.
   Add three log lines matching ado azure.py:1497-1534 and a single
   'Machine scaling completed in %.2f seconds' line matching ado
   machine_manager.py:198-201. Also add 'Agent pool creation completed' line
   for the create path matching ado machine_manager.py:158-161.

No behavioral change to the API calls or readiness exit criteria.
…adiness envelope

Extend node readiness measurement from {P50, P90, P99} to
{P50, P70, P90, P99, P100} to match ado-telescope's percentile set and
the existing production Kusto schema for k8s_cluster_crud_machine_main.

Change percentile_node_readiness_times shape from flat float dict to
the ado-parity nested envelope:

  {"P{p}": {
      "target_nodes": int,
      "elapsed_time_seconds": float | None,
      "percentage": int,
      "success": bool,
  }}

This matches the shape ado-telescope writes to blob storage exactly
(verified against perf-eval/k8s-cluster-crud-machine/main/155688977-a37198d8.json).

node_readiness_time is now derived from P100.elapsed_time_seconds (the
true wall-clock until all target nodes are Ready), replacing the
max-of-achieved-percentiles approximation that was used when P100 was
not tracked.
User m0739: 'github telescope fundamentally has a different structure.
instead of trying to preserve the old code, think about how we can fix the
equivalent logic into the same structure to achieve alignment with the
github telescope code.'

Replaces the ado-telescope-shaped `machine/` package with the gh-telescope
CRUD pattern already used for node-pool operations:

1. clients/aks_machine_client.py
   - Now `class AKSMachineClient(AKSClient)`: subclass instead of composition.
     Inherits credential, ContainerServiceClient, KubernetesClient,
     get_cluster_name, get_cluster_data, and the existing node-pool CRUD
     methods (so callers can provision a baseline pool via inherited
     create_node_pool without a second client).
   - Public methods open an OperationContext exactly like
     AKSClient.create_node_pool / scale_node_pool: metadata is enriched
     with op.add_metadata along the way, success returns True, failures
     are logged and re-raised so OperationContext records them.
   - Module-level constants and the baseline-aware P50/P70/P90/P99/P100
     readiness watcher kept intact.

2. crud/azure/machine_crud.py (new)
   - Thin try/except mirror of crud/azure/node_pool_crud.py. All telemetry
     happens in AKSMachineClient; this layer translates exceptions into
     False so the CLI dispatcher can map to an exit code.

3. crud/main.py
   - Adds create-machine / scale-machine subparsers and the matching
     handle_machine_operation dispatcher.
   - Adds _env_int_override / _env_bool_override helpers that ignore
     unresolved ADO $(VAR) substitutions, mirroring how the machine CLI
     used to layer env overrides on top of argparse defaults.

4. steps/engine/machine/k8s/{execute,collect}.yml
   - Invoke `python -m crud.main create-machine | scale-machine | collect`
     instead of the deleted machine.main entry point.

5. Deletes the entire modules/python/machine/ package, its tests, and the
   machine_collect_golden fixtures. crud/main.py::collect_benchmark_results
   already produces the gh-telescope row shape downstream Kusto ingestion
   expects.

Verification
- pylint repo-root on all changed files: 10.00/10.
- pytest --cov=. --cov-fail-under=80 from modules/python:
  713 passed, 1 skipped, 80.96% coverage.
- 33 new unit tests cover AKSMachineClient, MachineCRUD, and the new
  crud/main.py helpers/dispatcher.
Repo .pylintrc already globally disables: too-many-locals,
too-many-arguments, too-many-positional-arguments, too-many-instance-attributes,
broad-exception-caught (and aliases like broad-except). The previous commit
suppressed these inline in machine API code; remove those redundant
annotations.

Keep disables that are NOT in the global list:
- unused-argument (on tags kwarg)
- protected-access (in tests for _get_machine_name_prefix)
- import-outside-toplevel (in tests for lazy patched imports)
- invalid-name (pre-existing on NodePoolCRUD local alias)

pylint 10.00/10. 33 machine-related tests pass.
…1185

PR Azure#1185 merged a per-worker batch contract that diverged from the local feat-branch copy. This commit takes origin/main's final versions of aks_machine_client.py and test_aks_machine_client.py verbatim so the feat branch stops carrying stale parallel copies of the same files.
Pipeline run 67972 logged 'urllib3.connectionpool - WARNING - Connection
pool is full, discarding connection: management.azure.com' during the
50-worker individual scale path. urllib3's default HTTPAdapter caps
pool_maxsize=10 per host; with machine_workers=50 the surplus
connections are torn down and re-handshaked per request, defeating the
Session reuse and adding TLS overhead under fan-out.

Mount an HTTPAdapter with pool_connections=pool_maxsize=64 on
self._session for both https:// and http:// so the pool holds one warm
connection per worker thread (covers 50-worker individual + 4-worker
batch paths with headroom; well below ARM per-tenant ceilings).

Test: assert the mounted adapter advertises pool_connections == pool_maxsize
== _HTTPS_POOL_SIZE >= 50.

Verified: pytest test_aks_machine_client.py = 28 passed; pylint 10.00/10;
full module suite --cov-fail-under=80 = 736 passed, 83.47% cov.
Run 67982 (200-machine batch test) reached P50=120.14s, P70=124.63s,
P90=133.52s, P99=146.97s -- all green -- then timed out at P100=1200s
because 2 of 200 nodes never reported Ready. All 200 PUTs succeeded
(otherwise P99=198 would be unreachable); the tail was kubelet/
image-pull straggler noise, not a control-plane or API failure.

Match ado-telescope behavior: hard-gate on P99 instead of P100. P100
is still computed, logged, and emitted to the
'percentile_node_readiness_times' metadata blob for Kusto trending --
it just no longer fails an otherwise-healthy run when a single node
boots slowly on a large fan-out.

Tests:
- test_scale_machine_p99_met_p100_missed_does_not_raise: asserts that
  a straggler tail (P99 success, P100 miss) does not raise, and that
  the failed P100 entry is still present in metadata.
- test_scale_machine_p99_missed_raises: asserts that a real readiness
  failure (P99 miss) still raises a RuntimeError naming P99.
…-machine

# Conflicts:
#	modules/python/crud/azure/machine_crud.py
#	modules/python/crud/main.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new scheduled AKS Machine API CRUD benchmark pipeline and the topology/engine/scenario inputs needed to run it through the existing competitive-test flow.

Changes:

  • Adds a daily Azure DevOps pipeline for k8s-cluster-crud-machine in canadacentral.
  • Introduces topology dispatch templates and Machine API execute/collect engine templates.
  • Adds Terraform scenario inputs and local test input JSON for the new benchmark.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pipelines/perf-eval/Autoscale Benchmark/k8s-cluster-crud-machine.yml Defines the scheduled benchmark job and matrix values.
steps/topology/k8s-cluster-crud-machine/validate-resources.yml Updates kubeconfig for the provisioned AKS cluster.
steps/topology/k8s-cluster-crud-machine/execute-machine.yml Routes topology execution to the new Machine engine.
steps/topology/k8s-cluster-crud-machine/collect-machine.yml Routes result collection to the new Machine engine.
steps/engine/machine/k8s/execute.yml Runs create-machine and scale-machine benchmark commands.
steps/engine/machine/k8s/collect.yml Collects Machine API benchmark result files.
scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars Defines production AKS scenario provisioning inputs.
scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json Adds Terraform test input data for Azure.

- name: region
type: string

steps:
matrix:
small_machine_pool:
vm_size: Standard_D2_v3
pool_name: smallpool1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants