Add AKS Machine API CRUD benchmark pipeline#1200
Open
karenychen wants to merge 54 commits into
Open
Conversation
…machine_agentpool
Issue 1 (option 1): Delete the inner try/except in _create_batch_machines.
The outer run_chunk in _scale_machine_batch already catches with logger.exception
and excludes failed chunks from successful. Inner catch only logged the same
traceback twice and its docstring lied about contract ('never raises'). Updated
docstring to state truthfully that _make_batch_request may raise RuntimeError
on exhausted 429 retries or non-2xx, and the caller is expected to catch.
Issue 2: test_scale_machine_batch_partial_chunk_failure now pins the surviving
chunk's names via sorted(...)[:2], matching the pattern in
test_scale_machine_batch_worker_exception_isolated. Previous len()==2 assertion
would have passed if both chunks partially succeeded.
Minor 3: _scale_machine_batch return type changed from PEP 585 lowercase tuple
to typing.Tuple for consistency with rest of file; Tuple added to typing import.
Minor 5: _make_batch_request return annotation changed from requests.Response
to None — caller discards the value, matches actual usage.
The top-level /steps/{validate-resources,execute-tests,publish-results}.yml
dispatchers pass regions(object) + engine/engine_input — not a singular
region:string. Match the sibling k8s-crud-gpu convention by accepting
regions:object and forwarding regions[0] to the engine sub-templates.
ADO resolves $(VAR) macro substitutions at runtime, after template parsing. type:number params reject these at parse time. Switch to type:string to match the use_batch_api pattern; the Python CLI parses them via int() / str2bool downstream.
Subscription quota for AKS Standalone test sub does not allow Standard_D8ads_v5 (the AKS default when default_node_pool=null). Pin default pool to Standard_D2_v3 which is in the allowed SKU list and matches sibling gh-telescope scenarios pattern of explicit default_node_pool config.
…to 200
H1: Port ado-telescope's _get_machine_name_prefix(scale_machine_count) verbatim
so machine ARM resource names match across repos. Naming changes from
'tmach{i:04d}' to ado's '{prefix}-machine-{i+1}' (1-indexed) where prefix is
'scale{N}' or 'scale{N//1000}k' for multiples of 1000. Cross-repo Kusto
dashboards keyed on machine name will now correlate.
H2: Bump azure_eastus_large_machine_pool_staging scale_machine_count from 2
to 200 (the original intent — the previous value only exercised one batch
chunk per worker) and step_time_out from 900 to 1800 to give 200 VMs the
provisioning headroom they need. machine_workers=100 + use_batch_api=true
unchanged, so this produces 2 batch chunks of 100 machines each.
…meout (default 20min)
… baseline-aware readiness
… and node_readiness_time Mirrors the stdout logging that exists in ado-telescope so operators reading build logs see the same wall-clock numbers that already get serialized to the result blob. Two parity gaps: 1. node_readiness_time was never populated on the scale response (always 0 in uploaded JSON). Derive it from the max of the achieved percentile readiness times (closest analogue to ado's percentile_times[100]). 2. No stdout summary for percentile readiness or 'all nodes ready' wall-clock. Add three log lines matching ado azure.py:1497-1534 and a single 'Machine scaling completed in %.2f seconds' line matching ado machine_manager.py:198-201. Also add 'Agent pool creation completed' line for the create path matching ado machine_manager.py:158-161. No behavioral change to the API calls or readiness exit criteria.
…adiness envelope
Extend node readiness measurement from {P50, P90, P99} to
{P50, P70, P90, P99, P100} to match ado-telescope's percentile set and
the existing production Kusto schema for k8s_cluster_crud_machine_main.
Change percentile_node_readiness_times shape from flat float dict to
the ado-parity nested envelope:
{"P{p}": {
"target_nodes": int,
"elapsed_time_seconds": float | None,
"percentage": int,
"success": bool,
}}
This matches the shape ado-telescope writes to blob storage exactly
(verified against perf-eval/k8s-cluster-crud-machine/main/155688977-a37198d8.json).
node_readiness_time is now derived from P100.elapsed_time_seconds (the
true wall-clock until all target nodes are Ready), replacing the
max-of-achieved-percentiles approximation that was used when P100 was
not tracked.
User m0739: 'github telescope fundamentally has a different structure.
instead of trying to preserve the old code, think about how we can fix the
equivalent logic into the same structure to achieve alignment with the
github telescope code.'
Replaces the ado-telescope-shaped `machine/` package with the gh-telescope
CRUD pattern already used for node-pool operations:
1. clients/aks_machine_client.py
- Now `class AKSMachineClient(AKSClient)`: subclass instead of composition.
Inherits credential, ContainerServiceClient, KubernetesClient,
get_cluster_name, get_cluster_data, and the existing node-pool CRUD
methods (so callers can provision a baseline pool via inherited
create_node_pool without a second client).
- Public methods open an OperationContext exactly like
AKSClient.create_node_pool / scale_node_pool: metadata is enriched
with op.add_metadata along the way, success returns True, failures
are logged and re-raised so OperationContext records them.
- Module-level constants and the baseline-aware P50/P70/P90/P99/P100
readiness watcher kept intact.
2. crud/azure/machine_crud.py (new)
- Thin try/except mirror of crud/azure/node_pool_crud.py. All telemetry
happens in AKSMachineClient; this layer translates exceptions into
False so the CLI dispatcher can map to an exit code.
3. crud/main.py
- Adds create-machine / scale-machine subparsers and the matching
handle_machine_operation dispatcher.
- Adds _env_int_override / _env_bool_override helpers that ignore
unresolved ADO $(VAR) substitutions, mirroring how the machine CLI
used to layer env overrides on top of argparse defaults.
4. steps/engine/machine/k8s/{execute,collect}.yml
- Invoke `python -m crud.main create-machine | scale-machine | collect`
instead of the deleted machine.main entry point.
5. Deletes the entire modules/python/machine/ package, its tests, and the
machine_collect_golden fixtures. crud/main.py::collect_benchmark_results
already produces the gh-telescope row shape downstream Kusto ingestion
expects.
Verification
- pylint repo-root on all changed files: 10.00/10.
- pytest --cov=. --cov-fail-under=80 from modules/python:
713 passed, 1 skipped, 80.96% coverage.
- 33 new unit tests cover AKSMachineClient, MachineCRUD, and the new
crud/main.py helpers/dispatcher.
Repo .pylintrc already globally disables: too-many-locals, too-many-arguments, too-many-positional-arguments, too-many-instance-attributes, broad-exception-caught (and aliases like broad-except). The previous commit suppressed these inline in machine API code; remove those redundant annotations. Keep disables that are NOT in the global list: - unused-argument (on tags kwarg) - protected-access (in tests for _get_machine_name_prefix) - import-outside-toplevel (in tests for lazy patched imports) - invalid-name (pre-existing on NodePoolCRUD local alias) pylint 10.00/10. 33 machine-related tests pass.
…1185 PR Azure#1185 merged a per-worker batch contract that diverged from the local feat-branch copy. This commit takes origin/main's final versions of aks_machine_client.py and test_aks_machine_client.py verbatim so the feat branch stops carrying stale parallel copies of the same files.
Pipeline run 67972 logged 'urllib3.connectionpool - WARNING - Connection pool is full, discarding connection: management.azure.com' during the 50-worker individual scale path. urllib3's default HTTPAdapter caps pool_maxsize=10 per host; with machine_workers=50 the surplus connections are torn down and re-handshaked per request, defeating the Session reuse and adding TLS overhead under fan-out. Mount an HTTPAdapter with pool_connections=pool_maxsize=64 on self._session for both https:// and http:// so the pool holds one warm connection per worker thread (covers 50-worker individual + 4-worker batch paths with headroom; well below ARM per-tenant ceilings). Test: assert the mounted adapter advertises pool_connections == pool_maxsize == _HTTPS_POOL_SIZE >= 50. Verified: pytest test_aks_machine_client.py = 28 passed; pylint 10.00/10; full module suite --cov-fail-under=80 = 736 passed, 83.47% cov.
Run 67982 (200-machine batch test) reached P50=120.14s, P70=124.63s, P90=133.52s, P99=146.97s -- all green -- then timed out at P100=1200s because 2 of 200 nodes never reported Ready. All 200 PUTs succeeded (otherwise P99=198 would be unreachable); the tail was kubelet/ image-pull straggler noise, not a control-plane or API failure. Match ado-telescope behavior: hard-gate on P99 instead of P100. P100 is still computed, logged, and emitted to the 'percentile_node_readiness_times' metadata blob for Kusto trending -- it just no longer fails an otherwise-healthy run when a single node boots slowly on a large fan-out. Tests: - test_scale_machine_p99_met_p100_missed_does_not_raise: asserts that a straggler tail (P99 success, P100 miss) does not raise, and that the failed P100 entry is still present in metadata. - test_scale_machine_p99_missed_raises: asserts that a real readiness failure (P99 miss) still raises a RuntimeError naming P99.
…-machine # Conflicts: # modules/python/crud/azure/machine_crud.py # modules/python/crud/main.py
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new scheduled AKS Machine API CRUD benchmark pipeline and the topology/engine/scenario inputs needed to run it through the existing competitive-test flow.
Changes:
- Adds a daily Azure DevOps pipeline for
k8s-cluster-crud-machineincanadacentral. - Introduces topology dispatch templates and Machine API execute/collect engine templates.
- Adds Terraform scenario inputs and local test input JSON for the new benchmark.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
pipelines/perf-eval/Autoscale Benchmark/k8s-cluster-crud-machine.yml |
Defines the scheduled benchmark job and matrix values. |
steps/topology/k8s-cluster-crud-machine/validate-resources.yml |
Updates kubeconfig for the provisioned AKS cluster. |
steps/topology/k8s-cluster-crud-machine/execute-machine.yml |
Routes topology execution to the new Machine engine. |
steps/topology/k8s-cluster-crud-machine/collect-machine.yml |
Routes result collection to the new Machine engine. |
steps/engine/machine/k8s/execute.yml |
Runs create-machine and scale-machine benchmark commands. |
steps/engine/machine/k8s/collect.yml |
Collects Machine API benchmark result files. |
scenarios/perf-eval/k8s-cluster-crud-machine/terraform-inputs/azure.tfvars |
Defines production AKS scenario provisioning inputs. |
scenarios/perf-eval/k8s-cluster-crud-machine/terraform-test-inputs/azure.json |
Adds Terraform test input data for Azure. |
| - name: region | ||
| type: string | ||
|
|
||
| steps: |
| matrix: | ||
| small_machine_pool: | ||
| vm_size: Standard_D2_v3 | ||
| pool_name: smallpool1 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Validation