ClusterMesh scale: Phase 3 + 4 - scale tiers + parallel CL2 fan-out + add all scenarios #1168

Draft
skosuri1 wants to merge 127 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2

skosuri1 commented May 6, 2026

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

  • 20-node baseline cluster size (spec line 24). Current clusters are 3 nodes (2 default + 1 prompool) — sized for harness validation, not real scale measurement.
  • Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, azure-20.tfvars and the corresponding pipeline matrix entries. For each tier: validate quota, validate the peering count (N·(N-1) in separate-VNet mode, i.e. 380 at N=20), tune CL2 timeouts, and document breaking points.
  • Parallel CL2 fan-out: replace the sequential per-cluster CL2 runs with bounded-concurrency execution (default 4). Requires async wrapping of utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming that the AzDO agent has CPU/RAM headroom for N concurrent CL2 runs plus Prometheus.
  • etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 clusters = 560 watchers; verify that the Prometheus scrape budget holds.
  • Scaling-curve dashboards from cluster-attributed results (Kusto).
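
For illustration, the tier arithmetic and the bounded fan-out described above could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `fan_out`, `run_one`, `timeout_s`, `TIMEOUT_RC`, and `fake_cl2` are hypothetical names, and `run_one` stands in for a synchronous per-cluster call such as utils.run_cl2_command, whose real signature is not shown here. It assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time
from typing import Callable, Dict, Iterable, Optional

TIMEOUT_RC = -1  # hypothetical sentinel meaning "this cluster's run timed out"


def peering_count(n_clusters: int) -> int:
    """Peering count for a full mesh in separate-VNet mode: N * (N - 1)."""
    return n_clusters * (n_clusters - 1)


def total_watchers(n_clusters: int, per_cluster: int = 28) -> int:
    """Total watchers, using the 28-per-cluster figure from the description."""
    return n_clusters * per_cluster


async def fan_out(
    cluster_names: Iterable[str],
    run_one: Callable[[str], int],
    max_concurrency: int = 4,
    timeout_s: Optional[float] = None,
) -> Dict[str, int]:
    """Run a synchronous per-cluster callable with bounded concurrency.

    Each call runs in a worker thread; the semaphore caps how many are
    awaited at once. Exceptions raised by run_one propagate to the caller.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(name: str):
        async with sem:
            try:
                rc = await asyncio.wait_for(
                    asyncio.to_thread(run_one, name), timeout_s
                )
            except asyncio.TimeoutError:
                # The underlying thread cannot be cancelled and keeps running,
                # so a timed-out run may briefly overlap with the next one.
                rc = TIMEOUT_RC
        return name, rc

    pairs = await asyncio.gather(*(worker(n) for n in cluster_names))
    return dict(pairs)


def fake_cl2(name: str) -> int:
    """Hypothetical stand-in for a synchronous CL2 invocation."""
    time.sleep(0.01)
    return 0


if __name__ == "__main__":
    names = [f"mesh-{i}" for i in range(1, 21)]
    print(peering_count(len(names)), total_watchers(len(names)))
    print(asyncio.run(fan_out(names, fake_cl2, max_concurrency=4)))
```

Threads are used here because the wrapped call is synchronous; if the real CL2 invocation is a subprocess, `asyncio.create_subprocess_exec` might be a better fit since a stuck process can then be killed on timeout.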

Out of Scope (deferred to later phases / pre-merge of #1157)

skosuri and others added 30 commits May 6, 2026 13:59
…idn't fix root cause); fix n5 condition syntax
… referenced it but variables.tf didn't declare)
skosuri added 30 commits May 19, 2026 20:27
….0/8, 200 subnets, 100 AKS at 10xD4_v3) + condition:false dev-pipeline stage
…gation on all 100 pod subnets (forgot in initial gen; matches commit 0c0677e for peered tfvars)
…n_apply_failure (skip if profile already exists)
…_SECONDS) so a stuck CL2 doesn't block all 100 workers; +2 tests
…red-VNet 100 concurrent creates hit Azure per-VNet subnet PUT serialization; build 67774 evidence)
…eededState; aks_wait_succeeded fail-fast on terminal Failed (build 67775 evidence: 17% fail rate at parallelism=8)
…8 evidence: VirtualNetworkNotInSucceededState leaves cluster half-created; AlreadyExists on retry blocks recovery)
… 'already exists' match (build 67798: 99/100 clusters succeeded, only mesh-72 blocked by terraform-retry hitting AlreadyExists with CamelCase-only regex)
…L2 plumbing + n2 4-cell smoke stage (condition:false)
… stages + single-scenario soft-fail in execute.yml
…ing + test_type_suffix mechanism for Kusto cell discrimination in share-infra mode
…67959, Global Services scales 0/4/12/20 across g0/g20/g60/g100)
…_group_name + nic_public_ip_associations + nsr_rules) to azure-20-shared + azure-50-shared (build 67967 failed apply on var.network_config_list validation)
…ile list-members + AKS state dump during wait-for-apiserver)
…=10 default is more aggressive than what broke N=100 at p=8)