Can a predictive controller scale Kubernetes earlier than a reactive HPA — and where does it fail? A reproducible lab to find out.
In one tracked 200 rps spike, the MPC controller cut burst p95 latency ~38% (85 → 52 ms) and p99 ~45%, both runs at 100% success.
Then I shortened the spike to 30 seconds — and MPC lost. New Pods became Ready about 40 seconds after the decision, after the spike was already over. The bottleneck wasn't the algorithm; it was the time Kubernetes needs to deliver ready capacity.
That gap — a correct decision arriving before usable capacity — is what this repo lets you measure, reproduce, and break. It is a lab, not a production autoscaler: a tunable Go workload, an HPA baseline, an MPC controller, an offline simulator, and evidence docs that link every published number to the run that produced it.
The table below shows one representative tracked spike pair. See docs/RESULTS.md and docs/LIMITATIONS.md before generalising.
| Controller | Burst throughput | Burst p95 | Burst p99 | Max latency | Success | Max replicas |
|---|---|---|---|---|---|---|
| HPA60 baseline | 197.91 rps | 85.175 ms | 128.983 ms | 276.229 ms | 100.00% | 27 |
| Hybrid-SA MPC | 199.90 rps | 52.483 ms | 71.048 ms | 97.157 ms | 100.00% | 28 |
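
The headline percentages follow directly from the table rows. A minimal sketch of that arithmetic, using only the latency values printed above (the helper name `pct_reduction` is illustrative and not part of the repo):

```python
def pct_reduction(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Latencies in milliseconds, copied from the table above.
hpa = {"p95": 85.175, "p99": 128.983, "max": 276.229}
mpc = {"p95": 52.483, "p99": 71.048, "max": 97.157}

for key in hpa:
    print(f"{key}: {pct_reduction(hpa[key], mpc[key]):.1f}% lower under MPC")
# p95: 38.4% lower under MPC
# p99: 44.9% lower under MPC
# max: 64.8% lower under MPC
```

These are ratios for a single run pair, so they describe this comparison rather than an expected effect size.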
On a 30 s spike the same controller loses on tail latency, because readiness lag (~40 s) is longer than the spike — see docs/LIMITATIONS.md.
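
Why a 40 s readiness lag defeats a 30 s spike can be sketched with a few lines of arithmetic. This is a deliberately simplified model, not code from the repo: it assumes all new Pods become Ready at one instant, `covered_seconds` is a hypothetical helper, and the 120 s spike length is an illustrative value only.

```python
def covered_seconds(spike_start: float, spike_len: float,
                    decision_at: float, readiness_lag: float) -> float:
    """Seconds of the spike during which the newly requested Pods are Ready.

    Simplification: every new Pod becomes Ready exactly `readiness_lag`
    seconds after the scaling decision.
    """
    ready_at = decision_at + readiness_lag
    spike_end = spike_start + spike_len
    return max(0.0, spike_end - max(ready_at, spike_start))


# Decision made the moment a 30 s spike starts, 40 s readiness lag.
print(covered_seconds(0, 30, 0, 40))    # 0.0  (capacity arrives after the spike)

# Same lag, hypothetical 120 s spike.
print(covered_seconds(0, 120, 0, 40))   # 80.0

# 30 s spike again, but the decision is made 40 s before it starts.
print(covered_seconds(0, 30, -40, 40))  # 30.0
```

Under this model a controller helps on a short spike only if its lead time is at least comparable to the readiness lag, regardless of how good the decision itself is.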
python3 -m venv .venv && source .venv/bin/activate
pip install -e analysis
mpc-validate-trace --trace-csv analysis/mpc_autoscaler_analysis/data/traces/baseline_spike_profile_dt15.csv
mpc-offline-sim \
  --trace-csv analysis/mpc_autoscaler_analysis/data/traces/baseline_spike_profile_dt15.csv \
  --out-dir analysis/out/offline/spike

This validates the bundled spike trace and runs the offline simulator end to end (writes summary.json, trajectory.csv, and a plot). For staged paths (offline sim, saved evidence, live cluster runs), see docs/REPRODUCIBILITY.md.
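
To check what the simulator wrote without assuming anything about its schema, a small inspection script is enough. The sketch below relies only on the two file names mentioned above; `describe_run` is a hypothetical helper, and it reports whatever keys and columns it finds rather than expecting specific ones.

```python
import csv
import json
from pathlib import Path


def describe_run(out_dir: str) -> dict:
    """Report top-level summary keys plus trajectory columns and row count."""
    out = Path(out_dir)
    summary = json.loads((out / "summary.json").read_text())
    with (out / "trajectory.csv").open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])        # first row is treated as the header
        n_rows = sum(1 for _ in reader)  # remaining rows are data
    keys = sorted(summary) if isinstance(summary, dict) else []
    return {"summary_keys": keys, "columns": header, "rows": n_rows}


if __name__ == "__main__":
    run_dir = Path("analysis/out/offline/spike")
    # Only run when the simulator output from the command above exists.
    if (run_dir / "summary.json").exists():
        print(describe_run(str(run_dir)))
```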
Prefer not to install anything? The browser demo renders the same trajectory inline and links every number back to a reproducible source.
You do not need a Kubernetes cluster for many useful first contributions.
- Make a small verified PR: pick a good first issue, then read CONTRIBUTING.md.
- Challenge the methodology: read the 60-second walkthrough and tell me which assumption makes the comparison least convincing, via the Q&A thread or a reproduction report.
Useful areas: controller comparators, traffic traces, dashboard panels, artifact parsers, Kubernetes portability, and reproducibility docs. See ROADMAP.md for directions.
Recent external contributors: @dicnunz, @tatakaisun, @ayushkli86, @msaqibatifj, @mahek56, @kunal-9090. Merged PRs are credited in release notes.
flowchart LR
Load[Load profiles] --> Workload[toy-load /work]
Workload --> Metrics[Prometheus metrics]
Metrics --> HPA[HPA baseline]
Metrics --> MPC[MPC controller]
HPA --> Scale[Kubernetes scale target]
MPC --> Scale
Scale --> Workload
Runs[Run artifacts] --> Analysis[Analysis package]
Analysis --> Results[Summaries and figures]
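
Both controllers in the diagram read the same metrics and write to the same scale target; they differ in when they react. The sketch below contrasts the core replica rule from the Kubernetes HPA documentation with a toy one-step extrapolation. The second function is a hypothetical illustration of "predictive" sizing and is not the repo's MPC formulation (see docs/METHODOLOGY.md for that); `rps_per_replica` and the sample numbers are assumptions.

```python
import math


def hpa_desired(current_replicas: int, current_util: float, target_util: float) -> int:
    """Documented HPA core rule: ceil(current * currentMetric / targetMetric).

    Tolerance bands and stabilization windows are omitted here.
    """
    return max(1, math.ceil(current_replicas * current_util / target_util))


def forecast_desired(rps_history: list[float], horizon_steps: int,
                     rps_per_replica: float) -> int:
    """Toy predictive rule: extend the last observed slope, size for the forecast."""
    if len(rps_history) < 2:
        forecast = rps_history[-1]
    else:
        slope = rps_history[-1] - rps_history[-2]
        forecast = max(0.0, rps_history[-1] + slope * horizon_steps)
    return max(1, math.ceil(forecast / rps_per_replica))


# Reactive: 10 replicas at 90% utilisation against a 60% target.
print(hpa_desired(10, 90, 60))                 # 15

# Predictive: load rose from 50 to 100 rps, look two steps ahead.
print(forecast_desired([50.0, 100.0], 2, 10))  # 20
```

The reactive rule only sees utilisation that has already risen, while the toy forecast sizes for load that has not arrived yet, which is also why it can overshoot when the trend does not continue.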
- toy-load/: a standalone Go module with a controllable HTTP workload, Prometheus metrics, a Helm chart, manifests, and a GHCR release image.
- analysis/: offline simulator, online MPC controller, grid-search tooling, and artifact summaries.
- deploy/, dashboards/, loadgen/: ArgoCD apps, monitoring manifests, Grafana dashboards, and repeatable load runners.
Supported scenarios: step (sustained increase), spike (short burst), seasonality (smooth sinusoidal).
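
The three scenario shapes can be pictured as simple functions of time. This is a sketch for intuition only: it does not reproduce the repo's trace files or loadgen scripts, all function names and default values are hypothetical (only the 200 rps peak and the 30 s spike length echo numbers quoted earlier), and the 15 s sampling step assumes that is what `dt15` in the bundled trace name means.

```python
import math


def step(t: float, base: float = 50.0, high: float = 200.0, at: float = 60.0) -> float:
    """Sustained increase: load jumps at `at` and stays high."""
    return high if t >= at else base


def spike(t: float, base: float = 50.0, high: float = 200.0,
          at: float = 60.0, length: float = 30.0) -> float:
    """Short burst: load is high only for `length` seconds."""
    return high if at <= t < at + length else base


def seasonality(t: float, mean: float = 100.0, amplitude: float = 50.0,
                period: float = 600.0) -> float:
    """Smooth sinusoidal load around `mean`."""
    return mean + amplitude * math.sin(2 * math.pi * t / period)


def sample(profile, duration: float = 300.0, dt: float = 15.0) -> list[tuple[float, float]]:
    """Sample a profile every `dt` seconds as (time_s, rps) pairs."""
    n = int(duration // dt)
    return [(i * dt, profile(i * dt)) for i in range(n + 1)]


for name, fn in (("step", step), ("spike", spike), ("seasonality", seasonality)):
    print(name, [round(rps) for _, rps in sample(fn, duration=120.0)])
```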
| Document | What it covers |
|---|---|
| docs/MPC_VS_HPA_60_SECONDS.md | Fast technical walkthrough: problem, current spike result, trust boundary. |
| docs/RESULTS.md | Exact numbers, evidence paths, caveats, rebuild commands. |
| docs/METHODOLOGY.md | Experiment design and the MPC formulation. |
| docs/LIMITATIONS.md | What these numbers do not prove. |
| docs/BENCHMARK_MATRIX.md | Alias for RESULTS.md (kept for legacy links). |
| docs/ARCHITECTURE.md | Component boundaries, data flow, extension points. |
| docs/API.md | Public contracts and scripting surfaces. |
| docs/REPRODUCIBILITY.md | Staged reproduction paths. |
| docs/DEMO.md | Ten-second demo storyboard. |
| Docs site · Roadmap board | Hosted overview and tracked work. |
Offline work needs Go 1.25 and Python 3.11+. Live experiments additionally need Docker, kubectl, Helm, and a Kubernetes cluster. vegeta is optional, for generating local load.
Running experiments on a cluster
Deploy the workload:
helm upgrade --install toy-load toy-load/deploy/helm/toy-load \
--namespace default --create-namespace
# or: kubectl apply -f toy-load/deploy/manifests

Monitoring manifests require Prometheus/Grafana Operator CRDs: kubectl apply -k deploy/monitoring. ArgoCD apps live under deploy/argocd/. The chart defaults to ghcr.io/vshulcz/toy-load:main; pin with --set image.tag=<commit-or-release-tag>.
Run experiments:
bash loadgen/scripts/run_hpa_experiment_incluster.sh step # HPA baseline
bash loadgen/scripts/run_mpc_experiment_incluster.sh step # MPC controller
bash loadgen/scripts/run_hpa_mpc_batch.sh [N_MPC [N_HPA]] # matched batch
bash loadgen/scripts/run_mpc_v3_batch.sh all # calibrated MPC-only batch

Summarize a run:
mpc-summarize-run --run-dir experiments/_runs/mpc-online/step/<run-id> \
  --out-phase-csv /tmp/step_phases.csv --out-control-csv /tmp/step_control.csv

Artifacts are written to the ignored experiments/_runs/ directory. Curated evidence stays ignored under experiments/; the repo commits only lightweight indices.
Local development
make toy-load-run
curl http://localhost:9090/healthz
curl "http://localhost:9090/work?cpu_ms=10&jitter_ms=5"
curl http://localhost:9090/metrics

Useful targets: make help, make fmt, make check, make coverage. make check runs the toy-load checks used in CI (gofmt, go vet, tests, Helm lint, Helm template).
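
For scripted checks against the local workload, the same `/work` request can be issued from Python with the standard library. This is a minimal sketch, assuming the server from `make toy-load-run` is listening on localhost:9090 as in the curl examples; `work_url` and `time_request` are hypothetical helpers, and only the two query parameters shown above are used.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def work_url(base: str = "http://localhost:9090",
             cpu_ms: int = 10, jitter_ms: int = 5) -> str:
    """Build the /work URL with the query parameters from the curl example."""
    return f"{base}/work?{urlencode({'cpu_ms': cpu_ms, 'jitter_ms': jitter_ms})}"


def time_request(url: str, timeout: float = 5.0) -> tuple[int, float]:
    """Send one GET and return (HTTP status, wall-clock latency in ms)."""
    start = time.perf_counter()
    with urlopen(url, timeout=timeout) as resp:
        resp.read()
        status = resp.status
    return status, (time.perf_counter() - start) * 1000.0


print(work_url())  # http://localhost:9090/work?cpu_ms=10&jitter_ms=5
# With the workload running locally:
#   status, latency_ms = time_request(work_url())
```

Latency measured this way includes client overhead, so it is suited to smoke checks rather than to the published comparisons.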
Observability, CI, releases, supply-chain
Metrics. toy-load exports toy_http_requests_total, toy_http_request_duration_seconds, toy_in_flight_requests, toy_work_cpu_seconds, toy_errors_total, and toy_panics_total. PromQL examples in docs/API.md.
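
Percentiles such as p95 are typically estimated from a duration histogram by interpolating inside the bucket that contains the target rank. The sketch below mirrors the linear interpolation that PromQL's `histogram_quantile` performs. It assumes `toy_http_request_duration_seconds` is exported as a Prometheus histogram, which the text above does not state, and the bucket bounds and counts are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound,
    with the last bound being +Inf, as in a Prometheus histogram.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # fall back to the highest finite bound
            if count == lower_count:
                return lower_bound  # empty bucket, nothing to interpolate
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative bucket counts for 100 requests.
example = [(0.025, 40), (0.05, 80), (0.1, 98), (0.25, 100), (math.inf, 100)]
print(f"p50 = {histogram_quantile(0.50, example) * 1000:.2f} ms")  # p50 = 31.25 ms
print(f"p95 = {histogram_quantile(0.95, example) * 1000:.2f} ms")  # p95 = 91.67 ms
```

Because the estimate is interpolated within a bucket, its resolution is bounded by the bucket layout, which is worth keeping in mind when comparing tail latencies that fall in the same bucket.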
CI. gofmt, go vet, go test (race), Go + Python coverage, packaged analysis install, shell + actionlint, dashboard/Helm schema validation, Helm lint + template, Kustomize render.
Security. CodeQL (Go + Python), govulncheck, Trivy fs + image scans, OpenSSF Scorecard, dependency review. All third-party Actions pinned by SHA.
Releases. Tag-driven via docs/RELEASE.md. Images publish to ghcr.io/vshulcz/toy-load with SBOM + provenance. Verify downloads against SHA256SUMS.
Status badges: Release · Pages · Security · CodeQL · Trivy · OpenSSF Scorecard
Apache License 2.0. See LICENSE. Security reporting: SECURITY.md. Support: SUPPORT.md.
