vshulcz/mpc-autoscaler

mpc-autoscaler

Can a predictive controller scale Kubernetes earlier than a reactive HPA — and where does it fail? A reproducible lab to find out.

[Figure: Ten-second autoscaling loop]

In one tracked 200 rps spike, the MPC controller cut burst p95 latency ~38% (85 → 52 ms) and p99 ~45%, both runs at 100% success.

Then I shortened the spike to 30 seconds — and MPC lost. New Pods became Ready about 40 seconds after the decision, after the spike was already over. The bottleneck wasn't the algorithm; it was the time Kubernetes needs to deliver ready capacity.
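The timing argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the 5-second decision delay are assumptions, and only the ~40 s readiness lag and the 30 s spike length come from the text.

```python
def covered_seconds(spike_s: float, decision_at_s: float, readiness_lag_s: float) -> float:
    """Seconds of the spike during which newly requested Pods are already Ready.

    All times are measured from the start of the spike.
    """
    ready_at_s = decision_at_s + readiness_lag_s
    return max(0.0, spike_s - ready_at_s)


if __name__ == "__main__":
    # Hypothetical decision time: 5 s into the spike. Readiness lag: ~40 s (from the text).
    for spike_s in (30.0, 120.0):
        covered = covered_seconds(spike_s, decision_at_s=5.0, readiness_lag_s=40.0)
        print(f"spike={spike_s:.0f}s covered={covered:.0f}s")
```

With these assumed inputs, a 30 s spike gets no benefit from the new Pods, while a longer spike does. Even a decision made 10 s before the spike starts only has capacity ready exactly as a 30 s spike ends.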

That gap — a correct decision arriving before usable capacity — is what this repo lets you measure, reproduce, and break. It is a lab, not a production autoscaler: a tunable Go workload, an HPA baseline, an MPC controller, an offline simulator, and evidence docs that link every published number to the run that produced it.

Results snapshot

One representative tracked spike pair. See docs/RESULTS.md and docs/LIMITATIONS.md before generalising.

| Controller | Burst throughput | Burst p95 | Burst p99 | Max latency | Success | Max replicas |
| --- | --- | --- | --- | --- | --- | --- |
| HPA60 baseline | 197.91 rps | 85.175 ms | 128.983 ms | 276.229 ms | 100.00% | 27 |
| Hybrid-SA MPC | 199.90 rps | 52.483 ms | 71.048 ms | 97.157 ms | 100.00% | 28 |

On a 30 s spike the same controller loses on tail latency, because readiness lag (~40 s) is longer than the spike — see docs/LIMITATIONS.md.
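The headline percentages can be recomputed from the snapshot numbers. This is a small arithmetic check, not part of the repo's tooling; the helper name is made up for illustration.

```python
def reduction_pct(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Latencies in milliseconds, copied from the results snapshot.
hpa = {"p95": 85.175, "p99": 128.983, "max": 276.229}
mpc = {"p95": 52.483, "p99": 71.048, "max": 97.157}

if __name__ == "__main__":
    for key in hpa:
        print(f"burst {key}: {reduction_pct(hpa[key], mpc[key]):.1f}% lower with MPC")
```

The p95 and p99 values round to the ~38% and ~45% quoted at the top of this README. They describe this one tracked pair only.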

Try it in 60 seconds (no cluster needed)

python3 -m venv .venv && source .venv/bin/activate
pip install -e analysis
mpc-validate-trace --trace-csv analysis/mpc_autoscaler_analysis/data/traces/baseline_spike_profile_dt15.csv
mpc-offline-sim \
  --trace-csv analysis/mpc_autoscaler_analysis/data/traces/baseline_spike_profile_dt15.csv \
  --out-dir analysis/out/offline/spike

This validates the bundled spike trace and runs the offline simulator end to end (writes summary.json, trajectory.csv, and a plot). For staged paths — offline sim, saved evidence, live cluster runs — see docs/REPRODUCIBILITY.md.
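To look at what the simulator wrote without assuming its schema, a sketch like the following may help. It lists only the top-level keys of summary.json and the header of trajectory.csv. The helper name is hypothetical; the file names are the ones mentioned above.

```python
import csv
import json
from pathlib import Path


def inspect_outputs(out_dir: str) -> tuple[list[str], list[str], int]:
    """Return summary.json top-level keys, trajectory.csv columns, and data row count."""
    root = Path(out_dir)
    summary = json.loads((root / "summary.json").read_text(encoding="utf-8"))
    with (root / "trajectory.csv").open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])       # first row is assumed to be the header
        rows = sum(1 for _ in reader)   # remaining rows are data
    keys = sorted(summary) if isinstance(summary, dict) else []
    return keys, header, rows


if __name__ == "__main__":
    keys, header, rows = inspect_outputs("analysis/out/offline/spike")
    print("summary keys:", keys)
    print("trajectory columns:", header, "rows:", rows)
```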

Prefer not to install anything? The browser demo renders the same trajectory inline and links every number back to a reproducible source.

Contribute

You do not need a Kubernetes cluster for many useful first contributions.

Useful areas: controller comparators, traffic traces, dashboard panels, artifact parsers, Kubernetes portability, and reproducibility docs. See ROADMAP.md for directions.

Recent external contributors: @dicnunz, @tatakaisun, @ayushkli86, @msaqibatifj, @mahek56, @kunal-9090. Merged PRs are credited in release notes.

What's inside

flowchart LR
  Load[Load profiles] --> Workload[toy-load /work]
  Workload --> Metrics[Prometheus metrics]
  Metrics --> HPA[HPA baseline]
  Metrics --> MPC[MPC controller]
  HPA --> Scale[Kubernetes scale target]
  MPC --> Scale
  Scale --> Workload
  Runs[Run artifacts] --> Analysis[Analysis package]
  Analysis --> Results[Summaries and figures]
  • toy-load/ — a standalone Go module: controllable HTTP workload with Prometheus metrics, Helm chart, manifests, and a GHCR release image.
  • analysis/ — offline simulator, online MPC controller, grid-search tooling, and artifact summaries.
  • deploy/, dashboards/, loadgen/ — ArgoCD apps, monitoring manifests, Grafana dashboards, and repeatable load runners.
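To make the reactive-versus-predictive contrast concrete, here is a minimal sketch. The first function is the scaling rule from the Kubernetes HPA documentation, without its tolerance and stabilization behaviour. The second is a deliberately naive lookahead rule, not the MPC formulation used in this repo (see docs/METHODOLOGY.md for that). The per-Pod capacity and the forecast values are assumptions.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Reactive rule: scale in proportion to the currently observed metric."""
    return math.ceil(current_replicas * current_metric / target_metric)


def lookahead_replicas(forecast_rps: list[float], rps_per_pod: float, min_replicas: int = 1) -> int:
    """Toy predictive rule: size for the highest forecast load within the horizon."""
    peak = max(forecast_rps)
    return max(min_replicas, math.ceil(peak / rps_per_pod))


if __name__ == "__main__":
    # Reactive: 4 Pods at 90% utilisation against a 60% target.
    print(hpa_desired_replicas(4, current_metric=90.0, target_metric=60.0))
    # Predictive: hypothetical forecast that peaks at 200 rps, 10 rps per Pod assumed.
    print(lookahead_replicas([50.0, 120.0, 200.0], rps_per_pod=10.0))
```

The reactive rule can only respond once the metric has already moved. The lookahead rule can ask for capacity earlier, but only if the forecast is right and the Pods become Ready in time.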

Supported scenarios: step (sustained increase), spike (short burst), seasonality (smooth sinusoidal).
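The three shapes can be written as simple functions of time. This is an illustrative sketch with made-up parameters, not the load profiles shipped under loadgen/.

```python
import math


def step(t: float, base: float, high: float, start_s: float) -> float:
    """Sustained increase: base rate until start_s, then high for the rest of the run."""
    return high if t >= start_s else base


def spike(t: float, base: float, high: float, start_s: float, duration_s: float) -> float:
    """Short burst: high rate only during [start_s, start_s + duration_s)."""
    return high if start_s <= t < start_s + duration_s else base


def seasonality(t: float, mean: float, amplitude: float, period_s: float) -> float:
    """Smooth sinusoidal load around a mean rate."""
    return mean + amplitude * math.sin(2.0 * math.pi * t / period_s)


if __name__ == "__main__":
    # Hypothetical rates in requests per second, sampled every 15 s.
    for t in range(0, 121, 15):
        print(t, step(t, 50, 200, 60), spike(t, 50, 200, 60, 30),
              round(seasonality(t, 100, 50, 600), 1))
```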

Where to go next

| Document | What it covers |
| --- | --- |
| docs/MPC_VS_HPA_60_SECONDS.md | Fast technical walkthrough: problem, current spike result, trust boundary. |
| docs/RESULTS.md | Exact numbers, evidence paths, caveats, rebuild commands. |
| docs/METHODOLOGY.md | Experiment design and the MPC formulation. |
| docs/LIMITATIONS.md | What these numbers do not prove. |
| docs/BENCHMARK_MATRIX.md | Alias for RESULTS.md (kept for legacy links). |
| docs/ARCHITECTURE.md | Component boundaries, data flow, extension points. |
| docs/API.md | Public contracts and scripting surfaces. |
| docs/REPRODUCIBILITY.md | Staged reproduction paths. |
| docs/DEMO.md | Ten-second demo storyboard. |
| Docs site · Roadmap board | Hosted overview and tracked work. |

Prerequisites

For offline work: Go 1.25, Python 3.11+. For live experiments also: Docker, kubectl, Helm, and a Kubernetes cluster. (vegeta optional for local load.)

Running experiments on a cluster

Deploy the workload:

helm upgrade --install toy-load toy-load/deploy/helm/toy-load \
  --namespace default --create-namespace
# or: kubectl apply -f toy-load/deploy/manifests

The monitoring manifests require the Prometheus and Grafana Operator CRDs to be installed first; apply the manifests with kubectl apply -k deploy/monitoring. ArgoCD apps live under deploy/argocd/. The chart defaults to ghcr.io/vshulcz/toy-load:main; pin a specific image with --set image.tag=<commit-or-release-tag>.

Run experiments:

bash loadgen/scripts/run_hpa_experiment_incluster.sh step   # HPA baseline
bash loadgen/scripts/run_mpc_experiment_incluster.sh step    # MPC controller
bash loadgen/scripts/run_hpa_mpc_batch.sh [N_MPC [N_HPA]]     # matched batch
bash loadgen/scripts/run_mpc_v3_batch.sh all                 # calibrated MPC-only batch

Summarize a run:

mpc-summarize-run --run-dir experiments/_runs/mpc-online/step/<run-id> \
  --out-phase-csv /tmp/step_phases.csv --out-control-csv /tmp/step_control.csv

Artifacts are written to experiments/_runs/, which is git-ignored. Curated evidence also stays git-ignored under experiments/; the repo commits only lightweight indices.

Local development

make toy-load-run
curl http://localhost:9090/healthz
curl "http://localhost:9090/work?cpu_ms=10&jitter_ms=5"
curl http://localhost:9090/metrics

Useful targets: make help, make fmt, make check, make coverage. make check runs the toy-load checks used in CI (gofmt, go vet, tests, Helm lint, Helm template).

Observability, CI, releases, supply-chain

Metrics. toy-load exports toy_http_requests_total, toy_http_request_duration_seconds, toy_in_flight_requests, toy_work_cpu_seconds, toy_errors_total, and toy_panics_total. PromQL examples are in docs/API.md.
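As a rough illustration of how these metrics might be queried, the sketch below builds two PromQL strings. It assumes toy_http_request_duration_seconds is a Prometheus histogram (so a _bucket series exists), which this README does not state; docs/API.md holds the maintained examples. The helper names are hypothetical.

```python
def latency_quantile_query(quantile: float, window: str = "1m") -> str:
    """PromQL for a latency quantile, assuming a histogram with a _bucket series."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (rate(toy_http_request_duration_seconds_bucket[{window}])))"
    )


def request_rate_query(window: str = "1m") -> str:
    """PromQL for the total request rate across all Pods."""
    return f"sum(rate(toy_http_requests_total[{window}]))"


if __name__ == "__main__":
    print(latency_quantile_query(0.95))
    print(request_rate_query())
```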

CI. gofmt, go vet, go test (race), Go + Python coverage, packaged analysis install, shell + actionlint, dashboard/Helm schema validation, Helm lint + template, Kustomize render.

Security. CodeQL (Go + Python), govulncheck, Trivy fs + image scans, OpenSSF Scorecard, dependency review. All third-party Actions pinned by SHA.

Releases. Tag-driven via docs/RELEASE.md. Images publish to ghcr.io/vshulcz/toy-load with SBOM + provenance. Verify downloads against SHA256SUMS.

Status badges: Release · Pages · Security · CodeQL · Trivy · OpenSSF Scorecard

License

Apache License 2.0. See LICENSE. Security reporting: SECURITY.md. Support: SUPPORT.md.