mpc-autoscaler

Can a predictive controller scale Kubernetes earlier than a reactive HPA — and where does it fail? A reproducible lab to find out.

In one tracked 200 rps spike, the MPC controller cut burst p95 latency ~38% (85 → 52 ms) and p99 ~45%, both runs at 100% success.

Then I shortened the spike to 30 seconds — and MPC lost. New Pods became Ready about 40 seconds after the decision, after the spike was already over. The bottleneck wasn't the algorithm; it was the time Kubernetes needs to deliver ready capacity.

That gap — a correct decision arriving before usable capacity — is what this repo lets you measure, reproduce, and break. It is a lab, not a production autoscaler: a tunable Go workload, an HPA baseline, an MPC controller, an offline simulator, and evidence docs that link every published number to the run that produced it.

Results snapshot

One representative tracked spike pair. See docs/RESULTS.md and docs/LIMITATIONS.md before generalising.

| Controller | Burst throughput | Burst p95 | Burst p99 | Max latency | Success | Max replicas |
| --- | --- | --- | --- | --- | --- | --- |
| HPA60 baseline | 197.91 rps | 85.175 ms | 128.983 ms | 276.229 ms | 100.00% | 27 |
| Hybrid-SA MPC | 199.90 rps | 52.483 ms | 71.048 ms | 97.157 ms | 100.00% | 28 |
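
To make the headline percentages checkable, here is a minimal sketch that recomputes the relative latency reductions from the table values. The dictionary names and the `relative_reduction` helper are illustrative and not part of the analysis package.

```python
# Latency values in milliseconds, copied from the results table.
hpa = {"p95": 85.175, "p99": 128.983, "max": 276.229}
mpc = {"p95": 52.483, "p99": 71.048, "max": 97.157}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of candidate relative to baseline."""
    return (baseline - candidate) / baseline


reductions = {name: relative_reduction(hpa[name], mpc[name]) for name in hpa}
for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
# p95: 38.4%
# p99: 44.9%
# max: 64.8%
```

These match the rounded ~38% and ~45% figures quoted for this single run pair.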

On a 30 s spike the same controller loses on tail latency, because readiness lag (~40 s) is longer than the spike — see docs/LIMITATIONS.md.
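
The arithmetic behind that failure mode can be sketched in a few lines. This is a simplified model, not the repo's simulator: it assumes the scaling decision is made the instant the spike starts (`decision_delay_s=0.0`, a best case) and uses the ~40 s readiness lag quoted above. The 60 s and 120 s durations are hypothetical comparison points.

```python
def useful_seconds(spike_s: float, decision_delay_s: float, readiness_lag_s: float) -> float:
    """Seconds of the spike during which newly requested Pods are already Ready."""
    ready_at = decision_delay_s + readiness_lag_s
    return max(0.0, spike_s - ready_at)


for spike_s in (30, 60, 120):
    covered = useful_seconds(spike_s, decision_delay_s=0.0, readiness_lag_s=40.0)
    print(f"{spike_s:>4} s spike: {covered:.0f} s covered ({covered / spike_s:.0%})")
```

With a 30 s spike the covered time is zero: even a perfect prediction cannot help once the burst is shorter than the time to deliver Ready capacity.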

Try it in 60 seconds (no cluster needed)

python3 -m venv .venv && source .venv/bin/activate
pip install -e analysis
mpc-validate-trace --trace-csv analysis/mpc_autoscaler_analysis/data/traces/baseline_spike_profile_dt15.csv
mpc-offline-sim \
  --trace-csv analysis/mpc_autoscaler_analysis/data/traces/baseline_spike_profile_dt15.csv \
  --out-dir analysis/out/offline/spike

This validates the bundled spike trace and runs the offline simulator end to end (writes summary.json, trajectory.csv, and a plot). For staged paths — offline sim, saved evidence, live cluster runs — see docs/REPRODUCIBILITY.md.
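
For a quick look at what the simulator wrote, a minimal sketch like the following may help. It relies only on the file names mentioned above (summary.json and trajectory.csv). Since the schema is not described in this README, it assumes summary.json holds a top-level JSON object and only lists keys and column names rather than interpreting them. `inspect_run` is a hypothetical helper, not a command shipped by the package.

```python
import csv
import json
from pathlib import Path


def inspect_run(out_dir: str) -> dict:
    """Return the summary's top-level keys plus the trajectory's columns and row count."""
    base = Path(out_dir)
    # Assumption: summary.json is a JSON object (dict) at the top level.
    summary = json.loads((base / "summary.json").read_text())
    with (base / "trajectory.csv").open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        rows = sum(1 for _ in reader)
    return {"summary_keys": sorted(summary), "columns": header, "rows": rows}


# Usage, after running mpc-offline-sim with the --out-dir shown above:
# print(inspect_run("analysis/out/offline/spike"))
```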

Prefer not to install anything? The browser demo renders the same trajectory inline and links every number back to a reproducible source.

Contribute

You do not need a Kubernetes cluster for many useful first contributions.

Make a small verified PR — pick a good first issue, then read CONTRIBUTING.md.
Challenge the methodology — read the 60-second walkthrough and tell me which assumption makes the comparison least convincing, via the Q&A thread or a reproduction report.

Useful areas: controller comparators, traffic traces, dashboard panels, artifact parsers, Kubernetes portability, and reproducibility docs. See ROADMAP.md for directions.

Recent external contributors: @dicnunz, @tatakaisun, @ayushkli86, @msaqibatifj, @mahek56, @kunal-9090. Merged PRs are credited in release notes.

What's inside

flowchart LR
  Load[Load profiles] --> Workload[toy-load /work]
  Workload --> Metrics[Prometheus metrics]
  Metrics --> HPA[HPA baseline]
  Metrics --> MPC[MPC controller]
  HPA --> Scale[Kubernetes scale target]
  MPC --> Scale
  Scale --> Workload
  Runs[Run artifacts] --> Analysis[Analysis package]
  Analysis --> Results[Summaries and figures]

toy-load/ — a standalone Go module: controllable HTTP workload with Prometheus metrics, Helm chart, manifests, and a GHCR release image.
analysis/ — offline simulator, online MPC controller, grid-search tooling, and artifact summaries.
deploy/, dashboards/, loadgen/ — ArgoCD apps, monitoring manifests, Grafana dashboards, and repeatable load runners.

Supported scenarios: step (sustained increase), spike (short burst), seasonality (smooth sinusoidal).
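
As a rough illustration of the three shapes, here is a sketch of each as a function from time (seconds) to requests per second. All parameter values are hypothetical except the 200 rps peak and the 30 s spike length, which echo numbers mentioned earlier. The `dt=15` default assumes the `dt15` suffix in the bundled trace name denotes a 15 s sample interval. The repo's actual load profiles may be defined differently.

```python
import math


def step(t, base=50.0, high=200.0, start=60.0):
    """Sustained increase: load jumps at `start` and stays high."""
    return high if t >= start else base


def spike(t, base=50.0, high=200.0, start=60.0, duration=30.0):
    """Short burst: load is high only during [start, start + duration)."""
    return high if start <= t < start + duration else base


def seasonality(t, mean=100.0, amplitude=50.0, period=600.0):
    """Smooth sinusoidal load around `mean`."""
    return mean + amplitude * math.sin(2 * math.pi * t / period)


def sample(profile, total_s=300, dt=15):
    """Sample a profile every `dt` seconds from 0 to `total_s` inclusive."""
    return [(t, profile(t)) for t in range(0, total_s + 1, dt)]


for name, profile in (("step", step), ("spike", spike), ("seasonality", seasonality)):
    print(name, [round(rps) for _, rps in sample(profile)][:8])
```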

Where to go next

| Document | What it covers |
| --- | --- |
| `docs/MPC_VS_HPA_60_SECONDS.md` | Fast technical walkthrough: problem, current spike result, trust boundary. |
| `docs/RESULTS.md` | Exact numbers, evidence paths, caveats, rebuild commands. |
| `docs/METHODOLOGY.md` | Experiment design and the MPC formulation. |
| `docs/LIMITATIONS.md` | What these numbers do not prove. |
| `docs/BENCHMARK_MATRIX.md` | Alias for `RESULTS.md` (kept for legacy links). |
| `docs/ARCHITECTURE.md` | Component boundaries, data flow, extension points. |
| `docs/API.md` | Public contracts and scripting surfaces. |
| `docs/REPRODUCIBILITY.md` | Staged reproduction paths. |
| `docs/DEMO.md` | Ten-second demo storyboard. |
| Docs site · Roadmap board | Hosted overview and tracked work. |

Prerequisites

For offline work: Go 1.25 and Python 3.11+. For live experiments you also need Docker, kubectl, Helm, and a Kubernetes cluster. vegeta is optional and only used for generating load locally.

Running experiments on a cluster

Deploy the workload:

helm upgrade --install toy-load toy-load/deploy/helm/toy-load \
  --namespace default --create-namespace
# or: kubectl apply -f toy-load/deploy/manifests

The monitoring manifests require the Prometheus and Grafana Operator CRDs to be installed in the cluster; apply the manifests with kubectl apply -k deploy/monitoring. ArgoCD apps live under deploy/argocd/. The chart defaults to ghcr.io/vshulcz/toy-load:main; pin a specific image with --set image.tag=<commit-or-release-tag>.

Run experiments:

bash loadgen/scripts/run_hpa_experiment_incluster.sh step   # HPA baseline
bash loadgen/scripts/run_mpc_experiment_incluster.sh step    # MPC controller
bash loadgen/scripts/run_hpa_mpc_batch.sh [N_MPC [N_HPA]]     # matched batch
bash loadgen/scripts/run_mpc_v3_batch.sh all                 # calibrated MPC-only batch

Summarize a run:

mpc-summarize-run --run-dir experiments/_runs/mpc-online/step/<run-id> \
  --out-phase-csv /tmp/step_phases.csv --out-control-csv /tmp/step_control.csv

Run artifacts are written to experiments/_runs/, which is git-ignored. Curated evidence under experiments/ is also git-ignored; the repo commits only lightweight indices.

Local development

make toy-load-run
curl http://localhost:9090/healthz
curl "http://localhost:9090/work?cpu_ms=10&jitter_ms=5"
curl http://localhost:9090/metrics
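
If you would rather script these calls than use curl, a minimal standard-library sketch could look like this. The port, path, and query parameters (`cpu_ms`, `jitter_ms`) are taken from the curl examples above. `work_url` and `call_work` are hypothetical helper names, and the sketch assumes toy-load is already running locally.

```python
import time
import urllib.parse
import urllib.request


def work_url(base="http://localhost:9090", cpu_ms=10, jitter_ms=5):
    """Build the /work URL with the same query parameters as the curl example."""
    query = urllib.parse.urlencode({"cpu_ms": cpu_ms, "jitter_ms": jitter_ms})
    return f"{base}/work?{query}"


def call_work(url, timeout=5.0):
    """Send one GET request and return (HTTP status, elapsed seconds)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        status = resp.status
    return status, time.perf_counter() - start


print(work_url())
# Usage, with `make toy-load-run` active in another terminal:
# print(call_work(work_url(cpu_ms=20)))
```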

Useful targets: make help, make fmt, make check, make coverage. make check runs the toy-load checks used in CI (gofmt, go vet, tests, Helm lint, Helm template).

Observability, CI, releases, supply-chain

Metrics. toy-load exports toy_http_requests_total, toy_http_request_duration_seconds, toy_in_flight_requests, toy_work_cpu_seconds, toy_errors_total, and toy_panics_total. PromQL examples in docs/API.md.
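
Tail-latency figures such as p95 are typically derived from histogram buckets. The sketch below shows the linear-interpolation idea behind PromQL's `histogram_quantile`, on the assumption that toy_http_request_duration_seconds is a Prometheus histogram (its name suggests so, but this README does not state the metric type). The bucket bounds and counts are made up for illustration.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets: (upper bound in seconds, requests <= bound).
example = [(0.025, 40), (0.05, 90), (0.1, 98), (math.inf, 100)]
print(f"p95 = {bucket_quantile(0.95, example) * 1000:.2f} ms")
```

Because the estimate is interpolated within a bucket, its resolution depends on the bucket boundaries, which is worth keeping in mind when comparing percentiles from Prometheus with those reported by a load generator.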

CI. gofmt, go vet, go test (race), Go + Python coverage, packaged analysis install, shell + actionlint, dashboard/Helm schema validation, Helm lint + template, Kustomize render.

Security. CodeQL (Go + Python), govulncheck, Trivy fs + image scans, OpenSSF Scorecard, dependency review. All third-party Actions pinned by SHA.

Releases. Tag-driven via docs/RELEASE.md. Images publish to ghcr.io/vshulcz/toy-load with SBOM + provenance. Verify downloads against SHA256SUMS.

Status badges: Release · Pages · Security · CodeQL · Trivy · OpenSSF Scorecard

License

Apache License 2.0. See LICENSE. Security reporting: SECURITY.md. Support: SUPPORT.md.
