Research repository for controlled Kubernetes autoscaling experiments: a controllable Go workload, Helm deployment, HPA baseline, MPC controller, offline simulator, and reproducibility tooling.
The core question: can a small Model Predictive Control loop anticipate demand and scale more smoothly than reactive HPA baselines under step, spike, and seasonal traffic?
In one tracked 200 rps spike pair, Hybrid-SA MPC showed lower burst p95 latency than the HPA60 baseline while both runs kept 100% request success. This is a representative snapshot, not a final benchmark claim. Exact paths, caveats, and rebuild commands live in docs/RESULTS.md.
| What you can inspect | Path |
|---|---|
| Results snapshot and caveats | docs/RESULTS.md |
| Benchmark matrix | docs/BENCHMARK_MATRIX.md |
| Ten-second demo narrative | docs/DEMO.md |
| Public interface | docs/API.md |
| Methodology | docs/METHODOLOGY.md |
| Known limitations | docs/LIMITATIONS.md |
| Static docs site | https://vshulcz.github.io/mpc-autoscaler/ |
| Public roadmap | https://github.com/users/vshulcz/projects/2 |
| Q&A for setup and reproducibility | vshulcz#77 |
Methodology feedback, baseline suggestions, and reproduction reports are welcome.
Docs site: https://vshulcz.github.io/mpc-autoscaler/.
Roadmap board: https://github.com/users/vshulcz/projects/2. Active work is tracked through milestones v0.2.0, thesis-reproducibility, and v0.3.0.
For setup questions, reproduction help, and "which path should I use?" questions, use the Q&A entry thread. If an answer solves your question, mark it as accepted so the next reader can find it quickly. For lightweight contribution ideas or small PR proposals, use the Discussions starter thread. Use Issues for tracked bugs and scoped implementation work.
Contribution guidelines live in CONTRIBUTING.md. Support guidance lives in SUPPORT.md. Release steps are documented in docs/RELEASE.md. Security reporting guidance lives in SECURITY.md.
Representative tracked spike runs, not aggregate benchmark results:
| Controller | Burst throughput | Burst p95 | Burst p99 | Max latency | Success | Max replicas |
|---|---|---|---|---|---|---|
| HPA60 baseline | 197.91 rps | 85.175 ms | 128.983 ms | 276.229 ms | 100.00% | 27 |
| Hybrid-SA MPC | 199.90 rps | 52.483 ms | 71.048 ms | 97.157 ms | 100.00% | 28 |
See docs/RESULTS.md, docs/METHODOLOGY.md, and docs/LIMITATIONS.md before interpreting the numbers.
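To read the table above as relative differences, the arithmetic can be sketched as follows. The values are copied from the two tracked rows; the helper name `relative_reduction` is hypothetical and not part of the repository. The output describes this single pair of runs only and is not an aggregate benchmark.

```python
# Latency values in milliseconds, copied from the tracked spike pair in the table above.
HPA60 = {"p95": 85.175, "p99": 128.983, "max": 276.229}
HYBRID_SA_MPC = {"p95": 52.483, "p99": 71.048, "max": 97.157}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Return how much lower `candidate` is than `baseline`, as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    for key in ("p95", "p99", "max"):
        pct = 100 * relative_reduction(HPA60[key], HYBRID_SA_MPC[key])
        print(f"burst {key}: {pct:.1f}% lower in this single tracked pair")
```

A single pair says nothing about run-to-run variance, so these percentages should not be quoted without the caveats in docs/LIMITATIONS.md.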
For broader coverage, see docs/BENCHMARK_MATRIX.md. It separates indexed evidence roots from published numeric claims so missing cells stay visible.
The short version: controlled traffic hits toy-load, Prometheus exposes service metrics, HPA reacts to measured pressure, MPC forecasts short-horizon demand, and analysis tools compare latency, success, and replica behavior. Full storyboard: docs/DEMO.md.
- controllable HTTP workload with Prometheus metrics, Helm chart, raw manifests, and GHCR release image;
- online MPC controller that can run in dry-run mode or apply Kubernetes scale decisions;
- offline simulator and grid-search tooling for controller tuning;
- repeatable HPA and MPC runners for step, spike, and seasonality scenarios;
- curated evidence policy that keeps bulky raw runs out of Git while preserving provenance;
- CI, release, dependency-update, issue-template, and security automation for public maintenance.
The project combines three parts:
- toy-load/: a standalone Go module with a controllable HTTP workload service. See toy-load/README.md for API details and the runtime configuration table.
- analysis/: offline and online MPC tooling used to tune and evaluate the controller.
- deploy/, dashboards/, and loadgen/: ArgoCD applications, monitoring assets, Grafana dashboards, and repeatable load-generation scripts. See loadgen/README.md for runner details.
This repository is intended for controlled experiments rather than production use. The goal is to compare a reactive HPA-style policy against an MPC-based controller under reproducible traffic profiles, while keeping assumptions and limitations visible.
Supported experiment scenarios:
- step: sustained increase in load.
- spike: short high-intensity burst.
- seasonality: smooth sinusoidal variation.
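The three scenario shapes can be illustrated with a small rate function. This is a sketch under stated assumptions, not the implementation behind `mpc-generate-synthetic-trace`: the function name `demand_rps`, the base and peak rates, the run duration, and the exact timing of the step and spike are all hypothetical.

```python
import math


def demand_rps(scenario: str, t: float, base: float = 50.0, peak: float = 200.0,
               duration: float = 600.0) -> float:
    """Return a target request rate (rps) at second `t` of a run.

    Shapes and defaults are illustrative assumptions only.
    """
    if not 0 <= t <= duration:
        raise ValueError("t must lie within the run duration")
    if scenario == "step":
        # Sustained increase: base load, then peak load from one third of the run onward.
        return peak if t >= duration / 3 else base
    if scenario == "spike":
        # Short high-intensity burst: peak load for the middle 10% of the run only.
        start, end = 0.45 * duration, 0.55 * duration
        return peak if start <= t < end else base
    if scenario == "seasonality":
        # Smooth sinusoid: one full period per run, oscillating between base and peak.
        mid, amp = (base + peak) / 2, (peak - base) / 2
        return mid + amp * math.sin(2 * math.pi * t / duration)
    raise ValueError(f"unknown scenario: {scenario}")


if __name__ == "__main__":
    for name in ("step", "spike", "seasonality"):
        samples = [round(demand_rps(name, t), 1) for t in range(0, 601, 100)]
        print(name, samples)
```

Sampling each profile at a fixed interval gives a quick visual check of whether a scenario stresses sustained capacity (step), reaction time (spike), or tracking (seasonality).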
Use this repository when you need:
- a small Kubernetes autoscaling lab that can be inspected end to end;
- a controllable workload for testing metrics, HPA behavior, dashboards, and load profiles;
- a reproducible comparison pattern for reactive and predictive scaling policies;
- examples of evidence packaging, caveats, release automation, and public research-software maintenance.
Do not use it as:
- a drop-in production autoscaler;
- proof that MPC is generally better than HPA;
- tuning advice for arbitrary clusters or workloads.
Public contracts and scripting surfaces are documented in docs/API.md.
flowchart LR
Load[Load profiles] --> Workload[toy-load /work]
Workload --> Metrics[Prometheus metrics]
Metrics --> HPA[HPA baseline]
Metrics --> MPC[MPC controller]
HPA --> Scale[Kubernetes scale target]
MPC --> Scale
Scale --> Workload
Runs[Run artifacts] --> Analysis[Analysis package]
Analysis --> Results[Summaries and figures]
See docs/ARCHITECTURE.md for component boundaries, data flow, and extension points.
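For orientation, the reactive branch of the diagram follows the replica rule documented for the Kubernetes Horizontal Pod Autoscaler: scale the current replica count by the ratio of the observed metric to its target, and skip the change when the ratio is within a tolerance of 1.0. The sketch below is a simplified rendering of that rule. The function name, the replica bounds, and the example target of 60 are assumptions for illustration, and stabilization windows and scaling policies are deliberately left out.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                         min_replicas: int = 1, max_replicas: int = 30,
                         tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current_replicas * current_metric / target_metric)."""
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas observing 90 against a target of 60 (assumed values).
    print(hpa_desired_replicas(4, 90.0, 60.0))
```

The rule only reacts to pressure that has already been measured, which is the behavior the MPC branch is meant to be compared against.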
- Go 1.25
- Python 3.11+
- Docker
- kubectl
- Helm
- access to a Kubernetes cluster for online experiments
Optional but useful:
- vegeta for local load generation
- coverage.py for local Python coverage reports
- a local virtual environment in .venv/ for Python tooling
toy-load/ Standalone Go module for the controllable workload service
  cmd/toy-load/ Go application entry point
  internal/ Config, HTTP handling, metrics, and workload simulation
  deploy/helm/toy-load/ Helm chart for the service
  deploy/manifests/ Raw Kubernetes manifests for the service
analysis/
  mpc_autoscaler_analysis/ Python package for offline simulation, online control, and artifact summaries
  mpc_autoscaler_analysis/data/traces/ Small input traces for offline simulations
  tests/ Dependency-light unit tests for analysis tooling
deploy/
  argocd/ ArgoCD applications
  monitoring/ Kustomize monitoring stack manifests used in experiments
dashboards/ Grafana dashboard JSON
loadgen/scripts/ Local and in-cluster load-generation entry points
docs/ Architecture, reproducibility, and release notes
New experiment artifacts are written to ignored experiments/_runs/ by default.
Curated local evidence and archive roots stay ignored under experiments/; the
repository commits only lightweight indices and packaging instructions.
These are the scripts and commands you are most likely to use:
- make -C toy-load run: run the service locally.
- bash loadgen/scripts/run_hpa_experiment_incluster.sh <scenario>: run one HPA baseline experiment in-cluster.
- bash loadgen/scripts/run_mpc_experiment_incluster.sh <scenario>: run one MPC-controlled experiment in-cluster.
- bash loadgen/scripts/run_hpa_mpc_batch.sh [N_MPC [N_HPA]]: run matched HPA and MPC batches.
- bash loadgen/scripts/run_mpc_v3_batch.sh [scenario|all]: run the calibrated MPC-only batch.
- mpc-offline-sim ...: run the offline simulator on a trace after installing analysis.
- mpc-validate-trace ...: check an offline trace CSV before simulation.
- docs/REPRODUCIBILITY.md: choose the lightest reproduction path for local checks, offline simulation, saved evidence, or live cluster runs.
Run the service:
make toy-load-run
curl "http://localhost:9090/work?cpu_ms=10&jitter_ms=5"
curl http://localhost:9090/metrics
Useful Make targets:
make help
make fmt
make check
make coverage
make toy-load-run
make toy-load-build
make check runs the toy-load checks used in CI: formatting check, go vet, tests, Helm lint, and Helm template rendering.
make coverage writes Go and Python coverage reports under ignored coverage/.
No cluster needed:
python3 -m venv .venv
source .venv/bin/activate
pip install -e analysis
mpc-generate-synthetic-trace --scenario spike --out analysis/out/spike.csv
mpc-validate-trace --trace-csv analysis/out/spike.csv
mpc-offline-sim --trace-csv analysis/out/spike.csv --out-dir analysis/out/offline/spike
Service smoke test:
make toy-load-run
In another terminal:
curl http://localhost:9090/healthz
curl "http://localhost:9090/work?cpu_ms=10&jitter_ms=5"
curl http://localhost:9090/metrics
For offline analysis and the online MPC controller, create a virtual environment and install the analysis package:
python3 -m venv .venv
source .venv/bin/activate
pip install -e analysis
Example offline run:
mpc-generate-synthetic-trace \
--scenario step \
--out analysis/out/step.csv
mpc-validate-trace \
--trace-csv analysis/out/step.csv
mpc-offline-sim \
--trace-csv analysis/out/step.csv \
--out-dir analysis/out/offline/step
Deploy with Helm:
helm upgrade --install toy-load toy-load/deploy/helm/toy-load \
--namespace default \
--create-namespace
Or apply the raw manifests:
kubectl apply -f toy-load/deploy/manifests
Monitoring manifests require Prometheus Operator and Grafana Operator CRDs:
kubectl apply -k deploy/monitoring
ArgoCD application manifests live under deploy/argocd/.
The Helm chart defaults to ghcr.io/vshulcz/toy-load:main. For a pinned run, override the tag explicitly:
helm upgrade --install toy-load toy-load/deploy/helm/toy-load \
--namespace default \
--set image.tag=<commit-or-release-tag>
GitHub Releases include cross-platform toy-load binaries, a packaged Helm
chart, and SHA256SUMS. Verify downloaded assets before unpacking or installing
them.
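Where neither `sha256sum` nor `shasum` is available, the same check can be approximated in Python. This is a minimal sketch, assuming the SHA256SUMS file uses the usual `<hex digest>  <file name>` layout written by `sha256sum` and that the downloaded assets sit next to it. The function names `sha256_of` and `verify_downloads` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(sums_file: Path, wanted: list[str]) -> dict[str, bool]:
    """Check each file named in `wanted` against its entry in a SHA256SUMS file."""
    expected: dict[str, str] = {}
    for line in sums_file.read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # Assumed layout: "<hex>  <name>"; a leading "*" on the name marks binary mode.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    results: dict[str, bool] = {}
    for name in wanted:
        if name not in expected:
            raise KeyError(f"{name} is not listed in {sums_file.name}")
        results[name] = sha256_of(sums_file.parent / name) == expected[name]
    return results


if __name__ == "__main__":
    # File names follow the release naming used in the commands below.
    names = ["toy-load-v0.1.0-linux-amd64.tar.gz", "toy-load-0.1.0.tgz"]
    print(verify_downloads(Path("SHA256SUMS"), names))
```

Any `False` entry means the asset should be discarded and downloaded again before unpacking or installing it.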
Linux:
VERSION=v0.1.0
ARCH=amd64 # or arm64
BASE="https://github.com/vshulcz/mpc-autoscaler/releases/download/${VERSION}"
curl -LO "${BASE}/toy-load-${VERSION}-linux-${ARCH}.tar.gz"
curl -LO "${BASE}/toy-load-${VERSION#v}.tgz"
curl -LO "${BASE}/SHA256SUMS"
grep -E " (toy-load-${VERSION}-linux-${ARCH}.tar.gz|toy-load-${VERSION#v}.tgz)$" SHA256SUMS \
| sha256sum -c -
macOS:
VERSION=v0.1.0
ARCH=arm64 # or amd64
BASE="https://github.com/vshulcz/mpc-autoscaler/releases/download/${VERSION}"
curl -LO "${BASE}/toy-load-${VERSION}-darwin-${ARCH}.tar.gz"
curl -LO "${BASE}/toy-load-${VERSION#v}.tgz"
curl -LO "${BASE}/SHA256SUMS"
grep -E " (toy-load-${VERSION}-darwin-${ARCH}.tar.gz|toy-load-${VERSION#v}.tgz)$" SHA256SUMS \
| shasum -a 256 -c -
For a staged reproduction path, start with docs/REPRODUCIBILITY.md. It separates local checks, offline simulations, saved-artifact summaries, live Kubernetes experiments, and release reproduction.
Single-run baseline and MPC workflows:
# HPA baseline
bash loadgen/scripts/run_hpa_experiment_incluster.sh step
# MPC controller
bash loadgen/scripts/run_mpc_experiment_incluster.sh step
Matched batch runs:
# default: 5 MPC runs and 3 HPA runs per scenario
bash loadgen/scripts/run_hpa_mpc_batch.sh
# custom counts
bash loadgen/scripts/run_hpa_mpc_batch.sh 3 2
Calibrated MPC-only batch:
# 8 calibrated MPC v3 runs per scenario
bash loadgen/scripts/run_mpc_v3_batch.sh all
Batch logs are written to ignored experiments/_runs/progress/.
The online controller writes a CSV control log for each MPC run. A helper script converts run artifacts into compact CSV summaries:
mpc-summarize-run \
--run-dir experiments/_runs/mpc-online/step/<run-id> \
--out-phase-csv /tmp/step_phases.csv \
--out-control-csv /tmp/step_control.csv
The controller uses a backlog-state MPC formulation.
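As a rough illustration of what a backlog-state receding-horizon step can look like, the sketch below uses a generic queue model and exhaustive search over short replica plans. Everything in it is an assumption made for illustration: the backlog update `b_next = max(0, b + (demand - per_replica_rps * replicas) * dt)`, the cost weights, the step limit, and the function name `mpc_step`. It is not the repository's controller, its Hybrid-SA variant, or its exact cost and constraints.

```python
from itertools import product


def mpc_step(backlog: float, replicas: int, forecast_rps: list[float],
             per_replica_rps: float = 10.0, dt: float = 15.0,
             r_min: int = 1, r_max: int = 30, max_step: int = 2,
             w_backlog: float = 1.0, w_replica: float = 5.0, w_change: float = 2.0) -> int:
    """Return the first replica count of the cheapest plan over the forecast horizon."""
    best_cost, best_first = float("inf"), replicas
    # Each candidate plan is one replica change per forecast step (assumed rate limit).
    for deltas in product(range(-max_step, max_step + 1), repeat=len(forecast_rps)):
        b, r, cost, feasible = backlog, replicas, 0.0, True
        for delta, demand in zip(deltas, forecast_rps):
            r += delta
            if not r_min <= r <= r_max:
                feasible = False
                break
            # Assumed dynamics: backlog grows when demand exceeds capacity, never below zero.
            b = max(0.0, b + (demand - per_replica_rps * r) * dt)
            cost += w_backlog * b + w_replica * r + w_change * abs(delta)
        if feasible and cost < best_cost:
            best_cost, best_first = cost, replicas + deltas[0]
    # Only the first move is applied; the plan is recomputed on the next tick.
    return best_first


if __name__ == "__main__":
    # Current demand is 50 rps, but the forecast expects 200 rps on the next two ticks.
    print(mpc_step(backlog=0.0, replicas=5, forecast_rps=[50.0, 200.0, 200.0]))
```

In this toy setup the plan scales up before the forecast burst arrives, which is the anticipatory behavior the experiments compare against a reactive baseline. Exhaustive search is only practical for very short horizons and small step limits.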
State update between control ticks:
Optimization problem solved each tick:
subject to:
Here,
Key metrics exported by toy-load:
| Metric | Meaning |
|---|---|
| toy_http_requests_total{method,path,code} | request count |
| toy_http_request_duration_seconds | request latency histogram |
| toy_in_flight_requests | current number of in-flight requests |
| toy_work_cpu_ms | requested CPU work per request |
| toy_errors_total{reason} | application error counters |
Useful PromQL queries:
sum(rate(toy_http_requests_total{path="/work"}[1m]))
histogram_quantile(
0.95,
sum(rate(toy_http_request_duration_seconds_bucket{path="/work"}[1m])) by (le)
)
toy_in_flight_requests
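The queries above can also be issued from a script through the Prometheus HTTP API instant-query endpoint, `/api/v1/query`. The sketch below builds the request URL and reads the first sample from the JSON response. The helper names and the base URL in the usage line are hypothetical; point it at whatever address your Prometheus instance is reachable on.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# The p95 latency query from the list above, as a single string.
P95_QUERY = (
    "histogram_quantile(0.95, "
    'sum(rate(toy_http_request_duration_seconds_bucket{path="/work"}[1m])) by (le))'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_sample_value(payload: dict) -> Optional[float]:
    """Return the first sample of an instant-vector response, or None if it is empty."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def query_p95_seconds(base_url: str, timeout: float = 5.0) -> Optional[float]:
    """Run the p95 query against a live Prometheus and return the latency in seconds."""
    with urllib.request.urlopen(build_query_url(base_url, P95_QUERY), timeout=timeout) as resp:
        return first_sample_value(json.load(resp))


if __name__ == "__main__":
    # Hypothetical address; replace with your Prometheus endpoint.
    print(query_p95_seconds("http://prometheus.example:9090"))
```

The quantile query can return NaN when the histogram has received no samples in the window, so callers may want to treat NaN and `None` alike.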
GitHub Actions runs the following checks on pushes and pull requests:
- formatting check with gofmt
- go vet in toy-load/
- go test ./... in toy-load/
- Go and Python coverage collection with uploaded CI artifacts
- dependency-light Python unit tests and compile checks
- shell syntax checks for experiment runners
- GitHub Actions workflow linting with actionlint
- JSON validation for Grafana dashboards and Helm schema
- Helm lint
- Helm template rendering
- Kustomize rendering for monitoring manifests
- CodeQL analysis for Go and Python
- Go vulnerability scanning with govulncheck
- Trivy filesystem and container image scanning with SARIF uploads
- OpenSSF Scorecard supply-chain checks
- dependency review on pull requests
Container images are built and published to ghcr.io/vshulcz/toy-load on main pushes and release runs. Tags include main, sha-*, semver release tags, and latest for semver releases.
The image build also publishes SBOM and provenance attestations.
Release automation is tag driven:
- run the Tag Release workflow with a tag like v0.1.0, or push an annotated v*.*.* tag manually;
- Release builds cross-platform toy-load binaries, packages the Helm chart, writes checksums, publishes the GHCR image, and creates a GitHub Release.
See docs/RELEASE.md for the full release checklist.
Useful contribution areas include controller comparators, traffic traces, dashboard panels, artifact parsers, Kubernetes portability, and documentation for reproducible experiments.
See ROADMAP.md for project directions and CONTRIBUTING.md before opening a pull request.
Current small-PR queue: Contributor sprint discussion.
Contribution scope:
| Time | Good first contribution |
|---|---|
| 5 minutes | Fix links, examples, glossary text, or README wording. |
| 15 minutes | Add one smoke-test command, metric explanation, or docs-site card. |
| 1 hour | Add a parser test, dashboard panel note, or release verification example. |
| Deeper | Compare controllers, improve benchmark summaries, or add reproducible figures. |
Thanks to contributors whose pull requests have shipped in this repository:
| Contributor | Shipped work |
|---|---|
| @dicnunz | Release checksum verification docs in #25. |
| @tatakaisun | toy-load examples, environment docs, Helm/loadgen references, trace sample docs, and trace CSV validator work in #37-#44. |
| @ayushkli86 | support guide, docs-only checklist, labeler docs, and security footer link in #57-#60. |
| @msaqibatifj | toy-load HTTP status code reference in #95. |
Merged external PRs are credited here and in release notes when they affect a release.
Apache License 2.0. See LICENSE.
