SRE Simulator

The Break-Fix Game for Azure Kubernetes Platforms

SRE Simulator is a training product that turns incident response into a hands-on game. An AI Dungeon Master generates realistic Kubernetes and OpenShift incidents, and players investigate and resolve them using a structured SRE method.

Why teams use it

Practice real-world Kubernetes and OpenShift troubleshooting in a safe environment.
Reinforce disciplined investigation phases instead of random command spam.
Build confidence in incident handling from junior to principal levels.
Measure decision quality with objective scoring, not only the final outcome.

Main product features

AI-generated break-fix scenarios at three difficulty levels.
Guided investigation workflow: Reading -> Context -> Facts -> Theory -> Action.
Chat-driven command support with terminal-style execution feedback.
Dashboard context panels for cluster signals and incident clues.
Score tracking for efficiency, safety, documentation, and accuracy.
Leaderboard to compare performance over time.
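
The guided investigation workflow listed above can be pictured as a small ordered state machine. The following is a minimal sketch only: the phase names come from the feature list, but `PHASES`, `nextPhase`, and `canTransition` are hypothetical names, and the rule that a player may only stay in a phase or advance by one step is an assumption, not the project's documented behavior.

```typescript
// Hypothetical sketch: phase names are taken from the feature list;
// the types and functions are illustrative, not the project's actual code.
const PHASES = ["Reading", "Context", "Facts", "Theory", "Action"] as const;
type Phase = (typeof PHASES)[number];

// Returns the phase that follows `current`, or null once Action is reached.
function nextPhase(current: Phase): Phase | null {
  const next = PHASES[PHASES.indexOf(current) + 1];
  return next ?? null;
}

// Assumed rule: a player may stay in the current phase or advance by
// exactly one step; skipping ahead or going back is rejected.
function canTransition(from: Phase, to: Phase): boolean {
  return to === from || nextPhase(from) === to;
}
```

Under these assumptions, `canTransition("Facts", "Theory")` is accepted while `canTransition("Reading", "Action")` is rejected, which is one way to discourage jumping straight to a fix before gathering evidence.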

How a session works

Choose a difficulty and get an incident ticket.
Investigate via chat, commands, and dashboard context.
Build and test hypotheses using observed evidence.
Apply the fix and complete the scenario.
Review your score and the areas to improve.

Quick start

git clone https://github.com/tuxerrante/SRESimulator.git
cd SRESimulator
make install
make dev

Open http://localhost:3000 in your browser.

Support this project

If SRE Simulator has helped you learn, demo, or teach incident response, and you would like to support my work on it, you can do that here:

GitHub Sponsors: @tuxerrante
Ko-fi: alessandroaffinito
Buy Me a Coffee: tuxerrante
Amazon wishlist: gift cards and project gear

Maintenance

Preview cleanup of generated artifacts in worktrees older than 14 days: make cleanup-worktrees-dry-run
Remove cached modules, coverage, and build output from old worktrees: make cleanup-worktrees
Install a weekly macOS launchd job (Sunday at 04:00 local time) that runs the same cleanup automatically: make install-weekly-worktree-cleanup
Remove the weekly cleanup job: make uninstall-weekly-worktree-cleanup

Deployment targets

Production-style semver deployments now target AKS by default. That path uses GHCR images, Envoy Gateway on the existing static public IP, and the custom hostname https://play.sresimulator.osadev.cloud. The frontend stays on a cluster-internal ClusterIP service, while the backend remains private and is reached only through the frontend's same-origin proxy. Azure SQL-backed backend scaling still activates when DB_SECRET_NAME is provided.

The explicit AKS rollback path is still publicService mode, which promotes only the frontend back to a public LoadBalancer service without changing the rest of the deployment workflow. The previous ARO deployment flow remains supported as a platform fallback; switch between platforms with CLUSTER_FLAVOR=aks|aro locally or PROD_CLUSTER_FLAVOR=aks|aro in GitHub Actions.
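
As an illustration of the platform switch described above, a deployment script could validate the flavor value before using it. This is a hedged sketch: `parseClusterFlavor` is a hypothetical helper, and falling back to `aks` when the variable is unset is an assumption drawn from the statement that deployments target AKS by default.

```typescript
// The two platform values named in this README.
type ClusterFlavor = "aks" | "aro";

// Hypothetical helper, not part of the repository.
// Assumption: an unset or empty value falls back to "aks", the stated default.
function parseClusterFlavor(value: string | undefined): ClusterFlavor {
  if (value === undefined || value === "") {
    return "aks";
  }
  if (value === "aks" || value === "aro") {
    return value;
  }
  // Fail fast on typos rather than deploying to an unintended platform.
  throw new Error(`Unsupported cluster flavor "${value}" (expected aks or aro)`);
}
```

In a Node.js script this could be called as `parseClusterFlavor(process.env.CLUSTER_FLAVOR)` locally, or with `PROD_CLUSTER_FLAVOR` in a GitHub Actions job.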

Most customer-managed Azure resources remain in the main resource group. The only expected exception is the AKS-managed node resource group, which Azure creates automatically.

Production Sentry rollout is wired through Helm values rather than image rebuilds. frontend.sentry.* populates the NEXT_PUBLIC_SENTRY_* environment variables of the deployed frontend container. Browser Sentry initialization reads a runtime bootstrap script from the same-origin telemetry route and initializes once that config is available, so enablement stays runtime-driven and config is not frozen at build time. backend.sentry.* continues to drive the server-side SENTRY_* environment variables directly. Keep both disabled until the production DSNs are available, and leave the frontend replay sample rates at 0 for launch unless production volume proves it is safe to raise them.

Browser source-map upload is separate from runtime enablement: the frontend build uploads source maps only when SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are all present at build time. If you set SENTRY_RELEASE, treat it as a build-time value only: it must exactly match the release of the uploaded artifacts, and if it cannot, omit it entirely. Builds still succeed without these variables, but source-map upload stays disabled.
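
The gating rule above can be sketched as a small check over the build environment. This is an illustrative example, not the repository's build code: `missingUploadVars` and `shouldUploadSourceMaps` are hypothetical names, and treating an empty string the same as a missing variable is an assumption.

```typescript
// Environment-like map, e.g. process.env in a Node.js build script.
type Env = Record<string, string | undefined>;

// The three variables this README names as required for upload.
const REQUIRED_UPLOAD_VARS = [
  "SENTRY_AUTH_TOKEN",
  "SENTRY_ORG",
  "SENTRY_PROJECT",
] as const;

// Hypothetical helper: lists the required variables that are unset or empty.
function missingUploadVars(env: Env): string[] {
  return REQUIRED_UPLOAD_VARS.filter((name) => !env[name]);
}

// Upload only when all three are present; the build itself proceeds
// either way, so this returns a flag instead of throwing.
function shouldUploadSourceMaps(env: Env): boolean {
  return missingUploadVars(env).length === 0;
}
```

Returning the list of missing names makes it possible to log why upload was skipped without failing the build.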

Use generic test events only as ingest smoke checks. To verify actor/session/request correlation, trigger a request-driven chat or command failure through the deployed same-origin proxy path.

Documentation

Product architecture: docs/ARCHITECTURE.md
Runtime internals: docs/AI_RUNTIME.md
Setup, production operations, and post-apply checklist: infra/POST_APPLY_CHECKLIST.md
Release and versioning policy: docs/RELEASES.md
Original product design: CLAUDE.md

Roadmap

Train and deploy a product-specific model profile for better SRE guidance quality and lower response latency.
