Project 13 · Cloud Computing · Spring 2025–2026 Furkan Can Karafil · Halil Utku Demirtaş
End-to-end pipeline that ingests AWS CUR-style billing data, runs three anomaly detectors in parallel (STL Decomposition, Isolation Forest, Z-Score), generates severity-scored alerts, and visualizes everything in a Streamlit dashboard.
📄 Full technical write-up:
REPORT.md· 🎬 Demo walkthrough:DEMO.md· 🎤 Slide deck:slides/deck.md
# 1. Install
python -m venv .venv
. .venv/Scripts/activate # Windows PowerShell: .venv\Scripts\Activate.ps1
pip install -r requirements.txt
# 2. Generate synthetic data + run the full pipeline
python scripts/run_pipeline.py
# 3. Launch the dashboard
streamlit run dashboard/app.pyOutputs land in outputs/:
detections_{detector}.csv— per-day detector flags + scoresalerts_{detector}.{csv,json}— severity-banded alert logattribution_{detector}.csv— root-cause hint per alert (which region / usage_type drove the spend)comparison.csv— Precision / Recall / F1 by anomaly type, per detectoralert_quality.csv— alert quality (true-positive rate) by severity band
To get statistically defensible numbers (mean ± std across 25 random seeds):
python scripts/run_benchmark.py --seeds 25To re-render the presentation figures from a fresh run:
python scripts/make_figures.py # writes slides/figures/*.pngsrc/cloud_anomaly/
config.py project constants (services, paths, severity bands)
synthetic_data.py AWS CUR-style data generator + ground-truth labels
preprocessing.py load, aggregate, pivot, gap-fill
detectors/ zscore, stl, iforest, ensemble — common detect(df) interface
alerts.py severity = deviation × duration × $impact
attribution.py root-cause hint per alert (region / usage_type)
evaluation.py Precision / Recall, alert quality, TTD,
cost-saved estimate, bootstrap CI, Wilcoxon test
forecast.py Holt-Winters per-service forecast + projection
theoretical_scores.py proposal a-priori ratings (radar charts)
benchmark.py multi-seed Monte Carlo runner
pipeline.py run() — wires everything together
dashboard/app.py Streamlit UI (9 tabs: cost trend / alert log /
root-cause / detector comparison / calendar /
forecast / lab / replay / raw data)
scripts/
run_pipeline.py single-run CLI
run_benchmark.py 25-seed CLI
make_figures.py renders presentation PNGs
slides/
deck.md Marp slide deck (renders to PDF/HTML)
SLIDE_UPDATES.md per-slide guide for the existing deck
figures/ 4 ready-to-use 16:9 PNGs
examples/ committed sample artifacts
tests/ smoke tests, run on every CI commit
.github/workflows/ci.yml CI: pytest + pipeline on Python 3.11 and 3.12
data/raw/ generated CUR + labels (gitignored)
outputs/ run artifacts (gitignored)
| Type | Description | Example cause |
|---|---|---|
| Point spike | Single-day cost explosion | Infinite loop |
| Level shift | Persistent step up after change | Mis-sized instances |
| Gradual drift | Slow upward creep over a window | Data accumulation |
Each injected anomaly is recorded in data/raw/ground_truth_labels.csv so
detector outputs can be evaluated with real Precision / Recall numbers.
Every detector returns a frame with:
| column | type | meaning |
|---|---|---|
date |
datetime | day |
service |
str | AWS service name |
cost |
float | observed cost on that day |
score |
float | anomaly score (higher = stranger) |
is_anomaly |
bool | flagged by the detector |
This is what makes the alert module and evaluation framework detector-agnostic.
Mean ± std across 25 random seeds (python scripts/run_benchmark.py --seeds 25). Full table in examples/benchmark_summary.csv.
| Detector | Point spike | Level shift | Gradual drift | Overall |
|---|---|---|---|---|
| Z-Score | 0.962 ± 0.078 | 0.012 ± 0.033 | 0.000 ± 0.000 | 0.105 ± 0.018 |
| STL | 0.522 ± 0.082 | 0.616 ± 0.204 | 0.734 ± 0.052 | 0.757 ± 0.064 |
| Isolation Forest | 0.247 ± 0.035 | 0.216 ± 0.060 | 0.217 ± 0.034 | 0.319 ± 0.036 |
- No single method wins all anomaly types — the central thesis of the project is empirically supported.
- STL is the strongest overall detector and handles trend-based anomalies (drift, level shift) cleanly.
- Z-Score is a perfect point-spike detector but completely blind to drift and level shifts, exactly as expected from a stationary baseline.
- Isolation Forest catches every point spike (recall = 1.0 there) but struggles to flag persistent shifts because they look "in distribution" once they stabilise — a known limitation of unsupervised tree models on univariate cost data.
For every alert the pipeline produces a one-line, human-readable hint about which CUR dimension drove the spend above its 14-day baseline:
EC2 spend on 2025-03-19 is $957 (+391% vs 14-day baseline); us-east-1 region drove 100% of the increase.
Attribution is computed per (date, service) by decomposing the spend
along region and usage_type, comparing against the trailing
14-day per-value baseline, and reporting the dimension+value that
contributed most to the anomaly delta. Available in
outputs/attribution_{detector}.csv
and on the dashboard's Root-cause tab.
This is a Level-1-friendly take on the Level-2 "root-cause attribution" deliverable — concise, deterministic, and immediately useful for FinOps triage.
pytest -qThe Streamlit dashboard is one-click deployable to Streamlit Community Cloud — the easiest path to a live URL for the demo.
- Sign in at https://streamlit.io/cloud with your GitHub account.
- Click New app, point it at this repository, branch
main, main file path:dashboard/app.py. - Python version: 3.11. The platform installs everything from
requirements.txtautomatically; no extra config is needed. - Once it builds (~3 min), Streamlit publishes a public URL of the form
https://<app-name>.streamlit.app. Share it during the demo.
.streamlit/config.toml is committed and pre-configures the dark theme
and the brand color, so the deployed instance looks identical to local.
For a containerized deploy (ECS, Cloud Run, Fly.io, Render), see
REPORT.md § Cloud architecture.
docker compose up --build # dashboard on :8501, REST API on :8000The compose file boots two services off the same image:
dashboard— Streamlit UI (http://localhost:8501).api— FastAPI REST surface (http://localhost:8000, OpenAPI at/docs).
Both mount ./data, ./outputs, and ./examples as volumes so artifacts
survive container restarts.
The same detection pipeline is also exposed as an HTTP service so it can sit behind API Gateway / ALB in a real cloud deploy.
uvicorn cloud_anomaly.api:app --reload --port 8000Endpoints:
| Method | Path | Purpose |
|---|---|---|
| GET | /health |
Liveness probe |
| GET | / |
Service metadata + detector list |
| POST | /generate |
Produce a synthetic dataset (n_days, seed) |
| POST | /detect |
Run a detector on supplied long-format JSON |
| POST | /alerts |
Detect → severity-band → root-cause attribution |
| GET | /metrics |
Multi-detector P/R/F1 against on-disk ground truth |
| GET | /forecast |
Holt-Winters per-service forecast (horizon_days) |
Browse the auto-generated OpenAPI docs at /docs (Swagger UI) or /redoc.
.github/workflows/benchmark.yml re-runs the 25-seed Monte Carlo every
Sunday at 02:00 UTC and uploads outputs/benchmark_summary.csv,
outputs/benchmark_raw.csv, and the regenerated presentation figures
as a 90-day-retained workflow artifact. Trigger a manual run from the
Actions tab if you want fresh numbers ahead of a demo.
After the first release (v1.0.0 tag), the package is on PyPI:
pip install costsight # core: detectors + alerts + attribution
pip install "costsight[dashboard]" # + Streamlit dashboard deps
pip install "costsight[api]" # + FastAPI / uvicorn
pip install "costsight[llm]" # + anthropic SDK for AI explanations
pip install "costsight[dev]" # everything, plus pytestShell commands installed alongside the package:
costsight-pipeline --days 90 --seed 42 --scenario drift_heavy
costsight-benchmark --seeds 25
costsight-api --host 0.0.0.0 --port 8000Programmatic use:
from cloud_anomaly.synthetic_data import generate
from cloud_anomaly.detectors import DETECTORS
from cloud_anomaly.alerts import build_alerts
from cloud_anomaly.carbon import carbon_footprint
cur, labels, _ = generate(n_days=90, seed=42)
detections = DETECTORS["stl"](cur.groupby(["date","service"]).sum().reset_index())
alerts = build_alerts(detections, detector_name="stl", dataset_days=90)
carbon = carbon_footprint(cur)
print(f"This run emitted {carbon.kg_co2:.0f} kgCO₂-eq ({carbon.km_driven_equiv:.0f} km equiv).")Releases are tag-driven: pushing v1.x.y triggers the
.github/workflows/release.yml workflow which builds the
sdist + wheel and publishes to PyPI via trusted-publishing
(no API token in the repo).
The production-path architecture documented in
REPORT.md § 4.1 is
shipped as a real Terraform module under terraform/:
cd terraform/
terraform init
terraform plan -var="env=dev" -var="alert_email=you@example.com"
terraform apply -var="env=dev" -var="alert_email=you@example.com"Brings up: S3 raw + aggregated buckets, DynamoDB alerts table (PITR + TTL), SNS alerts topic with optional email subscription, ingest Lambda
- S3 trigger, and (optionally) a self-hosted dashboard ECS service. Steady-state cost ~$5/mo per tenant at the default toggles.
Phase 1 (May 20 deadline): synthetic data, three detectors plus an ensemble vote, alert module, root-cause attribution, P/R evaluation, multi-seed benchmark, dashboard with calendar / forecast / lab / replay tabs, statistical significance tests. Phase 2 (post-finals): comparison report extension, paper-style writeup. Out of scope: real-time streaming, multi-cloud ingestion, production deployment of the detection pipeline (the dashboard is deployable; the pipeline remains batch).
MIT — see also CONTRIBUTING.md for how to extend the project with new detectors or anomaly types.
- Furkan Can Karafil (@Urthella) · 222010020013
- Halil Utku Demirtaş · 222010020054