costsight — Automated Cloud Cost Anomaly Detection

Project 13 · Cloud Computing · Spring 2025–2026 · Furkan Can Karafil · Halil Utku Demirtaş

CI · License: MIT · Python 3.11+

End-to-end pipeline that ingests AWS CUR-style billing data, runs three anomaly detectors in parallel (STL Decomposition, Isolation Forest, Z-Score), generates severity-scored alerts, and visualizes everything in a Streamlit dashboard.

📄 Full technical write-up: REPORT.md · 🎬 Demo walkthrough: DEMO.md · 🎤 Slide deck: slides/deck.md

Quick start

# 1. Install
python -m venv .venv
. .venv/Scripts/activate          # Git Bash on Windows; Linux/macOS: . .venv/bin/activate; PowerShell: .venv\Scripts\Activate.ps1
pip install -r requirements.txt

# 2. Generate synthetic data + run the full pipeline
python scripts/run_pipeline.py

# 3. Launch the dashboard
streamlit run dashboard/app.py

Outputs land in outputs/:

  • detections_{detector}.csv — per-day detector flags + scores
  • alerts_{detector}.{csv,json} — severity-banded alert log
  • attribution_{detector}.csv — root-cause hint per alert (which region / usage_type drove the spend)
  • comparison.csv — Precision / Recall / F1 by anomaly type, per detector
  • alert_quality.csv — alert quality (true-positive rate) by severity band
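
As a rough illustration of what alert_quality.csv summarises, the sketch below computes a true-positive rate per severity band. It uses only the standard library; the band names and the (band, matched) tuple layout are assumptions made for the example, not the project's actual schema.

```python
from collections import defaultdict

def alert_quality(alerts):
    """Return {band: true-positive rate} for (band, is_true_positive) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for band, is_tp in alerts:
        totals[band] += 1
        hits[band] += int(is_tp)
    return {band: hits[band] / totals[band] for band in totals}

# Hypothetical alert log: (severity band, matched a ground-truth anomaly?)
sample_alerts = [
    ("high", True), ("high", True), ("high", False),
    ("low", True), ("low", False), ("low", False), ("low", False),
]
quality = alert_quality(sample_alerts)
for band, rate in sorted(quality.items()):
    print(f"{band}: {rate:.2f}")
```

A higher rate in the upper bands than in the lower ones would suggest the severity score is ranking alerts usefully.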

To get statistically defensible numbers (mean ± std across 25 random seeds):

python scripts/run_benchmark.py --seeds 25
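
The "mean ± std across seeds" summary can be sketched as follows. This is a stand-in, not the project's benchmark runner: run_once is a hypothetical placeholder that returns a made-up score per seed, and the use of the sample standard deviation is an assumption.

```python
import random
from statistics import mean, stdev

def run_once(seed):
    """Hypothetical stand-in for one pipeline run; returns a fake F1 score."""
    rng = random.Random(seed)
    return 0.75 + rng.uniform(-0.05, 0.05)

def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return mean(scores), stdev(scores)

scores = [run_once(seed) for seed in range(25)]
f1_mean, f1_std = summarize(scores)
print(f"F1 = {f1_mean:.3f} ± {f1_std:.3f} over {len(scores)} seeds")
```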

To re-render the presentation figures from a fresh run:

python scripts/make_figures.py    # writes slides/figures/*.png

Repository layout

src/cloud_anomaly/
  config.py            project constants (services, paths, severity bands)
  synthetic_data.py    AWS CUR-style data generator + ground-truth labels
  preprocessing.py     load, aggregate, pivot, gap-fill
  detectors/           zscore, stl, iforest, ensemble — common detect(df) interface
  alerts.py            severity = deviation × duration × $impact
  attribution.py       root-cause hint per alert (region / usage_type)
  evaluation.py        Precision / Recall, alert quality, TTD,
                       cost-saved estimate, bootstrap CI, Wilcoxon test
  forecast.py          Holt-Winters per-service forecast + projection
  theoretical_scores.py proposal a-priori ratings (radar charts)
  benchmark.py         multi-seed Monte Carlo runner
  pipeline.py          run() — wires everything together
dashboard/app.py       Streamlit UI (9 tabs: cost trend / alert log /
                       root-cause / detector comparison / calendar /
                       forecast / lab / replay / raw data)
scripts/
  run_pipeline.py        single-run CLI
  run_benchmark.py       25-seed CLI
  make_figures.py        renders presentation PNGs
slides/
  deck.md                Marp slide deck (renders to PDF/HTML)
  SLIDE_UPDATES.md       per-slide guide for the existing deck
  figures/               4 ready-to-use 16:9 PNGs
examples/                committed sample artifacts
tests/                   smoke tests, run on every CI commit
.github/workflows/ci.yml CI: pytest + pipeline on Python 3.11 and 3.12
data/raw/                generated CUR + labels (gitignored)
outputs/                 run artifacts (gitignored)
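
The layout above summarises alert severity as deviation × duration × $impact. A minimal sketch of that product, followed by a banding step, might look like this. The function names, units, and band thresholds are hypothetical; the layout only says the real bands live in config.py.

```python
def severity_score(deviation, duration_days, impact_usd):
    """Product of relative deviation, anomaly duration, and dollar impact."""
    return deviation * duration_days * impact_usd

# Hypothetical thresholds, highest first; not taken from the project.
BANDS = [(10_000.0, "critical"), (1_000.0, "high"), (100.0, "medium")]

def severity_band(score):
    """Map a raw score to the first band whose threshold it reaches."""
    for threshold, name in BANDS:
        if score >= threshold:
            return name
    return "low"

# Illustrative numbers: a one-day spike 3.91x above baseline costing $762 extra.
score = severity_score(deviation=3.91, duration_days=1, impact_usd=762.0)
print(f"{score:.2f} -> {severity_band(score)}")
```

Multiplying the three factors means a short, cheap blip scores low even when its relative deviation is large, which is one plausible reason for choosing a product over a sum.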

Anomaly types injected

| Type | Description | Example cause |
|---|---|---|
| Point spike | Single-day cost explosion | Infinite loop |
| Level shift | Persistent step up after change | Mis-sized instances |
| Gradual drift | Slow upward creep over a window | Data accumulation |
Each injected anomaly is recorded in data/raw/ground_truth_labels.csv so detector outputs can be evaluated with real Precision / Recall numbers.
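
To make the three anomaly types concrete, here is a small standard-library sketch that injects each one into a flat daily cost series and records a label for it. The function names, magnitudes, and label fields are assumptions for illustration; the project's generator in synthetic_data.py may differ.

```python
import random

def inject_point_spike(series, day, factor=5.0):
    """Multiply a single day's cost by `factor`."""
    out = list(series)
    out[day] *= factor
    return out

def inject_level_shift(series, start, step):
    """Add a constant `step` to every day from `start` onward."""
    return [v + step if i >= start else v for i, v in enumerate(series)]

def inject_gradual_drift(series, start, length, total_increase):
    """Ramp linearly up to `total_increase` across `length` days."""
    out = list(series)
    for i in range(start, min(start + length, len(out))):
        out[i] += total_increase * (i - start + 1) / length
    return out

rng = random.Random(42)
baseline = [100.0 + rng.gauss(0, 2) for _ in range(90)]
spiked = inject_point_spike(baseline, day=30)
shifted = inject_level_shift(baseline, start=45, step=40.0)
drifted = inject_gradual_drift(baseline, start=60, length=20, total_increase=50.0)

# Hypothetical ground-truth records, one per injected anomaly.
labels = [
    {"type": "point_spike", "start_day": 30, "end_day": 30},
    {"type": "level_shift", "start_day": 45, "end_day": 89},
    {"type": "gradual_drift", "start_day": 60, "end_day": 79},
]
```

In this sketch the drift only affects days inside its window; whether the real generator keeps costs elevated afterwards is not stated here.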

Detector outputs (common schema)

Every detector returns a frame with:

| column | type | meaning |
|---|---|---|
| date | datetime | day |
| service | str | AWS service name |
| cost | float | observed cost on that day |
| score | float | anomaly score (higher = stranger) |
| is_anomaly | bool | flagged by the detector |

This is what makes the alert module and evaluation framework detector-agnostic.
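
A toy detector that honours this schema could look like the sketch below. It is a plain global z-score over lists of dicts, written with the standard library so it runs without pandas; the project's own zscore detector operates on data frames and may use a different baseline (for example a rolling window), so treat the threshold and the statistics here as assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev

def zscore_detect(rows, threshold=3.0):
    """Add `score` and `is_anomaly` to rows of {"date", "service", "cost"}."""
    by_service = {}
    for row in rows:
        by_service.setdefault(row["service"], []).append(row["cost"])
    stats = {s: (mean(c), pstdev(c)) for s, c in by_service.items()}

    out = []
    for row in rows:
        mu, sigma = stats[row["service"]]
        # A constant series has sigma == 0; score it as not anomalous.
        score = abs(row["cost"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**row, "score": score, "is_anomaly": score > threshold})
    return out

start = datetime(2025, 3, 1)
costs = [100.0] * 29 + [500.0]          # one obvious spike on the last day
rows = [
    {"date": start + timedelta(days=i), "service": "EC2", "cost": c}
    for i, c in enumerate(costs)
]
detections = zscore_detect(rows)
flagged = [r["date"] for r in detections if r["is_anomaly"]]
```

Because every detector emits the same five columns, downstream code only needs to read score and is_anomaly and never has to know which method produced them.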

Empirical results

Mean ± std across 25 random seeds (python scripts/run_benchmark.py --seeds 25). Full table in examples/benchmark_summary.csv.

F1 by anomaly type

| Detector | Point spike | Level shift | Gradual drift | Overall |
|---|---|---|---|---|
| Z-Score | 0.962 ± 0.078 | 0.012 ± 0.033 | 0.000 ± 0.000 | 0.105 ± 0.018 |
| STL | 0.522 ± 0.082 | 0.616 ± 0.204 | 0.734 ± 0.052 | 0.757 ± 0.064 |
| Isolation Forest | 0.247 ± 0.035 | 0.216 ± 0.060 | 0.217 ± 0.034 | 0.319 ± 0.036 |

Headline takeaways

  • No single method wins all anomaly types — the central thesis of the project is empirically supported.
  • STL is the strongest overall detector (F1 0.757) and the only one that handles trend-based anomalies well (drift 0.734, level shift 0.616).
  • Z-Score is a near-perfect point-spike detector (F1 0.962) but effectively blind to level shifts (0.012) and drift (0.000), as expected from a stationary baseline.
  • Isolation Forest catches every point spike (recall = 1.0 there), but its point-spike F1 of 0.247 implies many false positives. It also struggles to flag persistent shifts because they look "in distribution" once they stabilise, a known limitation of unsupervised tree models on univariate cost data.
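
For reference, precision, recall, and F1 can be computed from flagged days and labelled days as in this sketch. Matching at day granularity is an assumption; the project's evaluation module may match on anomaly windows instead. The toy numbers show how a detector with perfect recall can still land near F1 = 0.25 when it raises many false alarms.

```python
def precision_recall_f1(flagged_days, true_days):
    """Day-level precision, recall, and F1 for one detector."""
    flagged, truth = set(flagged_days), set(true_days)
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

true_days = {10, 40}                            # two injected spikes
flagged_days = {10, 40} | set(range(60, 72))    # both caught, plus 12 false alarms
p, r, f1 = precision_recall_f1(flagged_days, true_days)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```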

Root-cause attribution

For every alert the pipeline produces a one-line, human-readable hint about which CUR dimension drove the spend above its 14-day baseline:

EC2 spend on 2025-03-19 is $957 (+391% vs 14-day baseline); us-east-1 region drove 100% of the increase.

Attribution is computed per (date, service) by decomposing the spend along region and usage_type, comparing against the trailing 14-day per-value baseline, and reporting the dimension+value that contributed most to the anomaly delta. Available in outputs/attribution_{detector}.csv and on the dashboard's Root-cause tab.
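
The decomposition described above can be sketched with plain dictionaries. This is an illustration under assumptions, not the code in attribution.py: the line-item layout, the usage_type names, and the helper names are hypothetical, and the sample numbers are constructed to echo the EC2 example ($195 baseline, $957 on the alert day).

```python
from collections import defaultdict

BASELINE_DAYS = 14

def deltas_by_value(today, trailing, dimension):
    """Today's cost minus the trailing daily average, per dimension value."""
    now, base = defaultdict(float), defaultdict(float)
    for item in today:
        now[item[dimension]] += item["cost"]
    for item in trailing:
        base[item[dimension]] += item["cost"]
    values = set(now) | set(base)
    return {v: now[v] - base[v] / BASELINE_DAYS for v in values}

def top_driver(today, trailing, dimensions=("region", "usage_type")):
    """Return (dimension, value, delta, share of the total increase)."""
    best = None
    for dim in dimensions:
        deltas = deltas_by_value(today, trailing, dim)
        increase = sum(d for d in deltas.values() if d > 0)
        value = max(deltas, key=deltas.get)
        share = deltas[value] / increase if increase > 0 else 0.0
        if best is None or deltas[value] > best[2]:
            best = (dim, value, deltas[value], share)
    return best

one_normal_day = [
    {"region": "us-east-1", "usage_type": "BoxUsage", "cost": 100.0},
    {"region": "us-east-1", "usage_type": "DataTransfer", "cost": 20.0},
    {"region": "eu-west-1", "usage_type": "BoxUsage", "cost": 75.0},
]
trailing = one_normal_day * BASELINE_DAYS
today = [
    {"region": "us-east-1", "usage_type": "BoxUsage", "cost": 600.0},
    {"region": "us-east-1", "usage_type": "DataTransfer", "cost": 282.0},
    {"region": "eu-west-1", "usage_type": "BoxUsage", "cost": 75.0},
]
dim, value, delta, share = top_driver(today, trailing)
print(f"{value} {dim} drove {share:.0%} of the increase (+${delta:.0f})")
```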

This is a Level-1-friendly take on the Level-2 "root-cause attribution" deliverable — concise, deterministic, and immediately useful for FinOps triage.

Running tests

pytest -q

Deploying the dashboard

The Streamlit dashboard is one-click deployable to Streamlit Community Cloud — the easiest path to a live URL for the demo.

  1. Sign in at https://streamlit.io/cloud with your GitHub account.
  2. Click New app, point it at this repository, branch main, main file path: dashboard/app.py.
  3. Python version: 3.11. The platform installs everything from requirements.txt automatically; no extra config is needed.
  4. Once it builds (~3 min), Streamlit publishes a public URL of the form https://<app-name>.streamlit.app. Share it during the demo.

.streamlit/config.toml is committed and pre-configures the dark theme and the brand color, so the deployed instance looks identical to local.

For a containerized deploy (ECS, Cloud Run, Fly.io, Render), see REPORT.md § Cloud architecture.

Docker (one-shot local stack)

docker compose up --build          # dashboard on :8501, REST API on :8000

The compose file boots two services off the same image:

  • dashboard — Streamlit UI (http://localhost:8501).
  • api — FastAPI REST surface (http://localhost:8000, OpenAPI at /docs).

Both mount ./data, ./outputs, and ./examples as volumes so artifacts survive container restarts.

REST API (FastAPI)

The same detection pipeline is also exposed as an HTTP service so it can sit behind API Gateway / ALB in a real cloud deploy.

uvicorn cloud_anomaly.api:app --reload --port 8000

Endpoints:

| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness probe |
| GET | / | Service metadata + detector list |
| POST | /generate | Produce a synthetic dataset (n_days, seed) |
| POST | /detect | Run a detector on supplied long-format JSON |
| POST | /alerts | Detect → severity-band → root-cause attribution |
| GET | /metrics | Multi-detector P/R/F1 against on-disk ground truth |
| GET | /forecast | Holt-Winters per-service forecast (horizon_days) |

Browse the auto-generated OpenAPI docs at /docs (Swagger UI) or /redoc.
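
A client for these endpoints can be sketched with the standard library alone. The paths come from the table above, but the request shapes are assumptions: in particular, sending n_days and seed as a JSON body to /generate is a guess, so check /docs for the real schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_request(method, path, payload=None):
    """Build a request object; JSON-encode the payload if one is given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )

def call(request):
    """Send the request and decode the JSON response (needs the API running)."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

health = build_request("GET", "/health")
# Assumed body shape for /generate; not confirmed by the endpoint table.
generate = build_request("POST", "/generate", {"n_days": 90, "seed": 42})

# With the API started via uvicorn, uncomment to try it:
# print(call(health))
```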

Continuous benchmarking

.github/workflows/benchmark.yml re-runs the 25-seed Monte Carlo every Sunday at 02:00 UTC and uploads outputs/benchmark_summary.csv, outputs/benchmark_raw.csv, and the regenerated presentation figures as a 90-day-retained workflow artifact. Trigger a manual run from the Actions tab if you want fresh numbers ahead of a demo.

Install as a library

Once the first release (v1.0.0 tag) is published, the package will be installable from PyPI:

pip install costsight                  # core: detectors + alerts + attribution
pip install "costsight[dashboard]"     # + Streamlit dashboard deps
pip install "costsight[api]"           # + FastAPI / uvicorn
pip install "costsight[llm]"           # + anthropic SDK for AI explanations
pip install "costsight[dev]"           # everything, plus pytest

Shell commands installed alongside the package:

costsight-pipeline --days 90 --seed 42 --scenario drift_heavy
costsight-benchmark --seeds 25
costsight-api --host 0.0.0.0 --port 8000

Programmatic use:

from cloud_anomaly.synthetic_data import generate
from cloud_anomaly.detectors import DETECTORS
from cloud_anomaly.alerts import build_alerts
from cloud_anomaly.carbon import carbon_footprint

# Generate 90 days of synthetic CUR-style line items plus ground-truth labels.
cur, labels, _ = generate(n_days=90, seed=42)

# Aggregate to one row per (date, service). numeric_only=True skips string
# columns such as region and usage_type instead of concatenating them.
daily = cur.groupby(["date", "service"]).sum(numeric_only=True).reset_index()

detections = DETECTORS["stl"](daily)
alerts = build_alerts(detections, detector_name="stl", dataset_days=90)

carbon = carbon_footprint(cur)
print(f"This run emitted {carbon.kg_co2:.0f} kgCO₂-eq ({carbon.km_driven_equiv:.0f} km equiv).")

Releases are tag-driven: pushing a v1.x.y tag triggers the .github/workflows/release.yml workflow, which builds the sdist + wheel and publishes to PyPI via trusted publishing (no API token in the repo).

Provision the cloud architecture

The production-path architecture documented in REPORT.md § 4.1 is shipped as a real Terraform module under terraform/:

cd terraform/
terraform init
terraform plan -var="env=dev" -var="alert_email=you@example.com"
terraform apply -var="env=dev" -var="alert_email=you@example.com"

Brings up: S3 raw + aggregated buckets, DynamoDB alerts table (PITR + TTL), SNS alerts topic with optional email subscription, ingest Lambda + S3 trigger, and (optionally) a self-hosted dashboard ECS service. Steady-state cost ~$5/mo per tenant at the default toggles.

Scope

  • Phase 1 (May 20 deadline): synthetic data, three detectors plus an ensemble vote, alert module, root-cause attribution, P/R evaluation, multi-seed benchmark, dashboard with calendar / forecast / lab / replay tabs, statistical significance tests.
  • Phase 2 (post-finals): comparison report extension, paper-style writeup.
  • Out of scope: real-time streaming, multi-cloud ingestion, production deployment of the detection pipeline (the dashboard is deployable; the pipeline remains batch).

License

MIT — see also CONTRIBUTING.md for how to extend the project with new detectors or anomaly types.

Authors

  • Furkan Can Karafil (@Urthella) · 222010020013
  • Halil Utku Demirtaş · 222010020054
