
feat(data-analytics-demo): T-04 + T-05 dbt staging/intermediate/marts#86

Merged
leagames0221-sys merged 1 commit into main from feat/data-analytics-demo-t04-t05-dbt on May 17, 2026

@leagames0221-sys

Owner

Summary

Phase 2 of the data-analytics-demo bolt-on. Ships the dbt transformation layer that turns the 4 synthetic SaaS source tables (produced in PR #83) into 4 mart tables that the ML, dashboard, and semantic layers consume next.

What lands

| Layer | Files | Materialization |
| --- | --- | --- |
| dbt project config | dbt_project.yml, profiles.yml, models/staging/_sources.yml | |
| staging (T-04) | stg_customers / stg_subscriptions / stg_events / stg_invoices | view × 4 |
| intermediate (T-04) | int_customer_features, int_event_aggregates | view × 2 |
| marts (T-05) | rfm_segments, churn_features, upsell_opportunities, cohort_retention | table × 4 |
| schema tests (T-05) | marts/schema.yml (not_null + unique + accepted_values, new dbt generic-test argument syntax) | 20 tests |

AC coverage

| AC | What it asks | How this PR satisfies |
| --- | --- | --- |
| 2.1 | make dbt produces 3 marts (+ cohort_retention) | 4 mart tables created |
| 2.2 | each mart ≥ 1 row | rfm 1000, churn 1000, upsell 898, cohort 319 |
| 2.3 | model compile failure → non-zero exit | dbt run propagates exit code through Makefile |
| 2.4 | dbt tests defined per model, all execute | marts/schema.yml runs 20 tests against 4 marts |
| 2.5 | 3-tier layout (staging / intermediate / marts) | yes, with materialization config per tier in dbt_project.yml |
| 2.6 | upsell_opportunities mart produced | yes, with premium_event_count / advanced_event_count features + upgraded label |

Local verify (Python 3.12 venv + dbt-duckdb 1.10.1)

dbt run    → 10/10 OK in 4.45s (4 staging views + 2 int views + 4 mart tables)
dbt test   → 20/20 PASS in 0.52s (not_null + unique + accepted_values)
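
As a rough picture of what the three generic test types assert, here is a minimal Python sketch over in-memory rows. It is not dbt's implementation (dbt compiles each test to SQL); the function names and the sample rows are hypothetical, and only the five segment names are taken from this PR.

```python
from collections import Counter
from typing import Any, Iterable

Row = dict[str, Any]


def not_null_failures(rows: Iterable[Row], column: str) -> int:
    """Count rows whose value in `column` is missing (None)."""
    return sum(1 for row in rows if row.get(column) is None)


def unique_failures(rows: Iterable[Row], column: str) -> int:
    """Count distinct non-null values that appear more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sum(1 for n in counts.values() if n > 1)


def accepted_values_failures(rows: Iterable[Row], column: str, accepted: set[Any]) -> int:
    """Count rows whose non-null value falls outside the accepted set."""
    return sum(
        1
        for row in rows
        if row.get(column) is not None and row[column] not in accepted
    )


# Segment names as listed in this PR; the rows themselves are made up.
SEGMENTS = {"champions", "big_spenders", "loyal", "at_risk", "regular"}
sample = [
    {"customer_id": 1, "segment": "champions"},
    {"customer_id": 2, "segment": "regular"},
    {"customer_id": 2, "segment": "vip"},       # duplicate id, unknown segment
    {"customer_id": None, "segment": "loyal"},  # missing id
]
```

A test "passes" in this picture when its failure count is zero, which is the sense in which the 20 tests above all pass.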

Sanity on engineered ML signals (matches generator design):

  • Churn rate (canceled subscriptions): 26.2% (designed ~25%)
  • Upsell rate (initial-free/pro → higher tier): 35.5%
  • RFM segments distributed across 5 buckets: champions 76 / big_spenders 131 / loyal 197 / at_risk 188 / regular 408

Python + repo checks unchanged:

  • ruff / mypy / pytest (8 PASS, cov 83.92%)
  • check-doc-drift.mjs 0 failures
  • check-adr-claims.mjs 77/77 PASS

Test plan

  • All required status checks pass (existing 11 + python-test + python-audit)
  • No new HIVE-token leaks (D-HIVE-OPACITY)

Phase 2 of the data-analytics-demo bolt-on. Ships the dbt layer that
transforms the 4 synthetic SaaS source tables into 4 mart tables ready for
the ML and dashboard layers.

T-04 — dbt scaffold + staging + intermediate (AC-2.5):
- dbt_project.yml + profiles.yml (DuckDB target reads
  ../warehouse/analytics.duckdb)
- models/staging/_sources.yml declares the 4 raw tables
- models/staging/stg_{customers,subscriptions,events,invoices}.sql (views,
  thin pass-through + dtype casting)
- models/intermediate/int_customer_features.sql (one row per customer
  combining latest subscription + lifetime event stats + invoice totals)
- models/intermediate/int_event_aggregates.sql (per-customer × event_type
  rollup powering both churn and upsell marts)
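
To illustrate the per-customer × event_type rollup described for int_event_aggregates, here is a minimal Python sketch of the same grouping. The actual model is SQL and is not shown in this PR; the `customer_id` column and the output field names are assumptions, while `event_type` and `event_at` appear in the commit message.

```python
from collections import defaultdict
from datetime import datetime
from typing import TypedDict


class Event(TypedDict):
    customer_id: int   # assumed key column
    event_type: str
    event_at: datetime


def rollup_events(events: list[Event]) -> dict[tuple[int, str], dict[str, object]]:
    """Aggregate raw events to one entry per (customer_id, event_type)."""
    grouped: dict[tuple[int, str], list[datetime]] = defaultdict(list)
    for event in events:
        grouped[(event["customer_id"], event["event_type"])].append(event["event_at"])
    return {
        key: {
            "event_count": len(stamps),
            "first_event_at": min(stamps),
            "last_event_at": max(stamps),
        }
        for key, stamps in grouped.items()
    }


# Hypothetical events; the event type names are placeholders.
events: list[Event] = [
    {"customer_id": 1, "event_type": "premium", "event_at": datetime(2026, 1, 5)},
    {"customer_id": 1, "event_type": "premium", "event_at": datetime(2026, 2, 9)},
    {"customer_id": 1, "event_type": "login", "event_at": datetime(2026, 2, 10)},
    {"customer_id": 2, "event_type": "login", "event_at": datetime(2026, 3, 1)},
]
rollup = rollup_events(events)
```

Keeping the grain at (customer, event type) lets downstream marts pick the event types they care about without rescanning the raw events.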

T-05 — marts + schema tests (AC-2.1, AC-2.2, AC-2.3, AC-2.4, AC-2.6):
- marts/rfm_segments.sql (R/F/M quintile scoring + 5-bucket label;
  self-anchored to max(event_at) so it's reproducible per seed)
- marts/churn_features.sql (one row per customer; trailing-30d vs lifetime
  daily-avg ratio is the engineered churn signal label)
- marts/upsell_opportunities.sql (one row per free/pro customer; premium /
  advanced event counts as engineered signal; upgraded label)
- marts/cohort_retention.sql (monthly cohort × months-since-signup grid)
- marts/schema.yml (not_null + unique + accepted_values tests, new dbt
  generic-test argument syntax)
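
The R/F/M quintile scoring and self-anchoring described for rfm_segments can be sketched in Python as follows. This is an illustration under assumptions, not the mart's SQL: the tiling mimics SQL `NTILE(5)`, and the thresholds in `label` are hypothetical, since the PR names the five segments but not the rules that assign them.

```python
from datetime import date


def ntile(values: list[float], buckets: int = 5) -> list[int]:
    """Assign each value a tile 1..buckets by ascending rank (NTILE-style)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    base, extra = divmod(len(values), buckets)
    tiles = [0] * len(values)
    pos = 0
    for tile in range(1, buckets + 1):
        size = base + (1 if tile <= extra else 0)  # early tiles absorb the remainder
        for i in order[pos:pos + size]:
            tiles[i] = tile
        pos += size
    return tiles


def rfm_scores(
    last_event: list[date], frequency: list[int], monetary: list[float]
) -> list[tuple[int, int, int]]:
    """Score each customer 1..5 on recency, frequency, and monetary value."""
    anchor = max(last_event)  # self-anchored: independent of today's date
    recency_days = [(anchor - d).days for d in last_event]
    # Fewer days since the last event is better, so rank on the negated value.
    r = ntile([-float(x) for x in recency_days])
    f = ntile([float(x) for x in frequency])
    m = ntile(monetary)
    return list(zip(r, f, m))


def label(r: int, f: int, m: int) -> str:
    """Map scores to a segment. Thresholds are hypothetical."""
    if r >= 4 and f >= 4 and m >= 4:
        return "champions"
    if m >= 4:
        return "big_spenders"
    if f >= 4:
        return "loyal"
    if r <= 2:
        return "at_risk"
    return "regular"


# Made-up customers, ordered from least to most engaged.
last_event = [date(2026, 1, 1), date(2026, 2, 1), date(2026, 3, 1),
              date(2026, 4, 1), date(2026, 5, 1)]
scores = rfm_scores(last_event, [1, 2, 3, 4, 5], [10.0, 20.0, 30.0, 40.0, 50.0])
```

Because recency is measured from the latest event in the data rather than from the current date, rerunning on the same seeded warehouse yields the same scores.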

Makefile: `dbt` target now runs `dbt run && dbt test` from dbt_project/
with DBT_PROFILES_DIR=. so the project ships its own profile.

.gitignore: add dbt_project/.user.yml (dbt's per-developer anonymous tracking
UUID; not portfolio-relevant).

Local verify on seeded DuckDB (n_customers=1000):
- dbt parse        → OK (deprecation warnings cleared)
- dbt run          → 10/10 OK (4 staging views + 2 int views + 4 mart tables)
- dbt test         → 20/20 PASS (not_null + unique + accepted_values)
- Mart rowcounts   → rfm 1000, churn 1000, upsell 898, cohort 319
- Sanity numbers   → churn rate 26.2% (matches synth ~25% canceled), upsell
                      rate 35.5%, RFM segments distributed across 5 buckets
- Python verify    → ruff OK / mypy OK / pytest 8 PASS (cov 83.92%)
- doc-drift + adr-claims → 0 failures / 77/77 PASS
@vercel

vercel Bot commented May 17, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| craftstack-collab | Ready | Preview, Comment | May 17, 2026 1:16pm |
| craftstack-knowledge | Ready | Preview, Comment | May 17, 2026 1:16pm |

leagames0221-sys merged commit 56536ea into main on May 17, 2026 (12 checks passed)
leagames0221-sys deleted the feat/data-analytics-demo-t04-t05-dbt branch on May 17, 2026 at 13:18
leagames0221-sys added a commit that referenced this pull request May 17, 2026
Phase 3 of the data-analytics-demo bolt-on. Trains two propensity models
on the dbt marts shipped in #86 and saves the resulting model artifacts +
SHAP summary that the narrative layer (T-08) consumes next.

T-06 — Churn pipeline (AC-3.1 to AC-3.5):
- ml/churn.py — fits a LogisticRegression baseline AND an XGBoost
  classifier on `churn_features`, picks the higher hold-out ROC-AUC, and
  saves model.pkl + metadata.json + shap_summary.json.
- ml/explain.py — SHAP wrapper used by both the churn and (later)
  narrative paths. TreeExplainer first, falls back to model-agnostic.
- ml/_io.py — shared mart loader, fails with clear errors when the
  warehouse / mart is missing (AC-3.4).
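
The "fit two models, keep the one with the higher hold-out ROC-AUC" step can be sketched without any ML library. The snippet below is an illustration, not the code in ml/churn.py: it computes ROC-AUC from its pairwise definition (the probability that a positive outscores a negative, with ties counted as half), and the candidate names and scores are made up.

```python
def roc_auc(y_true: list[int], scores: list[float]) -> float:
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pick_best(y_true: list[int], candidates: dict[str, list[float]]) -> tuple[str, float]:
    """Return the candidate name with the highest hold-out ROC-AUC."""
    aucs = {name: roc_auc(y_true, scores) for name, scores in candidates.items()}
    best = max(aucs, key=aucs.__getitem__)
    return best, aucs[best]


# Hypothetical hold-out labels and predicted scores for two candidates.
y_holdout = [0, 0, 1, 1]
candidates = {
    "model_a": [0.10, 0.40, 0.35, 0.80],  # one positive ranked below a negative
    "model_b": [0.10, 0.20, 0.30, 0.40],  # perfect ranking
}
best_name, best_auc = pick_best(y_holdout, candidates)
```

The pairwise form is quadratic in the number of rows, which is fine for a sketch; a library implementation would typically use a rank-based computation instead.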

T-07 — Upsell propensity (AC-3.6 to AC-3.7):
- ml/upsell.py — fits a LogisticRegression propensity model on
  `upsell_opportunities`, measures hold-out ROC-AUC and lift @ top-10%,
  raises if the lift falls below the 1.5× floor.
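
Lift at the top 10% compares the positive rate among the highest-scored tenth of customers with the overall positive rate. A minimal sketch of that metric and of the 1.5× floor check follows; the function names and the sample data are hypothetical, and the exact definition used in ml/upsell.py is not shown in this PR.

```python
def lift_at_top(y_true: list[int], scores: list[float], top_percent: int = 10) -> float:
    """Positive rate in the top `top_percent`% by score, divided by the base rate."""
    if not y_true or len(y_true) != len(scores):
        raise ValueError("y_true and scores must be non-empty and the same length")
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    k = max(1, len(y_true) * top_percent // 100)  # integer math avoids float rounding
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:k]) / k
    return top_rate / base_rate


LIFT_FLOOR = 1.5  # floor quoted in the commit message


def check_lift(lift: float, floor: float = LIFT_FLOOR) -> float:
    """Raise if the measured lift is below the floor; otherwise return it."""
    if lift < floor:
        raise ValueError(f"lift {lift:.2f} is below the {floor:.2f} floor")
    return lift


# Made-up hold-out set: 20 customers, 4 upgraders, scores in descending order.
y_true = [1, 1, 1, 1] + [0] * 16
good_scores = [1.0 - 0.05 * i for i in range(20)]  # upgraders ranked first
bad_scores = list(reversed(good_scores))           # upgraders ranked last
```

With the good ranking, the top two customers are both upgraders against a 20% base rate, so the lift is 5×; with the reversed ranking the lift is 0 and `check_lift` raises.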

Data-generator amendment: the churn signal in `data/generate.py` was
under-engineered (best ROC-AUC was 0.6972, just below the AC-3.2 0.70
floor). Reworked the event generator so churned customers (a) get 4×
lower event weight and (b) have their timestamps biased into the older
half of the history window. The mart's `recent_to_lifetime_ratio` feature
now correlates cleanly with the cancel label, pushing churn ROC-AUC to
0.7448 on a seed=42 / n_customers=1000 run.
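
To see why biasing churned customers' events into the older half of the window moves `recent_to_lifetime_ratio`, consider this minimal sketch. The formula is an assumption based on the "trailing-30d vs lifetime daily-avg ratio" wording (the mart's SQL is not shown), and the two sample customers are made up.

```python
from datetime import datetime, timedelta


def recent_to_lifetime_ratio(
    event_times: list[datetime], anchor: datetime, window_days: int = 30
) -> float:
    """Trailing-window daily event average divided by the lifetime daily average."""
    if not event_times:
        return 0.0
    # Lifetime is assumed to run from the first event to the anchor, at least 1 day.
    lifetime_days = max((anchor - min(event_times)).days, 1)
    cutoff = anchor - timedelta(days=window_days)
    recent_count = sum(1 for t in event_times if t > cutoff)
    recent_daily = recent_count / window_days
    lifetime_daily = len(event_times) / lifetime_days
    return recent_daily / lifetime_daily


anchor = datetime(2026, 5, 1)
# Steady customer: one event per day over the last 300 days.
steady = [anchor - timedelta(days=i) for i in range(300)]
# Churn-like customer: activity only in the older half of the window.
faded = [anchor - timedelta(days=i) for i in range(150, 300)]

steady_ratio = recent_to_lifetime_ratio(steady, anchor)
faded_ratio = recent_to_lifetime_ratio(faded, anchor)
```

A steady customer lands near 1.0, while a customer whose events all sit in the older half lands at 0.0, which is the kind of separation the amended generator is meant to produce.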

Local verify (Python 3.12 venv, deterministic seed=42):
- `make data` + `make dbt` + `make ml` end-to-end OK
- Churn ROC-AUC = 0.7448 (LR wins; XGBoost 0.7196), AC-3.2 PASS
- Upsell lift @ top-10% = 2.81× (vs 1.5× floor), AC-3.7 PASS
- ruff OK / mypy OK / pytest 15 PASS, coverage 86.75%
- doc-drift 0 fail / adr-claims 77/77

Test infra: switched from `subprocess.run(["dbt", ...])` to
`dbt.cli.main.dbtRunner` so the fixtures work on Windows without the venv's
Scripts directory being on PATH. Opening DuckDB in read-write mode for both
dbt and ml avoids the "different configuration" connection error when both
run in the same process.

Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>
leagames0221-sys added a commit that referenced this pull request May 21, 2026
…#86)

leagames0221-sys added a commit that referenced this pull request May 21, 2026