feat(data-analytics-demo): T-04 + T-05 dbt staging/intermediate/marts by leagames0221-sys · Pull Request #86 · leagames0221-sys/craftstack

leagames0221-sys · 2026-05-17T13:15:11Z

Summary

Phase 2 of the data-analytics-demo bolt-on. Ships the dbt transformation layer that turns the 4 synthetic SaaS source tables (produced in PR #83) into 4 mart tables that the ML, dashboard, and semantic layers consume next.

What lands

| Layer | Files | Materialization |
| --- | --- | --- |
| dbt project config | `dbt_project.yml`, `profiles.yml`, `models/staging/_sources.yml` | — |
| staging (T-04) | `stg_customers` / `stg_subscriptions` / `stg_events` / `stg_invoices` | view × 4 |
| intermediate (T-04) | `int_customer_features`, `int_event_aggregates` | view × 2 |
| marts (T-05) | `rfm_segments`, `churn_features`, `upsell_opportunities`, `cohort_retention` | table × 4 |
| schema tests (T-05) | `marts/schema.yml` (not_null + unique + accepted_values, new dbt generic-test argument syntax) | 20 tests |
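
As a rough illustration of what a monthly cohort × months-since-signup grid such as `cohort_retention` computes, here is a minimal Python sketch. It is not the mart's SQL. The input shapes, the function names `month_index` and `cohort_retention`, and the rule that a customer counts as active in a month when they have at least one event in it are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def month_index(d: date) -> int:
    """Map a date to a running month number so month differences are simple subtraction."""
    return d.year * 12 + d.month


def cohort_retention(
    signups: dict[str, date],
    events: list[tuple[str, date]],
) -> dict[tuple[str, int], int]:
    """Count distinct active customers per (signup month, months since signup)."""
    active: dict[tuple[str, int], set[str]] = defaultdict(set)
    for customer_id, event_date in events:
        signup = signups[customer_id]
        offset = month_index(event_date) - month_index(signup)
        if offset < 0:
            continue  # ignore events dated before signup (assumed to be bad data)
        cohort = f"{signup.year:04d}-{signup.month:02d}"
        active[(cohort, offset)].add(customer_id)
    return {key: len(ids) for key, ids in sorted(active.items())}


# Hypothetical toy data, not taken from the synthetic warehouse.
signups = {"a": date(2026, 1, 5), "b": date(2026, 1, 20), "c": date(2026, 2, 3)}
events = [
    ("a", date(2026, 1, 6)),
    ("a", date(2026, 2, 10)),
    ("b", date(2026, 1, 25)),
    ("b", date(2026, 1, 28)),
    ("c", date(2026, 2, 4)),
    ("c", date(2026, 4, 1)),
]
grid = cohort_retention(signups, events)
```

Each key is a (cohort, offset) cell, so a sparse grid such as this one simply omits cells where nobody was active.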

AC coverage

| AC | What it asks | How this PR satisfies |
| --- | --- | --- |
| 2.1 | `make dbt` produces 3 marts (+ `cohort_retention`) | 4 mart tables created |
| 2.2 | each mart ≥ 1 row | rfm 1000, churn 1000, upsell 898, cohort 319 |
| 2.3 | model compile failure → non-zero exit | dbt run propagates exit code through Makefile |
| 2.4 | dbt tests defined per model, all execute | `marts/schema.yml` runs 20 tests against 4 marts |
| 2.5 | 3-tier layout (staging / intermediate / marts) | yes, with materialization config per tier in `dbt_project.yml` |
| 2.6 | `upsell_opportunities` mart produced | yes, with `premium_event_count` / `advanced_event_count` features + `upgraded` label |
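
The exit-code behavior behind AC 2.3 can be sketched in Python: run the steps in order and stop at the first non-zero exit code, which is what a shell `a && b` chain does. This is only an illustration of the propagation rule, not the project's Makefile. The `run_chain` helper is hypothetical, and the two stand-in commands use the current Python interpreter in place of the real `dbt` invocations.

```python
import subprocess
import sys


def run_chain(commands: list[list[str]]) -> int:
    """Run commands in order; return the first non-zero exit code, or 0 if all succeed."""
    for cmd in commands:
        code = subprocess.run(cmd, check=False).returncode
        if code != 0:
            return code  # later commands are skipped, as with `&&` in a shell
    return 0


# Stand-in commands (hypothetical): a real target would invoke dbt here.
ok = [sys.executable, "-c", "pass"]
fail = [sys.executable, "-c", "import sys; sys.exit(2)"]

all_good = run_chain([ok, ok])
first_fails = run_chain([fail, ok])
```

A caller that ends with `sys.exit(run_chain(...))` would hand the failing code back to `make`, which then fails the target.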

Local verify (Python 3.12 venv + dbt-duckdb 1.10.1)

dbt run    → 10/10 OK in 4.45s (4 staging views + 2 int views + 4 mart tables)
dbt test   → 20/20 PASS in 0.52s (not_null + unique + accepted_values)

Sanity on engineered ML signals (matches generator design):

Churn rate (canceled subscriptions): 26.2% (designed ~25%)
Upsell rate (initial-free/pro → higher tier): 35.5%
RFM segments distributed across 5 buckets: champions 76 / big_spenders 131 / loyal 197 / at_risk 188 / regular 408
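
For readers unfamiliar with R/F/M quintile scoring, the following Python sketch shows the general idea: rank customers on recency, frequency, and monetary value, split each ranking into five buckets, and measure recency against the latest event in the data rather than against today's date so the result stays stable for a given dataset. It is a sketch under assumptions, not the `rfm_segments` SQL. The `ntile` and `rfm_scores` helpers and the toy data are hypothetical, and the mapping from scores to the five segment labels is left out because the PR does not state it.

```python
from datetime import date


def ntile(values: list[float], n: int = 5, reverse: bool = False) -> list[int]:
    """Bucket values into 1..n by rank, like SQL NTILE (early buckets take the remainder)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    base, extra = divmod(len(values), n)
    scores = [0] * len(values)
    pos = 0
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        for i in order[pos:pos + size]:
            scores[i] = bucket
        pos += size
    return scores


def rfm_scores(
    customers: dict[str, tuple[date, int, float]],
) -> dict[str, tuple[int, int, int]]:
    """Return (R, F, M) scores per customer; 5 is best on every axis."""
    ids = list(customers)
    # Anchor recency to the newest event in the data, not to the wall clock.
    anchor = max(last_event for last_event, _, _ in customers.values())
    recency = [(anchor - customers[c][0]).days for c in ids]
    frequency = [customers[c][1] for c in ids]
    monetary = [customers[c][2] for c in ids]
    r = ntile(recency, reverse=True)  # fewest days since last event scores 5
    f = ntile(frequency)
    m = ntile(monetary)
    return {c: (r[k], f[k], m[k]) for k, c in enumerate(ids)}


# Hypothetical toy data: (last event date, event count, invoice total).
customers = {
    "c1": (date(2026, 5, 1), 40, 900.0),
    "c2": (date(2026, 4, 1), 5, 50.0),
    "c3": (date(2026, 5, 10), 12, 300.0),
    "c4": (date(2026, 1, 15), 2, 20.0),
    "c5": (date(2026, 3, 20), 25, 1200.0),
}
scores = rfm_scores(customers)
```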

Python + repo checks unchanged:

ruff / mypy / pytest (8 PASS, cov 83.92%)
check-doc-drift.mjs 0 failures
check-adr-claims.mjs 77/77 PASS

Test plan

All required status checks pass (existing 11 + python-test + python-audit)
No new HIVE-token leaks (D-HIVE-OPACITY)

Phase 2 of the data-analytics-demo bolt-on. Ships the dbt layer that transforms the 4 synthetic SaaS source tables into 4 mart tables ready for the ML and dashboard layers.

T-04 — dbt scaffold + staging + intermediate (AC-2.5):
- dbt_project.yml + profiles.yml (DuckDB target reads ../warehouse/analytics.duckdb)
- models/staging/_sources.yml declares the 4 raw tables
- models/staging/stg_{customers,subscriptions,events,invoices}.sql (views, thin pass-through + dtype casting)
- models/intermediate/int_customer_features.sql (one row per customer combining latest subscription + lifetime event stats + invoice totals)
- models/intermediate/int_event_aggregates.sql (per-customer × event_type rollup powering both churn and upsell marts)

T-05 — marts + schema tests (AC-2.1, AC-2.2, AC-2.3, AC-2.4, AC-2.6):
- marts/rfm_segments.sql (R/F/M quintile scoring + 5-bucket label; self-anchored to max(event_at) so it's reproducible per seed)
- marts/churn_features.sql (one row per customer; trailing-30d vs lifetime daily-avg ratio is the engineered churn signal label)
- marts/upsell_opportunities.sql (one row per free/pro customer; premium / advanced event counts as engineered signal; upgraded label)
- marts/cohort_retention.sql (monthly cohort × months-since-signup grid)
- marts/schema.yml (not_null + unique + accepted_values tests, new dbt generic-test argument syntax)

Makefile: `dbt` target now runs `dbt run && dbt test` from dbt_project/ with DBT_PROFILES_DIR=. so the project ships its own profile.

.gitignore: add dbt_project/.user.yml (dbt's per-developer anonymous tracking UUID; not portfolio-relevant).

Local verify on seeded DuckDB (n_customers=1000):
- dbt parse → OK (deprecation warnings cleared)
- dbt run → 10/10 OK (4 staging views + 2 int views + 4 mart tables)
- dbt test → 20/20 PASS (not_null + unique + accepted_values)
- Mart rowcounts → rfm 1000, churn 1000, upsell 898, cohort 319
- Sanity numbers → churn rate 26.2% (matches synth ~25% canceled), upsell rate 35.5%, RFM segments distributed across 5 buckets
- Python verify → ruff OK / mypy OK / pytest 8 PASS (cov 83.92%)
- doc-drift + adr-claims → 0 failure / 77/77 PASS
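
The trailing-30d vs lifetime daily-average ratio mentioned for `churn_features` can be illustrated with a short Python sketch. This is one plausible reading of that description, not the mart's SQL: the function signature, the half-open 30-day window ending at the anchor date, and the zero-activity fallback are assumptions for the example.

```python
from datetime import date, timedelta


def recent_to_lifetime_ratio(
    event_dates: list[date],
    signup: date,
    anchor: date,
    window_days: int = 30,
) -> float:
    """Ratio of the trailing-window daily event rate to the lifetime daily event rate."""
    lifetime_days = max((anchor - signup).days, 1)  # guard against same-day signups
    lifetime_daily = len(event_dates) / lifetime_days
    if lifetime_daily == 0:
        return 0.0  # assumed fallback for customers with no events at all
    cutoff = anchor - timedelta(days=window_days)
    recent = sum(1 for d in event_dates if cutoff < d <= anchor)
    return (recent / window_days) / lifetime_daily


# Hypothetical toy data: 120 days of history, 12 events per customer.
signup = date(2026, 1, 1)
anchor = signup + timedelta(days=120)
steady = [signup + timedelta(days=10 * k) for k in range(1, 13)]  # one event every 10 days
fading = [signup + timedelta(days=k) for k in range(1, 13)]       # all activity in the first 12 days

steady_ratio = recent_to_lifetime_ratio(steady, signup, anchor)
fading_ratio = recent_to_lifetime_ratio(fading, signup, anchor)
```

A ratio near 1 means recent activity matches the customer's long-run pace, while a ratio near 0 means activity has dried up, which is the pattern a churn model would be expected to pick up on.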

vercel · 2026-05-17T13:15:16Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| craftstack-collab | Ready | Preview, Comment | May 17, 2026 1:16pm |
| craftstack-knowledge | Ready | Preview, Comment | May 17, 2026 1:16pm |

Phase 3 of the data-analytics-demo bolt-on. Trains two propensity models on the dbt marts shipped in #86 and saves the resulting model artifacts + SHAP summary that the narrative layer (T-08) consumes next.

T-06 — Churn pipeline (AC-3.1 to 3.5):
- ml/churn.py — fits a LogisticRegression baseline AND an XGBoost classifier on `churn_features`, picks the higher hold-out ROC-AUC, and saves model.pkl + metadata.json + shap_summary.json.
- ml/explain.py — SHAP wrapper used by both the churn and (later) narrative paths. TreeExplainer first, falls back to model-agnostic.
- ml/_io.py — shared mart loader, fails with clear errors when the warehouse / mart is missing (AC-3.4).

T-07 — Upsell propensity (AC-3.6 to 3.7):
- ml/upsell.py — fits a LogisticRegression propensity model on `upsell_opportunities`, measures hold-out ROC-AUC and lift @ top-10%, raises if the lift falls below the 1.5× floor.

Data-generator amendment: the churn signal in `data/generate.py` was under-engineered (best ROC-AUC was 0.6972, just below the AC-3.2 0.70 floor). Reworked the event generator so churned customers (a) get 4× lower event weight and (b) have their timestamps biased into the older half of the history window. The mart's `recent_to_lifetime_ratio` feature now correlates cleanly with the cancel label, pushing churn ROC-AUC to 0.7448 on a seed=42 / n_customers=1000 run.

Local verify (Python 3.12 venv, deterministic seed=42):
- `make data` + `make dbt` + `make ml` end-to-end OK
- Churn ROC-AUC = 0.7448 (LR wins; XGBoost 0.7196), AC-3.2 PASS
- Upsell lift @ top-10% = 2.81× (vs 1.5× floor), AC-3.7 PASS
- ruff OK / mypy OK / pytest 15 PASS, coverage 86.75%
- doc-drift 0 fail / adr-claims 77/77

Test infra: switched from `subprocess.run(["dbt", ...])` to `dbt.cli.main.dbtRunner` so the fixtures work on Windows without venv Scripts being on PATH. DuckDB rw-mode for both dbt + ml avoids the "different configuration" connection error when both run in-process.
Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>
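
The lift @ top-10% check described for the upsell model can be sketched as follows. Lift here is the positive rate among the highest-scored 10% of customers divided by the overall positive rate. This is a minimal illustration rather than the code in `ml/upsell.py`: the helper names, the rounding-down of the top-decile size, and the use of `ValueError` are assumptions. Only the 1.5× floor comes from the commit message.

```python
def lift_at_top(y_true: list[int], scores: list[float], fraction: float = 0.10) -> float:
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("y_true and scores must be non-empty and the same length")
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    n_top = max(1, int(len(scores) * fraction))  # round down, but keep at least one row
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / base_rate


def check_lift_floor(lift: float, floor: float = 1.5) -> None:
    """Raise when the measured lift falls below the acceptance floor."""
    if lift < floor:
        raise ValueError(f"lift {lift:.2f}x is below the {floor:.1f}x floor")


# Hypothetical toy data: 20 customers, 5 upgrades, both top-scored customers upgraded.
scores = [i / 20 for i in range(20)]
y_true = [1 if i in {1, 5, 10, 18, 19} else 0 for i in range(20)]

lift = lift_at_top(y_true, scores)
check_lift_floor(lift)  # passes silently when the floor is met
```

With a 25% base rate and a 100% hit rate in the top two customers, the toy lift is 4×. A model that ranks no better than chance would sit near 1×, which is why a floor above 1 is a meaningful gate.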


vercel Bot deployed to Preview – craftstack-knowledge May 17, 2026 13:15 View deployment

vercel Bot deployed to Preview – craftstack-collab May 17, 2026 13:16 View deployment

leagames0221-sys merged commit 56536ea into main May 17, 2026
12 checks passed

leagames0221-sys deleted the feat/data-analytics-demo-t04-t05-dbt branch May 17, 2026 13:18

leagames0221-sys mentioned this pull request May 17, 2026

feat(data-analytics-demo): T-06 churn + T-07 upsell ML pipelines #87

Merged

2 tasks

leagames0221-sys mentioned this pull request May 17, 2026

feat(data-analytics-demo): T-13 docs + T-14 changelog/handoff (Stage 4 close-out) #92

Merged

2 tasks

leagames0221-sys merged 1 commit into main from feat/data-analytics-demo-t04-t05-dbt
