feat(data-analytics-demo): T-04 + T-05 dbt staging/intermediate/marts#86
Merged
Merged
Conversation
Phase 2 of the data-analytics-demo bolt-on. Ships the dbt layer that
transforms the 4 synthetic SaaS source tables into 4 mart tables ready for
the ML and dashboard layers.
T-04 — dbt scaffold + staging + intermediate (AC-2.5):
- dbt_project.yml + profiles.yml (DuckDB target reads
../warehouse/analytics.duckdb)
- models/staging/_sources.yml declares the 4 raw tables
- models/staging/stg_{customers,subscriptions,events,invoices}.sql (views,
thin pass-through + dtype casting)
- models/intermediate/int_customer_features.sql (one row per customer
combining latest subscription + lifetime event stats + invoice totals)
- models/intermediate/int_event_aggregates.sql (per-customer × event_type
rollup powering both churn and upsell marts)
T-05 — marts + schema tests (AC-2.1, AC-2.2, AC-2.3, AC-2.4, AC-2.6):
- marts/rfm_segments.sql (R/F/M quintile scoring + 5-bucket label;
self-anchored to max(event_at) so it's reproducible per seed)
- marts/churn_features.sql (one row per customer; trailing-30d vs lifetime
daily-avg ratio is the engineered churn signal label)
- marts/upsell_opportunities.sql (one row per free/pro customer; premium /
advanced event counts as engineered signal; upgraded label)
- marts/cohort_retention.sql (monthly cohort × months-since-signup grid)
- marts/schema.yml (not_null + unique + accepted_values tests, new dbt
generic-test argument syntax)
Makefile: `dbt` target now runs `dbt run && dbt test` from dbt_project/
with DBT_PROFILES_DIR=. so the project ships its own profile.
.gitignore: add dbt_project/.user.yml (dbt's per-developer anonymous tracking
UUID; not portfolio-relevant).
Local verify on seeded DuckDB (n_customers=1000):
- dbt parse → OK (deprecation warnings cleared)
- dbt run → 10/10 OK (4 staging views + 2 int views + 4 mart tables)
- dbt test → 20/20 PASS (not_null + unique + accepted_values)
- Mart rowcounts → rfm 1000, churn 1000, upsell 898, cohort 319
- Sanity numbers → churn rate 26.2% (matches synth ~25% canceled), upsell
rate 35.5%, RFM segments distributed across 5 buckets
- Python verify → ruff OK / mypy OK / pytest 8 PASS (cov 83.92%)
- doc-drift + adr-claims → 0 failure / 77/77 PASS
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
2 tasks
leagames0221-sys
added a commit
that referenced
this pull request
May 17, 2026
Phase 3 of the data-analytics-demo bolt-on. Trains two propensity models on the dbt marts shipped in #86 and saves the resulting model artifacts + SHAP summary that the narrative layer (T-08) consumes next. T-06 — Churn pipeline (AC-3.1〜3.5): - ml/churn.py — fits a LogisticRegression baseline AND an XGBoost classifier on `churn_features`, picks the higher hold-out ROC-AUC, and saves model.pkl + metadata.json + shap_summary.json. - ml/explain.py — SHAP wrapper used by both the churn and (later) narrative paths. TreeExplainer first, falls back to model-agnostic. - ml/_io.py — shared mart loader, fails with clear errors when the warehouse / mart is missing (AC-3.4). T-07 — Upsell propensity (AC-3.6〜3.7): - ml/upsell.py — fits a LogisticRegression propensity model on `upsell_opportunities`, measures hold-out ROC-AUC and lift @ top-10%, raises if the lift falls below the 1.5× floor. Data-generator amendment: the churn signal in `data/generate.py` was under-engineered (best ROC-AUC was 0.6972, just below the AC-3.2 0.70 floor). Reworked the event generator so churned customers (a) get 4× lower event weight and (b) have their timestamps biased into the older half of the history window. The mart's `recent_to_lifetime_ratio` feature now correlates cleanly with the cancel label, pushing churn ROC-AUC to 0.7448 on a seed=42 / n_customers=1000 run. Local verify (Python 3.12 venv, deterministic seed=42): - `make data` + `make dbt` + `make ml` end-to-end OK - Churn ROC-AUC = 0.7448 (LR wins; XGBoost 0.7196), AC-3.2 PASS - Upsell lift @ top-10% = 2.81× (vs 1.5× floor), AC-3.7 PASS - ruff OK / mypy OK / pytest 15 PASS, coverage 86.75% - doc-drift 0 fail / adr-claims 77/77 Test infra: switched from `subprocess.run(["dbt", ...])` to `dbt.cli.main.dbtRunner` so the fixtures work on Windows without venv Scripts being on PATH. DuckDB rw-mode for both dbt + ml avoids the "different configuration" connection error when both run in-process. Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>
2 tasks
leagames0221-sys
added a commit
that referenced
this pull request
May 21, 2026
…#86) Phase 2 of the data-analytics-demo bolt-on. Ships the dbt layer that transforms the 4 synthetic SaaS source tables into 4 mart tables ready for the ML and dashboard layers. T-04 — dbt scaffold + staging + intermediate (AC-2.5): - dbt_project.yml + profiles.yml (DuckDB target reads ../warehouse/analytics.duckdb) - models/staging/_sources.yml declares the 4 raw tables - models/staging/stg_{customers,subscriptions,events,invoices}.sql (views, thin pass-through + dtype casting) - models/intermediate/int_customer_features.sql (one row per customer combining latest subscription + lifetime event stats + invoice totals) - models/intermediate/int_event_aggregates.sql (per-customer × event_type rollup powering both churn and upsell marts) T-05 — marts + schema tests (AC-2.1, AC-2.2, AC-2.3, AC-2.4, AC-2.6): - marts/rfm_segments.sql (R/F/M quintile scoring + 5-bucket label; self-anchored to max(event_at) so it's reproducible per seed) - marts/churn_features.sql (one row per customer; trailing-30d vs lifetime daily-avg ratio is the engineered churn signal label) - marts/upsell_opportunities.sql (one row per free/pro customer; premium / advanced event counts as engineered signal; upgraded label) - marts/cohort_retention.sql (monthly cohort × months-since-signup grid) - marts/schema.yml (not_null + unique + accepted_values tests, new dbt generic-test argument syntax) Makefile: `dbt` target now runs `dbt run && dbt test` from dbt_project/ with DBT_PROFILES_DIR=. so the project ships its own profile. .gitignore: add dbt_project/.user.yml (dbt's per-developer anonymous tracking UUID; not portfolio-relevant). Local verify on seeded DuckDB (n_customers=1000): - dbt parse → OK (deprecation warnings cleared) - dbt run → 10/10 OK (4 staging views + 2 int views + 4 mart tables) - dbt test → 20/20 PASS (not_null + unique + accepted_values) - Mart rowcounts → rfm 1000, churn 1000, upsell 898, cohort 319 - Sanity numbers → churn rate 26.2% (matches synth ~25% canceled), upsell rate 35.5%, RFM segments distributed across 5 buckets - Python verify → ruff OK / mypy OK / pytest 8 PASS (cov 83.92%) - doc-drift + adr-claims → 0 failure / 77/77 PASS Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>
leagames0221-sys
added a commit
that referenced
this pull request
May 21, 2026
Phase 3 of the data-analytics-demo bolt-on. Trains two propensity models on the dbt marts shipped in #86 and saves the resulting model artifacts + SHAP summary that the narrative layer (T-08) consumes next. T-06 — Churn pipeline (AC-3.1〜3.5): - ml/churn.py — fits a LogisticRegression baseline AND an XGBoost classifier on `churn_features`, picks the higher hold-out ROC-AUC, and saves model.pkl + metadata.json + shap_summary.json. - ml/explain.py — SHAP wrapper used by both the churn and (later) narrative paths. TreeExplainer first, falls back to model-agnostic. - ml/_io.py — shared mart loader, fails with clear errors when the warehouse / mart is missing (AC-3.4). T-07 — Upsell propensity (AC-3.6〜3.7): - ml/upsell.py — fits a LogisticRegression propensity model on `upsell_opportunities`, measures hold-out ROC-AUC and lift @ top-10%, raises if the lift falls below the 1.5× floor. Data-generator amendment: the churn signal in `data/generate.py` was under-engineered (best ROC-AUC was 0.6972, just below the AC-3.2 0.70 floor). Reworked the event generator so churned customers (a) get 4× lower event weight and (b) have their timestamps biased into the older half of the history window. The mart's `recent_to_lifetime_ratio` feature now correlates cleanly with the cancel label, pushing churn ROC-AUC to 0.7448 on a seed=42 / n_customers=1000 run. Local verify (Python 3.12 venv, deterministic seed=42): - `make data` + `make dbt` + `make ml` end-to-end OK - Churn ROC-AUC = 0.7448 (LR wins; XGBoost 0.7196), AC-3.2 PASS - Upsell lift @ top-10% = 2.81× (vs 1.5× floor), AC-3.7 PASS - ruff OK / mypy OK / pytest 15 PASS, coverage 86.75% - doc-drift 0 fail / adr-claims 77/77 Test infra: switched from `subprocess.run(["dbt", ...])` to `dbt.cli.main.dbtRunner` so the fixtures work on Windows without venv Scripts being on PATH. DuckDB rw-mode for both dbt + ml avoids the "different configuration" connection error when both run in-process. Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 2 of the data-analytics-demo bolt-on. Ships the dbt transformation layer that turns the 4 synthetic SaaS source tables (produced in PR #83) into 4 mart tables that the ML, dashboard, and semantic layers consume next.
What lands
dbt_project.yml,profiles.yml,models/staging/_sources.ymlstg_customers/stg_subscriptions/stg_events/stg_invoicesint_customer_features,int_event_aggregatesrfm_segments,churn_features,upsell_opportunities,cohort_retentionmarts/schema.yml(not_null + unique + accepted_values, new dbt generic-test argument syntax)AC coverage
make dbtproduces 3 marts (+cohort_retention)marts/schema.ymlruns 20 tests against 4 martsdbt_project.ymlupsell_opportunitiesmart producedpremium_event_count/advanced_event_countfeatures +upgradedlabelLocal verify (Python 3.12 venv + dbt-duckdb 1.10.1)
Sanity on engineered ML signals (matches generator design):
Python + repo checks unchanged:
check-doc-drift.mjs0 failurecheck-adr-claims.mjs77/77 PASSTest plan