From 1d78eb2a00e7d546c9a3ee4825f3acfe41d725b3 Mon Sep 17 00:00:00 2001 From: leagames0221-sys Date: Mon, 18 May 2026 01:20:26 +0900 Subject: [PATCH] feat(data-analytics-demo): T-13 docs + T-14 changelog/handoff polish MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Final phase of the data-analytics-demo bolt-on. Closes Stage 4 of the Spec-Driven workflow (T-01〜T-14 — 14 / 14 tasks done across 8 PRs). T-13 — Documentation: - packages/data-analytics-demo/README.md: full rewrite. Quickstart in ≤ 5 commands (AC-ε.1), per-layer one-line architecture summary, layout table, constraints (load-bearing), engineered ML signals explanation. - packages/data-analytics-demo/docs/architecture.md: mermaid pipeline diagram covering all 6 layers plus per-layer detail tables and the list of files produced by a full `make demo` run. T-14 — Changelog + handoff: - CHANGELOG.md (root): adds an Unreleased entry recording the data-analytics-demo 0.1.0 ship — six layers, CI infrastructure, security mitigations, test surface, all 8 PRs referenced. The package becomes the seventh `packages/*` entry and the monorepo's first Python sub-tree. - HANDOFF.md (root): "Current" block flipped from "planning phase" to "shipped"; the verified-prior-art table superseded by the ADR-0070 reference (it now lives in the design ADR, not in the ephemeral handoff). Local verify: - ruff OK / mypy OK (21 source files) / pytest 36 PASS / coverage 87.20% - check-doc-drift 0 fail / check-adr-claims 77/77 PASS Stage 4 done. The package can be developed and demoed end-to-end with a single `make demo` invocation; recruiters can clone the repo and read all three deliverables (SQL marts, ML + SHAP, dashboard + narrative) in under five minutes. 
--- CHANGELOG.md | 11 ++ HANDOFF.md | 37 ++--- packages/data-analytics-demo/README.md | 88 ++++++++---- .../data-analytics-demo/docs/architecture.md | 133 ++++++++++++++++++ 4 files changed, 214 insertions(+), 55 deletions(-) create mode 100644 packages/data-analytics-demo/docs/architecture.md diff --git a/CHANGELOG.md b/CHANGELOG.md index d4ab923..17eb554 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,17 @@ All notable changes to this project are documented here. The format follows [Kee ## [Unreleased] +### Added — `@craftstack/data-analytics-demo` 0.1.0 (Spec-Driven Stage 4, T-01〜T-14) + +Polyglot package (Python + a single JS sub-package member) shipping a local-only customer-analytics demo end-to-end. Built across PRs #82 #83 #86 #87 #88 #89 #90 #91 (this PR is #92, the docs + changelog close-out). [ADR-0070](docs/adr/0070-data-analytics-demo-polyglot-adoption.md) (with the 2026-05-18 dashboard pivot amendment) records the design and the Evidence-vs-Python-Jinja2-Plotly tradeoff. + +- **Six pipeline layers**: data generation (Faker + numpy + DuckDB), dbt transformation (staging / intermediate / marts), ML (LogisticRegression + XGBoost churn — ROC-AUC ≥ 0.70 floor; LogReg upsell propensity — lift @ top-10% ≥ 1.5× floor), local-LLM narrative (Ollama, AC-4.3 cloud-credential guard), self-built static-HTML dashboard (Jinja2 + Plotly), MetricFlow KPI semantic layer with pure-Python validator. +- **CI infrastructure**: `.github/workflows/python-test.yml` (ruff + mypy --strict + pytest with 80 % coverage floor); `.github/workflows/python-audit.yml` (pip-audit `--strict` against OSV); Dependabot `pip` ecosystem grouped by dbt / ml / duckdb / dev. +- **Security mitigations**: `duckdb >= 1.4.2` pin (CVE-2025-64429), no external API credentials anywhere, all generated artifacts gitignored, every dependency listed in ADR-0070 with literal license + maintenance verification. 
+- **Test surface**: 36 pytest cases (data / dbt / ml-churn / ml-upsell / narrative / dashboard / semantic / e2e); coverage 87.20 %. + +The package becomes the seventh `packages/*` entry and the monorepo's first Python sub-tree. + ## [0.5.19] — 2026-04-29 ### Changed — Run #6 hiring-sim findings closure + deploy-visible-surface coverage extension (ADR-0069) diff --git a/HANDOFF.md b/HANDOFF.md index 200ebe3..6015c6e 100644 --- a/HANDOFF.md +++ b/HANDOFF.md @@ -4,28 +4,17 @@ Tracks ephemeral in-progress state between AI-assisted sessions. For shipped sta ## Current -- **last session**: 2026-05-17 -- **status**: stable on main; opacity-sanitize + handoff infra shipped (PR #79 + #80) -- **active work item**: data analytics demo package (planning phase — prior art scan done, scaffold not yet started) -- **next planned**: Spec-Driven Stage 1 Discovery for `packages/data-analytics-demo/` +- **last session**: 2026-05-18 +- **status**: stable on main; `@craftstack/data-analytics-demo` 0.1.0 shipped (PRs #82 #83 #86 #87 #88 #89 #90 #91 #92) +- **active work item**: none in progress +- **next planned**: TBD — pipeline complete and reproducible via `make demo` inside `packages/data-analytics-demo/` - **blockers**: none -### Planned package — data-analytics-demo +### Shipped — data-analytics-demo (2026-05-18) -Customer-behavior / SaaS-style analytics demo for portfolio. Constraints: local-only (no credit card), local LLM (Ollama), synthetic data only. +Local-only SaaS customer-analytics demo. Six pipeline layers (data / dbt / ml / narrative / dashboard / semantic) plus polyglot CI infrastructure. See [ADR-0070](docs/adr/0070-data-analytics-demo-polyglot-adoption.md) for the design and the dashboard pivot (Evidence → self-built Python + Jinja2 + Plotly). 
-Verified prior-art seeds (license + maintenance literal-checked 2026-05-17): - -| seed | license | role | -|---|---|---| -| dbt-labs/jaffle_shop_duckdb (default branch: `duckdb`) | Apache 2.0 | dbt project skeleton (staging/marts 2-tier pattern) | -| evidence-dev/evidence | MIT | BI-as-code dashboard (SQL fenced in markdown) | -| dbt-labs/metricflow | Apache 2.0 | semantic layer YAML (single KPI definition) | -| duckdb/duckdb (tpcds extension) | MIT | synthetic SaaS data via `CALL dsdgen(sf=1)` | -| ollama/ollama (Llama 3.1 8B Instruct) | MIT | local LLM for SHAP→narrative | -| Python in Plain English (Faker+DuckDB+sklearn article, 2025-09) | technique reference | churn pipeline pattern (no code clone) | - -Rejected: `dbt-labs/jaffle-shop-template` (no LICENSE + 2.5y unmaintained). +Quickstart: `cd packages/data-analytics-demo && make install && (ollama serve &) && make demo`. ## Update protocol @@ -45,9 +34,9 @@ When starting a session: ## What lives where -| Information | Location | Lifetime | -|---|---|---| -| Ephemeral in-progress state | this file | days–weeks | -| Decisions (why we chose X) | [docs/adr/](docs/adr/) | permanent | -| Shipped feature log | [CHANGELOG.md](CHANGELOG.md) | permanent | -| Conventions & rules for AI | [apps/*/AGENTS.md](apps/) | permanent | +| Information | Location | Lifetime | +| --------------------------- | ---------------------------- | ---------- | +| Ephemeral in-progress state | this file | days–weeks | +| Decisions (why we chose X) | [docs/adr/](docs/adr/) | permanent | +| Shipped feature log | [CHANGELOG.md](CHANGELOG.md) | permanent | +| Conventions & rules for AI | [apps/\*/AGENTS.md](apps/) | permanent | diff --git a/packages/data-analytics-demo/README.md b/packages/data-analytics-demo/README.md index d653138..c8f31cb 100644 --- a/packages/data-analytics-demo/README.md +++ b/packages/data-analytics-demo/README.md @@ -1,51 +1,77 @@ # @craftstack/data-analytics-demo -> **Status**: Phase 0 scaffold (T-01 / T-02 complete). 
Pipeline stages T-03 onward are placeholders that exit 1 with a TODO message. See [ADR-0070](../../docs/adr/0070-data-analytics-demo-polyglot-adoption.md) for the design. +Customer-analytics demo for a SaaS-style data set: synthetic data → SQL marts (dbt) → ML (churn + upsell) → narrative (local LLM via Ollama) → BI dashboard (self-built static HTML) → KPI semantic layer (MetricFlow). All six layers run on a developer laptop, no credit card, no cloud-LLM API calls. -Local-only SaaS customer-analytics demo: synthetic data → SQL marts (dbt) → ML (churn + upsell) → narrative (local LLM via Ollama) → BI dashboard (Evidence) → KPI semantic layer (MetricFlow). +## Why it exists -## Constraints (load-bearing — see ADR-0070) +It is the portfolio answer to a data-analyst job description that explicitly names three axes: -- **Zero credit card** — no Snowflake, BigQuery, Anthropic, OpenAI, or any paid service. Synthetic data only. -- **Local LLM only** — narrative generation runs against a local Ollama server. No external network calls. -- **Consumer laptop** — designed to complete `make demo` on a developer laptop in under 5 minutes. -- **Synthetic data only** — no real customer PII. Faker + DuckDB tpcds generate everything. +1. **Advanced SQL + statistical modelling** — SQL marts and propensity models for churn and upsell. +2. **Business-strategy narratives** — an executive brief generated from the model's own SHAP feature importances. +3. **BI enablement** — a single source of truth (MetricFlow KPI definitions) plus a static dashboard built from the same marts. -## Quickstart + +A recruiter cloning this repo can run `make demo` and read all three deliverables in under five minutes. + +## Quickstart (5 commands) ```bash -# 1. 
Install the package (editable, with dev extras) -make install +make install # editable install + dev extras +ollama serve & # start local Ollama +ollama pull llama3.1:8b-instruct-q4_K_M # or set OLLAMA_MODEL to a model already pulled +make demo # data → dbt → ml → narrative → dashboard → semantic +open dashboard/build/index.html # (or your platform equivalent) +``` + +`make demo` runs the full chain with a visible banner per stage. Any stage failure halts the chain with a non-zero exit code (AC-α.2). + +## Layout + +| Path | Role | +| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `pyproject.toml` | Python package definition + pinned deps (`duckdb >= 1.4.2` mitigates [CVE-2025-64429](https://github.com/duckdb/duckdb/security/advisories/GHSA-vmp8-hg63-v2hp)) | +| `package.json` | pnpm workspace member (scripts proxy to `make`) | +| `Makefile` | Single user-facing entry point — every stage has a target; `make demo` chains all six | +| `src/data_analytics_demo/` | Python source ([data](src/data_analytics_demo/data), [ml](src/data_analytics_demo/ml), [narrative](src/data_analytics_demo/narrative), [dashboard](src/data_analytics_demo/dashboard), [semantic](src/data_analytics_demo/semantic)) | +| `dbt_project/` | dbt project (staging / intermediate / marts; uses `dbt-duckdb`) | +| `semantic/kpi.yml` | MetricFlow-compatible semantic models + KPI metrics (single source of truth) | +| `warehouse/` | Generated DuckDB file (gitignored) | +| `ml/artifacts/` | Generated model + SHAP outputs (gitignored) | +| `narrative/output.md` | Generated LLM narrative (gitignored) | +| `dashboard/build/` | Generated static HTML site (gitignored) | +| `tests/` | pytest suite — one file per layer plus an end-to-end test | +| `docs/architecture.md` | Pipeline diagram + per-layer 
details | -# 2. Make sure Ollama is running locally and the model is pulled -ollama serve & -ollama pull llama3.1:8b-instruct-q4_K_M +## Architecture (one-line summary per layer) -# 3. Run the full pipeline -make demo +``` +data Faker + numpy synthesise 1000 customers / 50 000 events / 2000 subscriptions / 5000 invoices into DuckDB +dbt staging (4 views) → intermediate (2 views) → marts (rfm_segments, churn_features, upsell_opportunities, cohort_retention) +ml LogisticRegression baseline + XGBoost on churn (ROC-AUC ≥ 0.70) + LogisticRegression on upsell (lift @ top-10% ≥ 1.5×) + SHAP summary +narrative Local Ollama (llama3.1:8b-instruct by default; OLLAMA_MODEL env-var overridable) generates an executive markdown brief from the SHAP summary +dashboard Self-built Python generator (Jinja2 + Plotly via CDN) emits 4 static HTML pages from the marts +semantic MetricFlow YAML — 3 semantic models, 4 KPI metrics; structural invariants enforced by the validator ``` -`make demo` chains: `data → dbt → ml → narrative → dashboard`. Any stage failure halts the pipeline with a non-zero exit code. +See [docs/architecture.md](docs/architecture.md) for the pipeline diagram and per-layer details. -## Layout +## Constraints (load-bearing — see [ADR-0070](../../docs/adr/0070-data-analytics-demo-polyglot-adoption.md)) + +- **Zero credit card.** No Snowflake / BigQuery free trial; no Anthropic / OpenAI / Gemini API. +- **Local LLM only.** Narrative generation runs against a local Ollama; the module asserts the absence of cloud-LLM credentials at invocation time (AC-4.3). +- **Consumer laptop.** End-to-end completes well under five minutes at the default seed sizing. +- **Synthetic data only.** No real PII anywhere; Faker `company_email()` / `company()` generate everything. 
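The "Local LLM only" constraint above says the narrative module asserts the absence of cloud-LLM credentials at invocation time (AC-4.3). A minimal sketch of such a guard follows. The function name and the list of environment-variable names are assumptions made for illustration; this patch does not show which variables the real `ollama_client.py` checks.

```python
import os

# Hypothetical list: the variable names the real guard checks are not shown in this patch.
CLOUD_LLM_ENV_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def assert_no_cloud_credentials(environ=None):
    """Raise RuntimeError if any listed cloud-LLM credential is set and non-empty."""
    env = os.environ if environ is None else environ
    present = sorted(name for name in CLOUD_LLM_ENV_VARS if env.get(name))
    if present:
        raise RuntimeError("cloud-LLM credentials present: " + ", ".join(present))


if __name__ == "__main__":
    # A clean environment passes silently.
    assert_no_cloud_credentials({"PATH": "/usr/bin"})
    # A set credential aborts before any narrative generation would start.
    try:
        assert_no_cloud_credentials({"OPENAI_API_KEY": "sk-test"})
    except RuntimeError as exc:
        print(exc)
```

Passing the environment as a parameter keeps the check testable without mutating `os.environ`; calling it with no argument inspects the real process environment.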
+ +## Engineered ML signals (so the models have something to learn) + +- **Churn**: customers without an active subscription get 4× lower event weight, and their timestamps are biased into the older half of the history window — `recent_to_lifetime_ratio` in `churn_features` correlates with the cancel label. +- **Upsell**: `feature_use_premium` / `feature_use_advanced` event distributions skew higher for paid tiers — `premium_event_count` in `upsell_opportunities` correlates with the upgrade label. -| Path | Role | -| -------------------------- | --------------------------------------------------------------------------- | -| `pyproject.toml` | Python package definition + pinned deps (DuckDB ≥ 1.4.2 for CVE-2025-64429) | -| `package.json` | pnpm workspace member (script proxies to Makefile) | -| `Makefile` | Single entry point — every stage has a target | -| `src/data_analytics_demo/` | Python source (data gen, ML, narrative) | -| `dbt_project/` | dbt project (staging / intermediate / marts) | -| `dashboard/` | Evidence BI sub-project (static HTML build) | -| `semantic/` | MetricFlow KPI definitions | -| `warehouse/` | Generated DuckDB file lives here (gitignored) | -| `ml/artifacts/` | Generated model + SHAP outputs (gitignored) | -| `tests/` | pytest suite covering each layer | +Both signals are observable through SQL alone (no leak from the data generator into the ML feature surface). ## Prior art (pattern extraction only, no clone) -Six OSS projects supply the design pattern; everything is reimplemented from scratch in this package. License + maintenance verified 2026-05-17. See ADR-0070 for the full table including a rejected candidate. +Six OSS projects supplied the design pattern; everything is reimplemented from scratch in this package. License + maintenance literal-verified 2026-05-17. See [ADR-0070](../../docs/adr/0070-data-analytics-demo-polyglot-adoption.md) for the full table including a rejected candidate and the 2026-05-18 dashboard pivot. 
## License -MIT — same as the craftstack monorepo. +MIT — same as the rest of the craftstack monorepo. diff --git a/packages/data-analytics-demo/docs/architecture.md b/packages/data-analytics-demo/docs/architecture.md new file mode 100644 index 0000000..fc5eae5 --- /dev/null +++ b/packages/data-analytics-demo/docs/architecture.md @@ -0,0 +1,133 @@ +# Architecture — `@craftstack/data-analytics-demo` + +End-to-end pipeline view. The same DuckDB file (`warehouse/analytics.duckdb`) is the load-bearing artifact — every layer either writes to it or reads from it. + +## Pipeline + +```mermaid +flowchart LR + subgraph generate["1. Data generation (Faker + numpy)"] + G[generate.py] -->|customers / subscriptions / events / invoices| DB[(analytics.duckdb)] + end + + subgraph dbt_layer["2. dbt transformation"] + DB --> STG[staging views × 4] + STG --> INT[intermediate views × 2] + INT --> MARTS["marts × 4
rfm_segments
churn_features
upsell_opportunities
cohort_retention"] + MARTS --> DB + end + + subgraph ml_layer["3. ML pipelines"] + DB -->|churn_features| CHURN[churn.py
LogReg + XGBoost] + DB -->|upsell_opportunities| UPSELL[upsell.py
LogReg propensity] + CHURN --> ART[ml/artifacts/
model.pkl + shap_summary.json] + UPSELL --> ART + end + + subgraph narrative_layer["4. Narrative (local LLM)"] + ART -->|shap_summary.json| NARR[ollama_client + prompts] + NARR --> OLL[(Ollama @ localhost:11434)] + OLL --> NARR + NARR --> NMD[narrative/output.md] + end + + subgraph dashboard_layer["5. Dashboard (Python + Jinja2 + Plotly)"] + DB -->|marts| DASH[render.py + templates] + DASH --> HTML[dashboard/build/
index / rfm / churn / kpi] + end + + subgraph semantic_layer["6. Semantic layer (MetricFlow)"] + KPI[semantic/kpi.yml] --> VAL[validator.py] + VAL --> REP[ValidationReport] + end + + classDef store fill:#fff4e1,stroke:#a16207; + classDef art fill:#e1f5ff,stroke:#0369a1; + classDef out fill:#e7ffe7,stroke:#16a34a; + class DB,OLL store; + class STG,INT,MARTS,ART art; + class NMD,HTML,REP out; +``` + +## Layer details + +### 1. Data generation — `src/data_analytics_demo/data/` + +| File | Role | +| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | +| `schemas.py` | Pydantic models (`Customer`, `Subscription`, `Event`, `Invoice`) define the column shapes the dbt sources expect. | +| `generate.py` | Faker (names / emails / companies) + numpy (deterministic distributions) → 4 tables in DuckDB. Seed lives in `DEMO_RANDOM_SEED` (default 42). | + +The generator deliberately injects two patterns so the ML layer has something to learn: + +- **Churn**: customers without active subscriptions get 4× lower event weight, and their timestamps are biased into the older half of the history window (so `recent_to_lifetime_ratio` carries signal). +- **Upsell**: `feature_use_premium` / `feature_use_advanced` events skew higher for paid tiers. + +### 2. dbt transformation — `dbt_project/` + +Standard staging → intermediate → marts layout, profiled to a local DuckDB. Marts are materialised as tables (the ML / dashboard layers read them); staging and intermediate stay as views. 
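The view-versus-table split described above can be illustrated with a small stand-alone sketch. It uses the standard-library `sqlite3` module as a stand-in for DuckDB and plain SQL in place of dbt models, so it runs without extra dependencies. Every table, view, and column name here is hypothetical and not taken from `dbt_project/`.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; all object and column names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE raw_events (customer_id INTEGER, event_type TEXT);
    INSERT INTO raw_events VALUES (1, 'LOGIN'), (1, 'export'), (2, 'login');

    -- staging: a view that only cleans column values
    CREATE VIEW stg_events AS
        SELECT customer_id, lower(event_type) AS event_type FROM raw_events;

    -- intermediate: a view that aggregates the staging view
    CREATE VIEW int_event_counts AS
        SELECT customer_id, count(*) AS event_count
        FROM stg_events GROUP BY customer_id;

    -- mart: materialised as a table, so readers get a stable snapshot
    CREATE TABLE mart_customer_activity AS
        SELECT customer_id, event_count FROM int_event_counts;
    """
)


def object_types(connection):
    """Map each user-defined object name to 'table' or 'view'."""
    rows = connection.execute(
        "SELECT name, type FROM sqlite_master WHERE name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return dict(rows)


kinds = object_types(con)

# New raw rows show up in the views immediately, but not in the materialised mart.
con.execute("INSERT INTO raw_events VALUES (2, 'export')")
view_count = con.execute(
    "SELECT event_count FROM int_event_counts WHERE customer_id = 2"
).fetchone()[0]
mart_count = con.execute(
    "SELECT event_count FROM mart_customer_activity WHERE customer_id = 2"
).fetchone()[0]
print(kinds, view_count, mart_count)
```

The last three statements show the trade-off: views always reflect the current source rows, while a materialised mart only changes when it is rebuilt, which is why downstream readers such as the ML and dashboard layers see a consistent snapshot.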
+ +| Mart | Grain | Purpose | +| ---------------------- | ----------------------------------------------- | ------------------------------------------------------ | +| `rfm_segments` | one row per active customer | R / F / M quintile scoring + 5-bucket label | +| `churn_features` | one row per customer | Feature table for churn prediction; `is_churned` label | +| `upsell_opportunities` | one row per free/pro customer | Feature table for upsell propensity; `upgraded` label | +| `cohort_retention` | one row per (cohort_month, months_since_signup) | Monthly retention grid | + +20 schema tests (not_null, unique, accepted_values) enforce the contract on the marts. + +### 3. ML — `src/data_analytics_demo/ml/` + +| File | Role | +| ------------ | ------------------------------------------------------------------------------------------------------------------------------------ | +| `_io.py` | Shared mart loader; raises clear errors when warehouse / mart is missing. | +| `churn.py` | Trains a LogisticRegression baseline AND an XGBoost classifier on `churn_features`, picks the higher hold-out ROC-AUC (floor: 0.70). | +| `upsell.py` | LogisticRegression propensity on `upsell_opportunities`; measures hold-out ROC-AUC and lift @ top-10% (floor: 1.5×). | +| `explain.py` | SHAP wrapper. TreeExplainer first; falls back to model-agnostic. | + +Determinism: every random-number-using step takes `random_state=42`. Re-running with the same seed produces byte-identical artifacts. + +### 4. Narrative — `src/data_analytics_demo/narrative/` + +| File | Role | +| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `ollama_client.py` | Env-var-gated host + model resolution; runtime assertion that no cloud-LLM credentials are present (AC-4.3). | +| `prompts.py` | Executive-brief prompt template (SHAP summary → prompt is the only call point). 
| +| `generate.py` | Reads `shap_summary.json`, calls Ollama, wraps the body with provenance metadata (model id, source path, timestamp, "External LLM calls: 0" advertisement). | + +### 5. Dashboard — `src/data_analytics_demo/dashboard/` + +| File | Role | +| --------------------- | ----------------------------------------------------------------------------------------------------------------- | +| `render.py` | Reads marts via DuckDB, calls into `charts.py` and renders Jinja2 templates. | +| `queries.py` | Centralised SQL queries against the marts. | +| `charts.py` | Plotly figure builders (bar / scatter / line / area / heatmap). CDN-served plotly.js keeps per-page size ≤ 40 KB. | +| `templates/*.html.j2` | Base layout + index / rfm / churn / kpi pages. | + +The original design used Evidence (MIT) but its SvelteKit-based build chain hit four+ chained peer-dependency failures under pnpm 10's isolated layout. The amendment in [ADR-0070](../../../docs/adr/0070-data-analytics-demo-polyglot-adoption.md) documents the pivot. + +### 6. Semantic layer — `src/data_analytics_demo/semantic/` + `semantic/kpi.yml` + +| Asset | Role | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `semantic/kpi.yml` | MetricFlow-compatible semantic models (customers / subscriptions / invoices) + metrics (customers, active_subscriptions, mrr, paid_invoice_volume). | +| `validator.py` | Pure-Python validator. Enforces required keys, non-empty dims/measures, cross-references. Independent of the MetricFlow CLI so the test suite has no shell dependency. 
| + +## Files produced by a full `make demo` run + +``` +warehouse/analytics.duckdb ← stages 1 + 2 + 5 +ml/artifacts/churn_model.pkl ← stage 3 (churn) +ml/artifacts/churn_metadata.json +ml/artifacts/shap_summary.json +ml/artifacts/upsell_model.pkl ← stage 3 (upsell) +ml/artifacts/upsell_metadata.json +ml/artifacts/upsell_lift_report.json +narrative/output.md ← stage 4 +dashboard/build/index.html ← stage 5 +dashboard/build/rfm.html +dashboard/build/churn.html +dashboard/build/kpi.html +``` + +All output paths are gitignored — only the source code, dbt SQL, semantic YAML, templates, and tests are tracked.
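As a quick sanity check after a run, the file list above can be turned into a small script. This is a sketch rather than part of the package: the helper name `missing_outputs` is hypothetical, and the paths are copied from the list above and assumed to be relative to `packages/data-analytics-demo/`.

```python
from pathlib import Path

# Paths copied from the list above, relative to packages/data-analytics-demo/.
EXPECTED_OUTPUTS = (
    "warehouse/analytics.duckdb",
    "ml/artifacts/churn_model.pkl",
    "ml/artifacts/churn_metadata.json",
    "ml/artifacts/shap_summary.json",
    "ml/artifacts/upsell_model.pkl",
    "ml/artifacts/upsell_metadata.json",
    "ml/artifacts/upsell_lift_report.json",
    "narrative/output.md",
    "dashboard/build/index.html",
    "dashboard/build/rfm.html",
    "dashboard/build/churn.html",
    "dashboard/build/kpi.html",
)


def missing_outputs(root):
    """Return the expected output paths that are not regular files under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (base / rel).is_file()]


# Run from the package directory after `make demo`; an empty list means all files exist.
print(missing_outputs("."))
```

An empty list means every listed artifact was written; any entries that remain point at the stage that did not complete.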