feat(data-analytics-demo): T-03 data generation + T-12 Python CI #83

Merged
leagames0221-sys merged 6 commits into main from feat/data-analytics-demo-t03-t12
May 17, 2026
Conversation

@leagames0221-sys

Owner

Summary

Phase 1 of the data-analytics-demo bolt-on. Ships T-03 (synthetic-data generator) and T-12 (Python CI infrastructure) together so the new Python code is verified by CI from the first commit. One more pipeline stage is no longer a placeholder: `make data` now produces a real DuckDB file at warehouse/analytics.duckdb.

T-03 — Data generation

| AC | What lands |
| --- | --- |
| 1.1: 4 tables present | customers, subscriptions, events, invoices |
| 1.2: row-count floor | defaults: 1000 / 2000 / 50000 / 5000 (overridable via `DEMO_N_*` env) |
| 1.3: progress on stderr | `_emit()` writes `[data] …` lines per stage |
| 1.4: auto-create dir | `warehouse_dir.mkdir(parents=True, exist_ok=True)` |
| 1.5: deterministic | `Faker.seed(seed)` + `np.random.default_rng(seed)` |
| γ.1: no real PII | only Faker `company_email()` / `company()` |
| δ.2: reproducible | identical output for the same seed (test-asserted) |
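
A minimal sketch of how AC 1.2 to 1.4 could fit together, using only the standard library. The `_emit()` name, the `[data]` prefix, the default counts, and the `mkdir(parents=True, exist_ok=True)` call come from the table above. The exact `DEMO_N_<TABLE>` variable names and the helpers `resolve_row_counts` and `prepare_warehouse` are assumptions for illustration, not the code in generate.py.

```python
import os
import sys
from collections.abc import Mapping
from pathlib import Path
from typing import Optional

# Defaults from AC 1.2. The per-table env suffixes are assumed, since the
# PR only states the DEMO_N_* prefix.
DEFAULT_ROW_COUNTS = {
    "customers": 1000,
    "subscriptions": 2000,
    "events": 50000,
    "invoices": 5000,
}


def _emit(message: str) -> None:
    """Write one progress line to stderr with the [data] prefix (AC 1.3)."""
    print(f"[data] {message}", file=sys.stderr)  # noqa: T201


def resolve_row_counts(env: Optional[Mapping[str, str]] = None) -> dict[str, int]:
    """Return per-table row counts, letting DEMO_N_<TABLE> override defaults."""
    source = os.environ if env is None else env
    return {
        table: int(source.get(f"DEMO_N_{table.upper()}", str(default)))
        for table, default in DEFAULT_ROW_COUNTS.items()
    }


def prepare_warehouse(db_path: Path) -> Path:
    """Create the parent directory of the database file if missing (AC 1.4)."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    _emit(f"warehouse directory ready: {db_path.parent}")
    return db_path
```

Passing the environment as an optional mapping keeps the override logic testable without mutating `os.environ`.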

Engineered ML signal (so T-06 / T-07 have something to learn):

  • Churn: trailing-30d event drop-off correlates with subscription.status = 'canceled'
  • Upsell: feature_use_premium events on free-tier customers correlate with plan upgrades

Both signals are observable through SQL — no leak from the generator into the ML feature surface.
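
To make the two signals concrete, here is one way such a bias could be wired into a generator. This is an illustrative sketch only: the function names, the linear form, and every constant (base rates, weights, cap) are hypothetical and are not taken from generate.py.

```python
def churn_probability(
    events_prior_30d: int,
    events_trailing_30d: int,
    base: float = 0.05,      # hypothetical baseline churn rate
    weight: float = 0.6,     # hypothetical strength of the drop-off signal
) -> float:
    """Raise churn probability as trailing-30d activity falls below the prior window."""
    if events_prior_30d <= 0:
        return base
    # Fraction of activity lost in the trailing window, clamped at 0 for growth.
    drop_off = max(0.0, 1.0 - events_trailing_30d / events_prior_30d)
    return min(1.0, base + weight * drop_off)


def upsell_probability(
    plan: str,
    premium_feature_events: int,
    base: float = 0.02,       # hypothetical baseline upgrade rate
    per_event: float = 0.05,  # hypothetical lift per premium-feature event
    cap: float = 0.5,         # hypothetical ceiling
) -> float:
    """Raise upgrade probability for free-tier customers who touch premium features."""
    if plan != "free":
        return 0.0
    return min(cap, base + per_event * premium_feature_events)
```

Because both inputs are plain event counts, the same quantities can be recomputed downstream with SQL aggregates over the events table, which is what keeps the signal out of the feature surface itself.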

T-12 — Python CI infrastructure

  • .github/workflows/python-test.yml — Python 3.11, editable-install with dev extras, then ruff + mypy --strict + pytest (coverage floor 80% configured in pyproject.toml)
  • .github/workflows/python-audit.yml — pip-audit --strict against OSV, weekly cron + on PRs that touch pyproject.toml
  • .github/dependabot.yml — pip ecosystem under /packages/data-analytics-demo, grouped by dbt / ml / duckdb / dev for review readability

Design deviation from ADR-0070

ADR-0070 mentioned the DuckDB tpcds extension as a synthetic-data source. tpcds is a retail benchmark; its 24 tables do not fit the 4-table SaaS schema this package commits to. Reverted to pure Faker + numpy synthesis. ADR-0070 will be amended in the T-13 polish phase to record the deviation transparently.

Verify

Local:

  • python compile check on src/ + tests/ → OK (5 files)
  • node scripts/check-doc-drift.mjs → 0 failure(s), 0 warning(s)
  • node scripts/check-adr-claims.mjs → 77/77 PASS
  • HIVE-token sweep on new files → 0 hits (D-HIVE-OPACITY)

CI:

  • The new python-test workflow will run the 7-test pytest suite + ruff + mypy
  • The new python-audit workflow will check the dependency graph against OSV
  • All 11 existing checks should continue to pass (Python files do not affect the TS pipeline)

Test plan

  • All required status checks pass (existing 11 + 2 new Python workflows)
  • No regression on TS-side CI (Turbo skips this package for lint/test/typecheck because it declares no such scripts)

Phase 1 of the data-analytics-demo bolt-on. Ships the synthetic data
generator (T-03) alongside the Python CI infrastructure (T-12) so the new
Python code is verified by CI from the first commit.

T-03 — Data generation (AC-1.1 to 1.5 + AC-γ.1 + AC-δ.2):
- src/data_analytics_demo/data/schemas.py — Pydantic models for the 4 SaaS
  tables (Customer / Subscription / Event / Invoice)
- src/data_analytics_demo/data/generate.py — Faker + numpy synthesis,
  deterministic via DEMO_RANDOM_SEED (default 42). Writes a DuckDB file at
  warehouse/analytics.duckdb. Engineered signal: trailing-30d event drop-off
  biases churn probability; free-tier customers using premium-feature events
  bias upsell probability (both observable through SQL, no leak from the
  generator into the ML feature surface).
- tests/test_data_generate.py — 7 pytest cases covering each AC.
- Makefile + cli.py — `make data` and `data-analytics-demo data` now do real
  work instead of exit-1 TODO placeholders.
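
The reproducibility property can be sketched as below. `DEMO_RANDOM_SEED` and its default of 42 come from the commit message; everything else is an assumption. In particular, the standard library's `random.Random` stands in for the Faker + numpy pair so the example runs without third-party packages, and `resolve_seed`, `generate_customers`, the column names, and the plan names are hypothetical.

```python
import os
import random
from collections.abc import Mapping
from typing import Optional


def resolve_seed(env: Optional[Mapping[str, str]] = None) -> int:
    """Read DEMO_RANDOM_SEED, falling back to the documented default of 42."""
    source = os.environ if env is None else env
    return int(source.get("DEMO_RANDOM_SEED", "42"))


def generate_customers(n: int, seed: int) -> list[dict[str, object]]:
    """Build n synthetic customer rows from a single seeded RNG.

    random.Random(seed) is a stand-in for Faker.seed(seed) plus
    np.random.default_rng(seed); the point is that all randomness
    flows from one explicit seed.
    """
    rng = random.Random(seed)
    plans = ("free", "pro", "enterprise")  # hypothetical plan names
    return [
        {
            "customer_id": i,
            "email": f"user{i}@example.com",  # reserved example domain, no real PII
            "plan": rng.choice(plans),
        }
        for i in range(1, n + 1)
    ]
```

A test in the spirit of AC-δ.2 would call the generator twice with the same seed and assert that the two outputs are equal.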

T-12 — Python CI infrastructure:
- .github/workflows/python-test.yml — Python 3.11, install editable + dev,
  run ruff + mypy --strict + pytest (with the 80% coverage gate set in
  pyproject.toml).
- .github/workflows/python-audit.yml — pip-audit --strict against OSV.
- .github/dependabot.yml — pip ecosystem on /packages/data-analytics-demo,
  grouped by dbt / ml / duckdb / dev for review readability.

Design note: ADR-0070 mentioned the DuckDB tpcds extension as a synthetic-data
source. tpcds is a retail benchmark and does not fit the 4-table SaaS schema
this package commits to. Reverted to pure Faker + numpy synthesis; ADR-0070
will be amended in T-13 polish phase to record the deviation.

Local verify:
- python -m compileall on src/ + tests/ → OK
- node scripts/check-doc-drift.mjs       → 0 failure(s), 0 warning(s)
- node scripts/check-adr-claims.mjs      → 77/77 PASS
- HIVE-token sweep on new files          → 0 hits (D-HIVE-OPACITY)
@vercel

vercel Bot commented May 17, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| craftstack-collab | Ready | Preview, Comment | May 17, 2026 11:22am |
| craftstack-knowledge | Ready | Preview, Comment | May 17, 2026 11:22am |

CI feedback from #83 (python-test workflow). All 4 are clean style/typing nits:

- F401: drop unused `from collections.abc import Sequence` (TYPE_CHECKING
  block); the `Sequence` was never referenced.
- UP017: `timezone.utc` -> `UTC` (Python 3.11+ alias).
- T201: `_emit()` is the deliberate single exception to the print-suppression
  rule for this package — annotated with `noqa: T201` plus a docstring note
  so the exception is auditable in code review.
- E501: split the timestamps list comprehension at 121 cols into 3 lines.

Verify: python -m compileall src/ OK.
…rrides

CI feedback from #83 (mypy step). Clears 3 errors:

- Add `src/data_analytics_demo/py.typed` (PEP 561 marker). Resolves the two
  `import-untyped` errors on cli.py importing `data_analytics_demo` and
  `data_analytics_demo.data` — the package now declares inline type info.
- Register the marker in [tool.setuptools.package-data] so it ships in the
  installed wheel.
- Add a [[tool.mypy.overrides]] block for pandas / duckdb / faker / shap /
  xgboost / sklearn — none of these publish type stubs that match the current
  Python 3.11 + pandas 3.x stack. pandas-stubs exists but lags pandas
  releases, so ignore_missing_imports is the pragmatic floor.
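
One way to check locally that the marker actually ships with an installed package is sketched below. `ships_typed_marker` is a hypothetical helper, not part of this PR; it relies only on `importlib.resources.files`, which is in the standard library from Python 3.9.

```python
from importlib import resources


def ships_typed_marker(package: str) -> bool:
    """Return True if the importable package contains a PEP 561 py.typed file."""
    # resources.files() imports the package and returns a traversable rooted
    # at its directory, so this inspects what is installed, not the repo tree.
    return resources.files(package).joinpath("py.typed").is_file()
```

Running `ships_typed_marker("data_analytics_demo")` after a non-editable install would show whether the package-data entry took effect, assuming the package is importable in that environment.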
CI feedback: `mypy src` walked the file as both `src.data_analytics_demo.…`
and `data_analytics_demo.…` because the package is editable-installed AND
src is on the filesystem path.

- pyproject.toml: add `mypy_path = "src"` so mypy resolves the package
  unambiguously through its installed name.
- python-test.yml + Makefile: invoke mypy as `mypy -p data_analytics_demo`
  (installed-package mode) instead of `mypy src` (filesystem walk). Same
  coverage, no path collision.
CI feedback: pydantic.EmailStr requires the optional `email-validator`
package, which is not in our dependency set. No AC requires email-format
validation; the field stores a Faker company_email() string and downstream
consumers (dbt staging, ML features) read it as a string anyway. Dropping
EmailStr removes the runtime dep without any functional change.
@leagames0221-sys leagames0221-sys merged commit 539381f into main May 17, 2026
13 checks passed
@leagames0221-sys leagames0221-sys deleted the feat/data-analytics-demo-t03-t12 branch May 17, 2026 12:08
leagames0221-sys added a commit that referenced this pull request May 21, 2026
* feat(data-analytics-demo): T-03 data generation + T-12 Python CI infra

* fix(data-analytics-demo): ruff lint — 4 errors in generate.py

* fix(data-analytics-demo): mypy — PEP 561 typed marker + 3rd-party overrides

* chore: trigger CI re-run on latest HEAD (d87e787)

* fix(data-analytics-demo): mypy src-layout dual-path conflict

* fix(data-analytics-demo): drop EmailStr dependency

---------

Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>