feat(data-analytics-demo): T-03 data generation + T-12 Python CI #83
Merged
Conversation
Phase 1 of the data-analytics-demo bolt-on. Ships the synthetic data generator (T-03) alongside the Python CI infrastructure (T-12) so the new Python code is verified by CI from the first commit.

T-03 data generation (AC-1.1 to 1.5 + AC-γ.1 + AC-δ.2):
- src/data_analytics_demo/data/schemas.py: Pydantic models for the 4 SaaS tables (Customer / Subscription / Event / Invoice)
- src/data_analytics_demo/data/generate.py: Faker + numpy synthesis, deterministic via DEMO_RANDOM_SEED (default 42). Writes a DuckDB file at warehouse/analytics.duckdb. Engineered signal: trailing-30d event drop-off biases churn probability; free-tier customers using premium-feature events bias upsell probability (both observable through SQL, no leak from the generator into the ML feature surface).
- tests/test_data_generate.py: 7 pytest cases covering each AC.
- Makefile + cli.py: `make data` and `data-analytics-demo data` now do real work instead of exit-1 TODO placeholders.

T-12 Python CI infrastructure:
- .github/workflows/python-test.yml: Python 3.11, install editable + dev, run ruff + mypy --strict + pytest (with the 80% coverage gate set in pyproject.toml).
- .github/workflows/python-audit.yml: pip-audit --strict against OSV.
- .github/dependabot.yml: pip ecosystem on /packages/data-analytics-demo, grouped by dbt / ml / duckdb / dev for review readability.

Design note: ADR-0070 mentioned the DuckDB tpcds extension as a synthetic-data source. tpcds is a retail benchmark and does not fit the 4-table SaaS schema this package commits to. Reverted to pure Faker + numpy synthesis; ADR-0070 will be amended in T-13 polish phase to record the deviation.

Local verify:
- python -m compileall on src/ + tests/ → OK
- node scripts/check-doc-drift.mjs → 0 failure(s), 0 warning(s)
- node scripts/check-adr-claims.mjs → 77/77 PASS
- HIVE-token sweep on new files → 0 hits (D-HIVE-OPACITY)
CI feedback from #83 (python-test workflow). All 4 are clean style/typing nits:
- F401: drop unused `from collections.abc import Sequence` (TYPE_CHECKING block); the `Sequence` was never referenced.
- UP017: `timezone.utc` -> `UTC` (Python 3.11+ alias).
- T201: `_emit()` is the deliberate single exception to the print-suppression rule for this package, annotated with `noqa: T201` plus a docstring note so the exception is auditable in code review.
- E501: split the timestamps list comprehension at 121 cols into 3 lines.

Verify: python -m compileall src/ OK.
…rrides
CI feedback from #83 (mypy step). 3 errors clear:
- Add `src/data_analytics_demo/py.typed` (PEP 561 marker). Resolves the two `import-untyped` errors on cli.py importing `data_analytics_demo` and `data_analytics_demo.data`; the package now declares inline type info.
- Register the marker in [tool.setuptools.package-data] so it ships in the installed wheel.
- Add a [[tool.mypy.overrides]] block for pandas / duckdb / faker / shap / xgboost / sklearn. None of these publish type stubs that match the current Python 3.11 + pandas 3.x stack. pandas-stubs exists but lags pandas releases, so ignore_missing_imports is the pragmatic floor.
CI feedback: `mypy src` walked the file as both `src.data_analytics_demo.…` and `data_analytics_demo.…` because the package is editable-installed AND src is on the filesystem path.
- pyproject.toml: add `mypy_path = "src"` so mypy resolves the package unambiguously through its installed name.
- python-test.yml + Makefile: invoke mypy as `mypy -p data_analytics_demo` (installed-package mode) instead of `mypy src` (filesystem walk). Same coverage, no path collision.
CI feedback: pydantic.EmailStr requires the optional `email-validator` package, which is not in our dependency set. No AC requires email-format validation; the field stores a Faker company_email() string and downstream consumers (dbt staging, ML features) read it as a string anyway. Dropping EmailStr removes the runtime dep without any functional change.
This was referenced May 17, 2026
leagames0221-sys added a commit that referenced this pull request on May 21, 2026
* feat(data-analytics-demo): T-03 data generation + T-12 Python CI infra
* fix(data-analytics-demo): ruff lint — 4 errors in generate.py
* fix(data-analytics-demo): mypy — PEP 561 typed marker + 3rd-party overrides
* chore: trigger CI re-run on latest HEAD (d87e787)
* fix(data-analytics-demo): mypy src-layout dual-path conflict
* fix(data-analytics-demo): drop EmailStr dependency

Co-authored-by: leagames0221-sys <leagames0221@users.noreply.github.com>
Summary
Phase 1 of the data-analytics-demo bolt-on. Ships T-03 (synthetic-data generator) and T-12 (Python CI infrastructure) together so the new Python code is verified by CI from the first commit. One more pipeline stage now does real work instead of a placeholder:
- `make data` now produces a real DuckDB file at `warehouse/analytics.duckdb`.

T-03: Data generation
- `customers`, `subscriptions`, `events`, `invoices`
- … `DEMO_N_*` env)
- `_emit()` writes `[data] …` lines per stage
- `warehouse_dir.mkdir(parents=True, exist_ok=True)`
- `Faker.seed(seed)` + `np.random.default_rng(seed)`
- `company_email()` / `company()`

Engineered ML signal (so T-06 / T-07 have something to learn):
- Trailing-30d event drop-off correlates with `subscription.status = 'canceled'`
- `feature_use_premium` events on free-tier customers correlate with plan upgrade

Both signals are observable through SQL; no leak from the generator into the ML feature surface.
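A minimal sketch of how such an engineered signal could be wired up, under stated assumptions. It does not reproduce `generate.py`: the function names, base rates, and lift values are all hypothetical, and the standard-library `random.Random` stands in for the `np.random.default_rng(seed)` generator the package actually uses, so that the example runs without third-party dependencies.

```python
import random


def churn_probability(
    events_prior_30d: int,
    events_trailing_30d: int,
    base: float = 0.05,      # assumed baseline churn rate
    max_lift: float = 0.60,  # assumed extra probability at a 100% drop-off
) -> float:
    """Raise churn probability in proportion to the trailing-30d event drop-off."""
    if events_prior_30d <= 0:
        return base
    drop = max(0.0, 1.0 - events_trailing_30d / events_prior_30d)
    return min(1.0, base + max_lift * drop)


def upsell_probability(
    plan: str,
    premium_feature_events: int,
    base: float = 0.02,            # assumed baseline upgrade rate
    lift_per_event: float = 0.05,  # assumed lift per premium-feature event
    cap: float = 0.50,
) -> float:
    """Raise upsell probability for free-tier customers who touch premium features."""
    if plan != "free":
        return base
    return min(cap, base + lift_per_event * premium_feature_events)


def draw_churn_labels(seed: int = 42) -> list[bool]:
    """Draw churn labels deterministically from a seeded generator."""
    rng = random.Random(seed)
    # (events in the prior 30 days, events in the trailing 30 days)
    customers = [(100, 95), (100, 40), (100, 5)]
    return [rng.random() < churn_probability(p, t) for p, t in customers]


# The same seed always yields the same labels.
print(draw_churn_labels(42) == draw_churn_labels(42))
```

The point of the design is that the label depends only on quantities that also exist as rows in the generated tables (event counts, plan tier), so a downstream model has to recover the signal through SQL features rather than from a hidden generator variable.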
T-12: Python CI infrastructure
- `.github/workflows/python-test.yml`: Python 3.11, editable-install with dev extras, then ruff + mypy --strict + pytest (coverage floor 80% configured in `pyproject.toml`)
- `.github/workflows/python-audit.yml`: pip-audit --strict against OSV, weekly cron + on PRs that touch `pyproject.toml`
- `.github/dependabot.yml`: pip ecosystem under `/packages/data-analytics-demo`, grouped by dbt / ml / duckdb / dev for review readability

Design deviation from ADR-0070
ADR-0070 mentioned the DuckDB `tpcds` extension as a synthetic-data source. `tpcds` is a retail benchmark; its 24 tables do not fit the 4-table SaaS schema this package commits to. Reverted to pure Faker + numpy synthesis. ADR-0070 will be amended in the T-13 polish phase to record the deviation transparently.

Verify
Local:
- `node scripts/check-doc-drift.mjs` → 0 failure(s), 0 warning(s)
- `node scripts/check-adr-claims.mjs` → 77/77 PASS

CI:
Test plan