37 changes: 34 additions & 3 deletions docs/adr/0070-data-analytics-demo-polyglot-adoption.md
@@ -1,8 +1,8 @@
# ADR-0070: Adopt polyglot (Python + TypeScript) for `packages/data-analytics-demo` — local-only SaaS customer-analytics demo

-- Status: Accepted
-- Date: 2026-05-17
-- Tags: architecture, polyglot, data-analytics, dbt, evidence, ollama, security, supply-chain
+- Status: Accepted (amended 2026-05-18: dashboard pivoted from Evidence to a self-built Python+Jinja2+Plotly generator; see "2026-05-18 amendment" section below)
+- Date: 2026-05-17 (original), 2026-05-18 (amendment)
+- Tags: architecture, polyglot, data-analytics, dbt, ollama, security, supply-chain
- Companions: [ADR-0001](0001-monorepo-turborepo-pnpm.md) (the monorepo layout this ADR extends with a polyglot package)

## Context
@@ -104,3 +104,34 @@ Rejected: **dbt-labs/jaffle-shop-template** — no LICENSE file in default branc
- **TypeScript-only**: rejected — see Tradeoff 1. Cannot meet the quality bar with current TS data-analytics tooling.
- **Separate repo (polyrepo)**: rejected — contradicts the monorepo decision in ADR-0001 and forfeits the "complex portfolio operated as a single deliverable" interview signal.
- **Defer the demo**: rejected — the contract brief is live; deferring loses the matching window.

## 2026-05-18 amendment — dashboard pivot

The original Tradeoff 4 chose Evidence as the dashboard generator. Evidence is a high-quality OSS tool (MIT, evidence-dev/evidence, 6k+ stars) and the rationale still holds on paper, but the integration cost in this monorepo proved open-ended, with each fix exposing another failure:

- Evidence ships a SvelteKit-based build (`evidence build`) that requires its own flat `node_modules` for `@sveltejs/kit`, `vite`, `@evidence-dev/tailwind`, and several other transitive peers to be resolvable from generated template code.
- pnpm 10's isolated `node_modules` layout and strict build-script approval gate broke this in three different ways on consumer Windows. Each fix surfaced the next missing peer: a chain of four or more peer-dependency resolution failures locally before the pivot.
- The dashboard sits at the seam between the Python pipeline (data + dbt + ML + narrative) and the static HTML output. Adopting Evidence meant adopting a second package manager (pnpm or npm) inside an otherwise-Python sub-tree, with its own audit + Dependabot + CI surface.

**Decision**: replace Evidence with a self-built Python+Jinja2+Plotly generator that lives entirely inside `src/data_analytics_demo/dashboard/`. This adds two PyPI dependencies (jinja2, BSD; plotly, MIT; both well known and already on the audit allowlist) and ships ~150 lines of code that read the same dbt marts and write static HTML to `dashboard/build/`.
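The flow in the decision (read rows, build chart fragments, render a template, write static files) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's `render` module: `string.Template` stands in for Jinja2, an HTML table stands in for a Plotly figure, and `table_fragment`, `build_site`, and the sample rows are hypothetical.

```python
from __future__ import annotations

import html
import tempfile
from pathlib import Path
from string import Template

# Stand-in for a Jinja2 template; the real generator uses Jinja2.
PAGE = Template(
    "<!doctype html>\n"
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1>$body</body></html>\n"
)


def table_fragment(rows: list[dict[str, object]]) -> str:
    """Render rows as an HTML table fragment (stand-in for a Plotly div)."""
    if not rows:
        return "<p>No data.</p>"
    columns = list(rows[0])
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(r[c]))}</td>" for c in columns) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def build_site(pages: dict[str, list[dict[str, object]]], out_dir: Path) -> list[Path]:
    """Write one static HTML file per page name and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in pages.items():
        path = out_dir / f"{name}.html"
        path.write_text(
            PAGE.substitute(title=html.escape(name), body=table_fragment(rows)),
            encoding="utf-8",
        )
        written.append(path)
    return written


# Hypothetical sample rows shaped like the RFM mart.
sample = {"rfm": [{"rfm_segment": "champions", "customers": 12}]}
out = build_site(sample, Path(tempfile.mkdtemp()) / "build")
```

The output is plain files on disk, so viewing the dashboard needs no server process. That is the property the Streamlit and Superset options lack.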

### Why this is the better fit

- **Smaller blast radius**: 2 PyPI deps instead of 629 npm deps with the associated peer-dep tangle. `pip-audit` covers the whole surface.
- **Single toolchain**: the dashboard now runs through the same Python venv, ruff, mypy, pytest gates as the rest of the package; no second package manager, no separate workflow.
- **Stronger portfolio signal**: "self-built static dashboard generator from synthetic SaaS marts" reads as analytics-engineering breadth; "I configured Evidence" reads as tool adoption.
- **Full layout control**: Plotly figures + Jinja2 templates give the demo the same chart types Evidence was going to produce (bar / scatter / line / area / heatmap / data table) without the SvelteKit indirection.

### Tradeoff 4 (revised)

| Option | Status | Why |
| ----------------------------------------- | -------------------- | ---------------------------------------------------------------------- |
| **Python + Jinja2 + Plotly (self-built)** | adopted | Single toolchain, 2 PyPI deps, full control, audit-clean |
| Evidence | rejected | Peer-dep chain unbounded in this monorepo; second toolchain added cost |
| Streamlit | rejected (unchanged) | Requires a Python server at view time; no static export |
| Quarto | rejected (unchanged) | BI focus weaker than the alternatives; CLI install required |
| Apache Superset | rejected (unchanged) | Full server with significant install overhead |

### What the rest of this ADR still gets right

Tradeoffs 1 (polyglot), 2 (DuckDB + Faker synthetic data), 3 (dbt), 5 (Ollama), and 6 (MetricFlow) are unchanged. The security mitigations (DuckDB ≥ 1.4.2 pin, pip-audit, Dependabot) and the polyglot CI structure carry over.
3 changes: 1 addition & 2 deletions packages/data-analytics-demo/Makefile
@@ -36,8 +36,7 @@ narrative:
	$(PYTHON) -m data_analytics_demo.narrative.generate

dashboard:
-	@echo "[dashboard] TODO T-09: Evidence dashboard not yet implemented"
-	@exit 1
+	$(PYTHON) -m data_analytics_demo.dashboard.render

semantic-validate:
	@echo "[semantic-validate] TODO T-10: MetricFlow validation not yet implemented"
5 changes: 4 additions & 1 deletion packages/data-analytics-demo/pyproject.toml
@@ -28,6 +28,9 @@ dependencies = [
    # CLI + data validation
    "typer>=0.14",
    "pydantic>=2.9",
    # Dashboard (self-built static HTML; replaces Evidence; see ADR-0070 amendment).
    "jinja2>=3.1",
    "plotly>=5.24",
]

[project.optional-dependencies]
@@ -74,7 +77,7 @@ mypy_path = "src"
# but lags behind `pandas` releases; treating these as untyped is the
# pragmatic choice for a Python 3.11 + pandas 3.x stack.
[[tool.mypy.overrides]]
-module = ["pandas", "pandas.*", "duckdb", "faker", "shap", "xgboost", "sklearn.*"]
+module = ["pandas", "pandas.*", "duckdb", "faker", "shap", "xgboost", "sklearn.*", "plotly", "plotly.*"]
ignore_missing_imports = true

[tool.pytest.ini_options]
9 changes: 9 additions & 0 deletions packages/data-analytics-demo/src/data_analytics_demo/cli.py
@@ -45,6 +45,15 @@ def ml() -> None:
    )


@app.command()
def dashboard() -> None:
    """Render the static HTML dashboard into dashboard/build/."""
    from data_analytics_demo.dashboard import render as dashboard_render

    out = dashboard_render.main()
    typer.echo(f"wrote dashboard pages to {out}")


@app.command()
def narrative() -> None:
    """Generate an executive narrative from SHAP via local Ollama."""
@@ -0,0 +1,9 @@
"""Self-built static-HTML dashboard generator (replaces Evidence per ADR-0070 amend).

Reads marts from `warehouse/analytics.duckdb`, builds Plotly figures, and
renders Jinja2 templates into `dashboard/build/{index,rfm,churn,kpi}.html`.

Pure Python — no npm, no SvelteKit, no peer-dep chains. Build is
single-process and reproducible via the same seed that feeds the data
generator.
"""
@@ -0,0 +1,97 @@
"""Plotly figure builders. Each function returns an HTML string ready to embed."""

from __future__ import annotations

from typing import TYPE_CHECKING

import plotly.express as px

if TYPE_CHECKING:
    import pandas as pd

# CDN keeps the per-page HTML small (~10KB instead of 4MB inline plotly.js).
PLOTLY_JS_MODE = "cdn"


def _to_div(fig: object) -> str:
    """Render a plotly figure as a div fragment (no <html><body>)."""
    html: str = fig.to_html(  # type: ignore[attr-defined]
        include_plotlyjs=PLOTLY_JS_MODE,
        full_html=False,
        config={"displaylogo": False},
    )
    return html
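Review note: `_to_div` types `fig` as `object` and silences mypy with `# type: ignore[attr-defined]`. One possible alternative is a small structural type, sketched below. `SupportsToHtml`, `to_div`, and `FakeFigure` are hypothetical names, and the sketch assumes only that the figure exposes a keyword-driven `to_html` returning a string, as used in the diff.

```python
from __future__ import annotations

from typing import Any, Protocol


class SupportsToHtml(Protocol):
    """Structural type for anything exposing a Plotly-style `to_html`."""

    def to_html(self, **kwargs: Any) -> str: ...


def to_div(fig: SupportsToHtml) -> str:
    """Render a figure-like object as an HTML fragment (no <html><body>)."""
    return fig.to_html(
        include_plotlyjs="cdn",
        full_html=False,
        config={"displaylogo": False},
    )


class FakeFigure:
    """Test double that records the keyword arguments it receives."""

    def __init__(self) -> None:
        self.kwargs: dict[str, Any] = {}

    def to_html(self, **kwargs: Any) -> str:
        self.kwargs = kwargs
        return "<div>chart</div>"


fake = FakeFigure()
fragment = to_div(fake)
```

A protocol also lets the chart builders be unit-tested with a stub figure, without importing Plotly.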


def rfm_bar(df: pd.DataFrame) -> str:
    fig = px.bar(
        df,
        x="rfm_segment",
        y="customers",
        text="customers",
        title="Customers per RFM segment",
    )
    fig.update_layout(xaxis_title="Segment", yaxis_title="Customers", height=400)
    return _to_div(fig)


def rfm_scatter(df: pd.DataFrame) -> str:
    fig = px.scatter(
        df,
        x="recency_days",
        y="frequency_events",
        color="rfm_segment",
        size="monetary_usd",
        hover_data=["customer_id"],
        title="Recency × Frequency (size = monetary)",
    )
    fig.update_layout(
        xaxis_title="Recency (days; lower is better)",
        yaxis_title="Frequency (event count)",
        height=520,
    )
    return _to_div(fig)


def churn_by_tier_bar(df: pd.DataFrame) -> str:
    fig = px.bar(
        df,
        x="current_plan_tier",
        y="churn_pct",
        text="churn_pct",
        title="Churn rate by plan tier",
    )
    fig.update_layout(xaxis_title="Plan tier", yaxis_title="Churn %", height=400)
    return _to_div(fig)


def signups_line(df: pd.DataFrame) -> str:
    fig = px.line(df, x="month", y="signups", title="Monthly signups")
    fig.update_layout(xaxis_title="Month", yaxis_title="New customers", height=400)
    return _to_div(fig)


def paid_invoice_area(df: pd.DataFrame) -> str:
    fig = px.area(
        df,
        x="month",
        y="paid_amount_usd",
        title="Paid invoice volume per month (USD)",
    )
    fig.update_layout(xaxis_title="Month", yaxis_title="USD", height=400)
    return _to_div(fig)


def cohort_heatmap(df: pd.DataFrame) -> str:
    pivot = df.pivot_table(
        index="cohort_month", columns="months_since_signup", values="retention_pct"
    )
    fig = px.imshow(
        pivot,
        labels={"x": "Months since signup", "y": "Cohort month", "color": "Retention %"},
        title="Cohort retention heatmap",
        color_continuous_scale="Blues",
        aspect="auto",
    )
    fig.update_layout(height=480)
    return _to_div(fig)
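Review note: `cohort_heatmap` relies on `pivot_table` to turn the long `(cohort_month, months_since_signup, retention_pct)` rows into a grid before calling `px.imshow`. The reshaping can be sketched in plain Python, assuming one row per cell. `pivot_long_to_grid` and the sample values are hypothetical, and missing cells are left as `None` where pandas would use `NaN`.

```python
from __future__ import annotations


def pivot_long_to_grid(
    rows: list[tuple[str, int, float]],
) -> tuple[list[str], list[int], list[list[float | None]]]:
    """Return (row labels, column labels, grid) for long-format cohort rows."""
    cohorts = sorted({cohort for cohort, _, _ in rows})
    offsets = sorted({offset for _, offset, _ in rows})
    lookup = {(cohort, offset): pct for cohort, offset, pct in rows}
    # Cells with no source row stay None (younger cohorts have fewer months).
    grid = [[lookup.get((c, o)) for o in offsets] for c in cohorts]
    return cohorts, offsets, grid


rows = [
    ("2025-01", 0, 100.0),
    ("2025-01", 1, 80.0),
    ("2025-02", 0, 100.0),
]
cohorts, offsets, grid = pivot_long_to_grid(rows)
```

The most recent cohorts have fewer observed months, so the grid is ragged at the lower right. The heatmap therefore needs to tolerate empty cells.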
@@ -0,0 +1,126 @@
"""SQL queries against the dbt marts.

Each function takes an open DuckDB connection and returns a DataFrame.
Centralising the SQL here keeps the templates focused on layout.
"""

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import duckdb
    import pandas as pd


def _scalar(con: duckdb.DuckDBPyConnection, sql: str) -> float:
    """Run a single-cell aggregate query and return the value (or 0 if empty)."""
    row = con.execute(sql).fetchone()
    if row is None:
        return 0.0
    return float(row[0])


def headline_metrics(con: duckdb.DuckDBPyConnection) -> dict[str, float]:
    """Top-of-page numbers: customers, active rate, churn rate."""
    n_customers = _scalar(con, "select count(*) from customers")
    active_rate = _scalar(
        con,
        "select coalesce(avg(case when status='active' then 1.0 else 0.0 end)*100, 0) "
        "from subscriptions",
    )
    churn_rate = _scalar(
        con, "select coalesce(avg(is_churned)*100, 0) from churn_features"
    )
    return {
        "customers": int(n_customers),
        "active_rate": round(active_rate, 1),
        "churn_rate": round(churn_rate, 1),
    }
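Review note on `_scalar`: `fetchone()` returns `None` only when the query yields no rows. An aggregate such as `sum()` or `avg()` over an empty table still yields one row holding SQL `NULL`, and `float(None)` raises `TypeError`. The queries in `headline_metrics` avoid this with `count(*)` and `coalesce`, so the guard below only matters for future callers. The sketch uses the standard library's `sqlite3` as a stand-in for DuckDB (both expose `execute(...).fetchone()`), and `scalar_or_zero` is a hypothetical name.

```python
from __future__ import annotations

import sqlite3


def scalar_or_zero(con: sqlite3.Connection, sql: str) -> float:
    """Return the single-cell result as a float, or 0.0 for no row or NULL."""
    row = con.execute(sql).fetchone()
    if row is None or row[0] is None:
        return 0.0
    return float(row[0])


con = sqlite3.connect(":memory:")
con.execute("create table invoices (amount_usd real)")

# Empty table: sum() returns a single row containing NULL.
empty_total = scalar_or_zero(con, "select sum(amount_usd) from invoices")

con.execute("insert into invoices values (12.5), (7.5)")
total = scalar_or_zero(con, "select sum(amount_usd) from invoices")
```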


def rfm_distribution(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select rfm_segment, count(*) as customers, round(avg(monetary_usd), 0) as avg_monetary
        from rfm_segments
        group by rfm_segment
        order by customers desc
        """
    ).fetchdf()


def rfm_scatter(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select customer_id, recency_days, frequency_events, monetary_usd, rfm_segment
        from rfm_segments
        """
    ).fetchdf()


def churn_by_tier(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select
            current_plan_tier,
            count(*) as customers,
            round(avg(is_churned)*100, 1) as churn_pct,
            round(avg(events_last_30d), 1) as avg_events_30d
        from churn_features
        group by current_plan_tier
        order by churn_pct desc
        """
    ).fetchdf()


def churn_activity_buckets(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select
            case
                when recent_to_lifetime_ratio is null then 'no activity'
                when recent_to_lifetime_ratio < 0.3 then '0.0 – 0.3 (slowing)'
                when recent_to_lifetime_ratio < 0.7 then '0.3 – 0.7'
                when recent_to_lifetime_ratio < 1.5 then '0.7 – 1.5 (steady)'
                else '1.5+ (accelerating)'
            end as activity_bucket,
            count(*) as customers,
            round(avg(is_churned)*100, 1) as churn_pct
        from churn_features
        group by activity_bucket
        order by churn_pct desc
        """
    ).fetchdf()
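Review note: the `case` expression in `churn_activity_buckets` uses half-open intervals, so a ratio of exactly 0.3 lands in `0.3 – 0.7` rather than the slowing bucket. A plain-Python mirror of the same thresholds can help when unit-testing the boundaries. `activity_bucket` is a hypothetical helper, not part of this diff.

```python
from __future__ import annotations


def activity_bucket(ratio: float | None) -> str:
    """Mirror the SQL case expression: the first matching branch wins."""
    if ratio is None:
        return "no activity"
    if ratio < 0.3:
        return "0.0 – 0.3 (slowing)"
    if ratio < 0.7:
        return "0.3 – 0.7"
    if ratio < 1.5:
        return "0.7 – 1.5 (steady)"
    return "1.5+ (accelerating)"


# Probe each boundary value, plus the NULL case.
labels = [activity_bucket(r) for r in (None, 0.0, 0.3, 0.7, 1.5)]
```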


def monthly_signups(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select date_trunc('month', signup_date) as month, count(*) as signups
        from customers
        group by 1 order by 1
        """
    ).fetchdf()


def monthly_paid_invoice_volume(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select date_trunc('month', period_start) as month,
               sum(amount_usd) as paid_amount_usd
        from invoices
        where status = 'paid'
        group by 1 order by 1
        """
    ).fetchdf()


def cohort_retention_grid(con: duckdb.DuckDBPyConnection) -> pd.DataFrame:
    return con.execute(
        """
        select cohort_month, months_since_signup, retention_pct
        from cohort_retention
        order by cohort_month, months_since_signup
        """
    ).fetchdf()