This project is a production-style end-to-end data pipeline that ingests NBA data from public APIs, applies data quality checks, orchestrates workflows with Apache Airflow, loads data into MotherDuck (DuckDB), and transforms it for analytics using dbt.
The focus is on real-world data engineering practices.
```
NBA API
   ↓
Python Ingestion Layer
  • API ingestion, rate limiting, error handling
  • Data quality checks, ingestion metadata logging
   ↓
Parquet (Raw Zone)
   ↓
MotherDuck / DuckDB (RAW schema) — truncate + reload (idempotent)
   ↓
dbt (staging → marts) — incremental fact supported
   ↓
Analytics / BI
```
Tech stack:

- Python 3
- nba_api
- Pandas / PyArrow
- Apache Airflow (Dockerized)
- DuckDB + MotherDuck
- Docker & Docker Compose
- Git
- dbt (dbt-duckdb)
- Streamlit
NBA data is ingested using the official nba_api Python package.
Endpoints used:

- `CommonAllPlayers` → Players dimension
- `teams.get_teams()` → Teams dimension
- `PlayerCareerStats` → Player season statistics (fact table)
- `TeamDetails` → Arenas dimension (arena name, capacity, front office)
- `DraftHistory` → Draft dimension (draft year, round, pick, drafting team)
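The ingestion layer's rate limiting and error handling can be sketched as a small retry wrapper around an endpoint call. This is a minimal illustration, not the project's actual code; `fetch_with_retry` is a hypothetical helper name:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def fetch_with_retry(call: Callable[[], T], retries: int = 3,
                     base_delay: float = 1.0) -> T:
    """Invoke an API call, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # e.g. 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

In the pipeline, `call` would be something like `lambda: CommonAllPlayers().get_data_frames()[0]`, with `base_delay` doubling as a crude rate limit between retries.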
Dimensions:

- `dim_players`
- `dim_teams`
- `dim_seasons`
- `dim_arenas` — arena name, capacity, owner, GM, head coach, G-League affiliate (one row per team)
- `dim_draft` — draft year, round, pick number, drafting team, college/org (one row per drafted player)
Fact table:

- `fact_player_season_stats` — grain: one row per `(player_id, season_id, team_id)`
Data quality is enforced during ingestion:
- Primary key uniqueness
- Not-null constraints
- Minimum row count thresholds
- Graceful handling of partial API failures
Pipelines continue even if individual API calls fail.
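Checks of this shape can be expressed as a single validation pass over each ingested DataFrame. The function below is an illustrative sketch, not the project's actual `ingestion/data_quality.py` API:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, pk: list[str],
                       not_null: list[str], min_rows: int) -> list[str]:
    """Return a list of failed checks; an empty list means the frame passed."""
    failures = []
    if len(df) < min_rows:
        failures.append(f"row_count {len(df)} < {min_rows}")
    if df.duplicated(subset=pk).any():
        failures.append(f"duplicate primary key on {pk}")
    for col in not_null:
        if df[col].isna().any():
            failures.append(f"nulls in {col}")
    return failures
```

Returning a list (rather than raising on the first failure) lets the pipeline log every violation and still continue past a partially failed entity.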
Each ingestion run logs metadata including:
- Pipeline name
- Entity name
- Row count
- Status (`SUCCESS`/`FAILED`)
- Error message (if any)
- Run timestamp
Stored as: `data/metadata/ingestion_metadata.parquet`
This enables monitoring, debugging, and auditing.
Apache Airflow is fully Dockerized.
The DAG includes an idempotency guard so reruns don't duplicate work:
- Task: `idempotency_guard` (`ShortCircuitOperator`) runs first.
- Logic: if a successful run already exists for the same logical date (execution date), the guard returns `False` and all downstream tasks are skipped (ingestion, load, transform).
- DAG setting: `max_active_runs=1`, so only one run executes at a time.
To force a full re-run for the same date, clear or mark the existing successful run before triggering again.
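Stripped of Airflow specifics, the guard's decision logic reduces to a small callable (a sketch; in the real DAG this would be wrapped in a `ShortCircuitOperator`, and the set of prior successful dates would come from Airflow's metadata DB rather than a parameter):

```python
from datetime import date

def should_run(logical_date: date, successful_dates: set[date]) -> bool:
    """ShortCircuitOperator callable: True -> proceed, False -> skip downstream.

    successful_dates stands in for a lookup of prior successful DAG runs;
    passing it explicitly is a simplification for illustration.
    """
    if logical_date in successful_dates:
        return False  # already succeeded for this date: skip everything
    return True
```

Returning `False` is what makes reruns cheap: the scheduler marks every downstream task as skipped instead of re-ingesting and re-loading.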
- Run dimension and fact ingestions (PythonOperator)
- Load parquet into MotherDuck (truncate + reload)
- Run dbt transform (staging + marts)
- Enforce task dependencies, retries, SLAs
```
idempotency_guard
   ↓
ingestion:
   fetch_dim_players ──┐
   fetch_dim_teams ────┴──► fetch_fact_player_season_stats ──► fetch_dim_seasons
   fetch_dim_teams ────────► fetch_dim_arenas
   fetch_dim_draft (independent)
   ↓
load: load_dim_players, load_dim_teams, load_dim_seasons, load_fact_player_season_stats,
      load_dim_arenas, load_dim_draft
   ↓
transform: dbt_run
```
- Raw data is loaded into the MotherDuck `RAW` schema.
- Tables are created if they do not exist.
- Idempotent loads: each load drops and recreates the table before inserting from parquet, so reruns do not create duplicate rows (full refresh per run).
- Connection is managed via environment variables (`.env`).
```
MOTHERDUCK_TOKEN=<your token>
MOTHERDUCK_DATABASE=nba      # optional, defaults to "nba"
DBT_TARGET_SCHEMA=analytics  # optional, defaults to "analytics"
```
To drop every pipeline table in MotherDuck and then repopulate only the raw tables:

1. Load env (from the project root):

   ```
   set -a && source .env && set +a
   ```

2. Drop all MotherDuck tables:

   ```
   python -m warehouse.drop_all_snowflake_objects
   ```

3. Run ingestion (writes parquet to `data/raw/`):

   ```
   python -m ingestion.fetch_players
   python -m ingestion.fetch_teams
   python -m ingestion.fetch_player_season_stats
   python -m ingestion.fetch_seasons
   python -m ingestion.fetch_arenas
   python -m ingestion.fetch_draft
   ```

   (Or run the full DAG and stop after the load task group.)

4. Run load (copy parquet into MotherDuck RAW):

   ```
   python -m warehouse.load_dim_players
   python -m warehouse.load_dim_teams
   python -m warehouse.load_dim_seasons
   python -m warehouse.load_fact_player_season_stats
   python -m warehouse.load_dim_arenas
   python -m warehouse.load_dim_draft
   ```

5. Verify `ingested_at` in raw:

   ```
   python -m warehouse.verify_raw_ingested_at
   ```
dbt runs after the load step and builds the analytics layer in MotherDuck.
- Adapter: `dbt-duckdb`, connecting to MotherDuck via `MOTHERDUCK_TOKEN`.
- Profile: `dbt_nba/profiles.yml` — uses `type: duckdb`, path `md:<database>`.
- Staging: views in the `staging` schema (e.g. `stg_players`, `stg_teams`, `stg_player_season_stats`, `stg_seasons`, `stg_arenas`, `stg_draft`).
- Marts: incremental tables in the `marts` schema:
  - `dim_players`, `dim_teams`, `dim_seasons`, `dim_arenas`, `dim_draft` — MERGE on primary key
  - `fact_player_season_stats` — MERGE on `(player_id, season_id, team_id)`
  - `player_season_performance`, `active_players`
- Incremental: all mart models use the MERGE strategy. Because raw is truncated and reloaded each run, every row is processed on every run; use `--full-refresh` to rebuild from scratch.
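An incremental mart model of this shape might look like the following sketch (illustrative only: the column list is abbreviated and the config assumes the adapter's merge strategy, as described above):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key=['player_id', 'season_id', 'team_id']
) }}

select
    player_id,
    season_id,
    team_id,
    pts,
    reb,
    ast
from {{ ref('stg_player_season_stats') }}
```

With `unique_key` set to the fact grain, each run merges the reloaded raw rows into the existing table instead of appending duplicates.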
dbt is invoked from the Airflow DAG task `transform.dbt_run`. To run locally:

```
cd dbt_nba
dbt deps
dbt run --profiles-dir .
dbt test --profiles-dir .
dbt docs generate --profiles-dir . && dbt docs serve --profiles-dir .
```

Data governance is implemented across catalog, lineage, quality, ownership, and policies:
- Catalog & lineage: dbt docs (`dbt docs generate && dbt docs serve`), YAML descriptions, ingestion metadata.
- Quality: ingestion checks (`ingestion/data_quality.py`) and dbt tests (`dbt test`).
- Ownership: dbt `meta` (owner, tier) and Airflow DAG owner; see the generated docs.
- Policies: naming (dim/fact/stg), SLAs on tasks, idempotency (guard + truncate-reload).
See `docs/DATA_GOVERNANCE.md` for the full governance guide.