Full-stack, reproducible pipeline to estimate NBA win probabilities before tipoff. Python/FastAPI backend, Next.js frontend, LightGBM + NEAT models, PostgreSQL for persistence.
- Automated ingestion from Balldontlie (schedule/results), NBA.com stats (`nba_api`) as primary, Kaggle CSVs for odds/schedules/box scores, and optional live odds via The Odds API.
- Feature builders for schedule context, roster health/continuity, priors, recency, and odds; synthetic features let us predict any matchup/date even when out-of-window.
- LightGBM pipeline with permutation pruning, calibration, SHAP, and time-aware splits; NEAT-based neuroevolution for feature discovery.
- FastAPI service exposing `/predict`, `/games`, `/teams`, `/elo`, `/feature-usage`, `/model-metadata`; Postgres-backed or parquet fallback.
- Next.js + Tailwind UI for today’s slate, arbitrary head-to-head predictions, model snapshot, and diagnostics.
- Cron-friendly daily refresh scripts to append new games and retrain artifacts.
- Backend: Python 3.11, FastAPI, Uvicorn, Pydantic v2, httpx/tenacity, structlog, pandas/pyarrow/numpy, SQLAlchemy + psycopg, cachetools.
- Modeling: LightGBM, scikit-learn (calibration/metrics), SHAP, deap + custom NEAT, joblib, mlflow hooks.
- Data sources: `nba_api` (scoreboard/stats), Balldontlie, Kaggle NBA datasets, optional The Odds API.
- Frontend: Next.js 14 (App Router), React 18, TypeScript, TailwindCSS, axios; Jest/RTL + Playwright for tests.
- Infra: PostgreSQL primary store; artifacts in `artifacts/` (ignored); cron scripts in `CRON_SETUP_*`.
- Install Python 3.11 and Node.js 18+.
- Create a virtual environment and install dependencies: `python -m venv .venv`, then `source .venv/bin/activate`, then `pip install -e .[dev]`.
- Copy the example environment file and add secrets: `cp .env.example .env`.
- Install Node dependencies for the web app: `cd src/web/next-app`, then `npm install`.
- During development, run both services together with `npm run dev`. This starts Next.js on port 3000 and FastAPI (uvicorn) on port 8000.
| Variable | Description |
|---|---|
| `BALLDONTLIE_API_KEY` | Aux schedule/results; free tier limits apply. |
| `THEODDS_API_KEY` | Optional live odds (The Odds API). |
| `KAGGLE_USERNAME`, `KAGGLE_KEY` | Optional Kaggle CLI credentials for automated dataset downloads. |
| `DATABASE_URL` | PostgreSQL connection string (e.g., `postgresql://user:pass@127.0.0.1:5433/nba`). |
Never commit `.env` or other secret files. Use the system keyring if available.
Defaults live in `configs/default.yml`. Override them via CLI flags or alternate config files. Key sections: `data` (season range/windows), `sources` (provider toggles), `model` (LightGBM/odds/calibrator), `pruning`, `recency`, `web` (odds toggle defaults).
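As a rough illustration of how an alternate config can override defaults section by section, here is a minimal deep-merge sketch. The function name `merge_config` and the keys inside each section (`calibrator`, `use_odds`, `odds_default`) are hypothetical; the actual schema is whatever `configs/default.yml` defines, and the project's own loader may behave differently.

```python
import copy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where `overrides` wins, merging nested sections key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge recursively instead of replacing wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values standing in for a parsed configs/default.yml.
defaults = {
    "model": {"calibrator": "isotonic", "use_odds": True},
    "web": {"odds_default": False},
}
# Hypothetical alternate config that only changes one key.
overrides = {"model": {"use_odds": False}}

config = merge_config(defaults, overrides)
print(config["model"])
```

Merging per key rather than per section means an override file only needs to list the values it changes.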
- Balldontlie: https://www.balldontlie.io/ (schedule/results; throttled free tier)
- NBA.com via
nba_api: https://github.com/swar/nba_api (scoreboard/stats; primary) - Kaggle NBA betting/box-score data (e.g., spreads/totals/moneylines): https://www.kaggle.com/
- Kaggle player box scores & schedules (current dataset): https://www.kaggle.com/datasets/eoinamoore/historical-nba-data-and-player-box-scores
- The Odds API (optional live odds): https://the-odds-api.com/
- In-house ELO (scripted): `scripts/ratings/build_elo.py`
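For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard two-team Elo update. It is not taken from `scripts/ratings/build_elo.py`; the function names, the K-factor of 20, and the 100-point home-court bonus are illustrative assumptions only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that side A beats side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    home: float,
    away: float,
    home_won: bool,
    k: float = 20.0,         # assumed K-factor, not the project's value
    home_adv: float = 100.0,  # assumed home-court bonus, not the project's value
) -> tuple[float, float]:
    """Return (new_home, new_away) ratings after one game."""
    exp_home = expected_score(home + home_adv, away)
    delta = k * ((1.0 if home_won else 0.0) - exp_home)
    # Zero-sum update: whatever the home team gains, the away team loses.
    return home + delta, away - delta


new_home, new_away = update_elo(1500.0, 1500.0, home_won=True)
print(round(new_home, 1), round(new_away, 1))
```

Because ratings are updated only after a game finishes, a rating read before tipoff contains no information about that game's result.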
All feature builders honor `cutoff_ts` (pre-tipoff), so no feature draws on information from after tipoff. Splits are chronological only, never shuffled. Tests guard that each row's maximum feature timestamp precedes its label.
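The two safeguards above can be sketched as follows. This is an illustration under assumed names, not the project's code: the row fields `tipoff_ts` and `max_feature_ts` and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def chronological_split(rows: list[dict], train_frac: float = 0.8):
    """Sort by tipoff time and cut once, so every training game precedes every test game."""
    ordered = sorted(rows, key=lambda r: r["tipoff_ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def assert_no_leakage(rows: list[dict]) -> None:
    """Raise if any row used feature data stamped at or after its own tipoff."""
    for row in rows:
        if row["max_feature_ts"] >= row["tipoff_ts"]:
            raise ValueError(f"feature timestamp leaks past tipoff: {row}")


# Hypothetical rows: five games on consecutive days, features frozen one hour before tipoff.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
games = [
    {
        "tipoff_ts": start + timedelta(days=d),
        "max_feature_ts": start + timedelta(days=d) - timedelta(hours=1),
    }
    for d in (4, 0, 2, 1, 3)
]

assert_no_leakage(games)
train, test = chronological_split(games)
print(len(train), len(test))
```

A shuffled split would let games from later in the season inform predictions for earlier ones, which is why the single cut on sorted rows matters.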
Common commands:
- `python scripts/validate_data.py --output reports/data_check.json`
- `python scripts/fetch_games.py --season-start 2024 --season-end 2025 --plan free`
- `python scripts/fetch_recent_games.py --days-back 3 --write-to-db`
- `python scripts/build_dataset.py --config-path configs/default.yml`
- `python scripts/train_gbm.py --config configs/default.yml`
- `python scripts/train_gbm_timeseries.py --config-path configs/default.yml`
- `python scripts/tune_gbm.py --max-trials 30`
- `python scripts/feature_prune.py --config configs/default.yml --grouped true`
- `python scripts/evaluate.py --dataset-path data/processed/pregame_dataset_latest.parquet`
- `python scripts/neuroevo_run.py --dataset-path data/processed/pregame_dataset.parquet --generations 5`
- `uvicorn src.api.service:app --reload`

- Spin up Postgres (example): `docker run --rm -p 5432:5432 -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres -e POSTGRES_DB=nba postgres:16`
- Set `DATABASE_URL`.
- Init tables: `python scripts/setup_db.py`.
- Ingest: `python scripts/fetch_games.py --season-start 2024 --season-end 2025 --write-to-db` and `python scripts/build_dataset.py --config-path configs/default.yml --write-to-db --db-chunk-size 500`.
- Push artifacts: `python scripts/push_artifacts_to_db.py`.
- Verify: `python scripts/check_db.py`.
- Daily refresh/retrain: `python scripts/fetch_recent_games.py --days-back 3 --write-to-db` and `python scripts/auto_update_model.py` (primary `nba_api`, fallback Balldontlie).
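If you prefer a single entry point for the daily refresh instead of two separate cron lines, a small wrapper along these lines may help. It is a sketch only: `refresh_commands` and `run_refresh` are hypothetical names, and the wrapper assumes it is launched from the repository root so the relative script paths resolve.

```python
import subprocess
import sys


def refresh_commands(days_back: int = 3) -> list[list[str]]:
    """Build the two daily-refresh commands from the step above, in order."""
    return [
        [sys.executable, "scripts/fetch_recent_games.py",
         "--days-back", str(days_back), "--write-to-db"],
        [sys.executable, "scripts/auto_update_model.py"],
    ]


def run_refresh(commands: list[list[str]], runner=subprocess.run) -> None:
    """Run each command in sequence; check=True aborts on the first non-zero exit."""
    for argv in commands:
        runner(argv, check=True)
```

Calling `run_refresh(refresh_commands())` runs the fetch first and retrains only if the fetch succeeded. The `runner` parameter exists so the sequence can be exercised in tests without spawning processes.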
- Run `ruff check .` and `pytest` before submitting changes.
- Frontend lint: `cd src/web/next-app && npm run lint`; tests: `npm test`; e2e: `npm run e2e`.
- Keep data directories empty in git; artifacts live in `artifacts/` and `reports/` (ignored).
- See `docs/monitoring.md` for performance/data-quality checks.
- Feature flags are env-based: `FEATURE_<NAME>=on/true/1`; the helper lives at `src/utils/feature_flags.py` (no flags are currently active).
- API request/response shapes and auth are documented in `docs/api_contract.md`.
- Secrets stay out of git; logs mask secret values except for the last four characters.
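The masking rule can be illustrated with a short helper. This is a sketch of the idea, not the project's logging code: `mask_secret` is a hypothetical name, and fully masking values of four characters or fewer is an added assumption so that short secrets are never echoed whole.

```python
def mask_secret(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        # Assumption: short values are masked entirely rather than shown in full.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


print(mask_secret("abcd1234efgh"))  # ********efgh
```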