◆ BMW LUXURY SALES ANALYTICS ◆

Production-grade analytics, econometrics & decision intelligence for the BMW luxury-car market

Econometrics · Gradient Boosting · Tabular Deep Learning · External-Data Augmentation · SHAP · Streamlit · Docker · CI/CD

▶ Open the live dashboard

Dashboard preview

Live tour: executive overview → data integrity → econometrics → ML benchmark → scenario simulator.

Executive Overview	Data Integrity

SQL Insights (DuckDB)	Econometrics

ML Benchmark	Explainability (SHAP)

Scenario Simulator	Decision under uncertainty

DuckDB SQL · interactive Plotly · SHAP explainability · Bayesian-flavoured what-if simulator with credible intervals

1. Overview

An end-to-end decision-support platform built on 15 years (2010–2024) of BMW sales records (50,000 transactions, 11 features). It pairs rigorous econometrics with modern machine learning, enriches the data with real external APIs (macro-economics, fuel prices, CO₂ regulation, FX), and ships a premium Streamlit dashboard behind a fully containerised, CI/CD-tested codebase.

Data source: the base dataset is the public BMW Sales Dataset on Kaggle by eshummalik. All external macro/fuel/CO₂/FX context is added by this project (see ADR-0003).

This project proves two things

1 — I can build a model that works. The pipeline reaches a cross-validated R² ≈ 0.85 on signal-bearing data, with SHAP recovering the true drivers — a validated model, not a lucky split. (predictive capability)

2 — I won't fake it when the data is empty. This particular dataset is structurally pristine but signal-free (every feature is statistically independent of the targets), and Sales_Classification is a leaked threshold on Sales_Volume. On it the same pipeline honestly scores R² ≈ 0 / AUC ≈ 0.5 — proven with a permutation test and a positive control, not hidden. Business value is then delivered through a clearly-labelled Scenario Simulator.

Predictive competence and intellectual honesty — that is the senior deliverable. Evidence: Predictive Capability · Data Integrity · Signal Audit · ADR-0002.

2. Headline results (honest, reproducible)

Analysis	Result	What it means
Max \|correlation\| among numeric features	0.009	Features are mutually independent noise
Price elasticity of demand (log-log, HC3)	−0.001 (p = 0.92)	No measurable price sensitivity in-sample
Hedonic price model R²	0.0004	Price is unexplained by attributes here
Regression R² (best of XGB/LGBM/CatBoost)	≈ 0.00	Boosting cannot beat the mean — no signal
Classification ROC-AUC (leakage-free)	≈ 0.51	No discriminative signal once leakage removed
Classification ROC-AUC (leak left in)	1.00	The signature of target leakage
Permutation test (label-shuffle)	p ≈ 0.90	Real score indistinguishable from chance
Predictive capability (same pipeline, signal-bearing target)	CV R² ≈ 0.85 ± 0.003	The pipeline does predict — when there is signal
Tabular MLP vs gradient boosting	both no-skill	Deep learning not justified (ADR-0004)

Reports: econometrics · model benchmark · DL vs ML.

3. Architecture

bmw-sales/
├── src/bmw_sales/
│   ├── config.py            # typed pydantic-settings + canonical DatasetSchema
│   ├── data/                # loader (schema validation) · validation (integrity report)
│   ├── audit/               # No-Signal Auditor: permutation · positive control · KS · χ²
│   ├── apis/                # hybrid real+mock clients · enrichment join
│   │   ├── base.py          #   cache + retry + circuit breaker + provenance
│   │   ├── worldbank.py · fx_rates.py · fuel_prices.py · co2_regulations.py
│   ├── features/            # domain feature engineering
│   ├── econometrics/        # OLS hedonic · demand · elasticity · VIF · leakage proof
│   ├── models/              # preprocessing · XGB/LGBM/CatBoost · tabular MLP · MLflow
│   ├── simulation/          # Scenario Simulator + Monte-Carlo uncertainty
│   ├── explainability/      # SHAP attributions
│   └── sql/                 # DuckDB analytics over sql/queries/*.sql
├── app/                     # Streamlit premium UI (theme · data_access · 7 tabs)
├── sql/queries/             # versioned analytical SQL
├── tests/                   # pytest suite (unit + integration)
├── docs/                    # MkDocs Material site + 9 ADRs
├── reports/                 # generated analyses (committed)
├── Dockerfile · docker-compose.yml · .github/workflows/{main,docs}.yml
└── Makefile · mkdocs.yml · pyproject.toml · requirements*.txt

Design rationale: ADR-0001.

Pipeline (end-to-end)

The full flow from raw data to a deployed decision-support app. The honest-analytics spine (gold) is what makes this a senior deliverable: the data is audited and proven signal-free before any model is trusted.

flowchart TB
    RAW["Raw dataset<br/>BMW_sales_data 2010–2024<br/>50,000 rows × 11 cols"]

    subgraph L1["① Data foundation · bmw_sales.data"]
        LOAD["loader.py<br/>schema validation · dtype coercion"]
        VAL["validation.py<br/>correlation · ANOVA · mutual-info · leakage"]
    end

    subgraph L2["② Signal audit · bmw_sales.audit"]
        PERM["permutation / label-shuffle test<br/>p ≈ 0.90 → no signal"]
        CTRL["positive control<br/>synthetic R² ≈ 0.86 vs real ≈ 0"]
        KS["KS-uniformity · χ² independence"]
    end

    subgraph L3["③ External augmentation · bmw_sales.apis"]
        WB["WorldBank · FX<br/>real endpoints"]
        FC["Fuel · CO₂<br/>mock-first"]
        BASE["base.py<br/>cache → retry → circuit-breaker → mock"]
        ENR["enrichment.py<br/>region × year × fuel panel join"]
    end

    subgraph L4["④ Features · bmw_sales.features"]
        FE["engineering.py<br/>age · usage · electrified · log transforms"]
    end

    subgraph L5["⑤ Modelling"]
        ECON["econometrics<br/>hedonic OLS · elasticity (HC3) · leakage proof"]
        ML["ml_models<br/>XGBoost · LightGBM · CatBoost + RandomizedSearchCV"]
        DL["dl_models<br/>PyTorch tabular MLP (early stopping)"]
    end

    subgraph L6["⑥ Decision intelligence"]
        SIM["simulation<br/>elasticity scenario + Monte-Carlo CIs"]
        SHAP["explainability<br/>SHAP attributions"]
        SQL["sql · DuckDB<br/>region · price · YoY · electrification"]
    end

    REPORTS[("reports/<br/>integrity · signal_audit · econometric<br/>model_benchmark · dl_vs_ml · sql_insights")]
    MLF[("MLflow<br/>./mlruns")]
    ART[("models/*.joblib")]

    APP["Streamlit app · 7 tabs<br/>Overview · Integrity · SQL · Econometrics<br/>ML · SHAP · Scenario Simulator"]

    RAW --> LOAD --> VAL
    RAW --> L2
    RAW --> SQL
    LOAD --> ENR
    WB & FC --> BASE --> ENR
    VAL --> FE
    ENR --> FE
    FE --> ECON & ML & DL
    ML --> ART
    ML --> SHAP
    ENR -. macro baselines .-> SIM
    L2 --> REPORTS
    ECON & ML & DL & SQL --> REPORTS
    ML --> MLF
    REPORTS --> APP
    SIM & SHAP & SQL --> APP

    classDef honest fill:#241f08,stroke:#D4AF37,stroke-width:2px,color:#fff;
    classDef store fill:#15151a,stroke:#8FA9C7,color:#cfe;
    class RAW,L1,L2,FE honest;
    class REPORTS,MLF,ART store;

Hybrid-API resilience (offline-safe by design)

Every external client degrades gracefully, so CI/Docker run with no network or keys yet the real path is proven live.

flowchart LR
    REQ["client.fetch(region, years)"] --> C{"disk cache hit?"}
    C -- yes --> HIT["return cached<br/>provenance = cache"]
    C -- no --> OFF{"offline mode<br/>or breaker open?"}
    OFF -- yes --> MOCK["deterministic mock<br/>provenance = mock"]
    OFF -- no --> LIVE["HTTP GET + retry/backoff"]
    LIVE -- success --> SAVE["cache + return<br/>provenance = live"]
    LIVE -- failure --> TRIP["trip circuit-breaker"] --> MOCK

Delivery — tests, CI/CD & deployment

flowchart LR
    DEV["commit on feature/* branch"] --> PC["pre-commit<br/>black · isort · flake8 · mypy"]
    PC --> PUSH["push → main"]
    PUSH --> CI{"GitHub Actions"}
    CI --> Q["quality (3.11 / 3.12)<br/>black · isort · flake8 · mypy<br/>pytest + 62% coverage gate"]
    CI --> SEC["pip-audit"]
    Q --> DK["Docker build + Trivy scan"]
    PUSH --> DOCS["MkDocs build"]
    DK -. image .-> HF["HF Spaces (Docker)<br/>live app"]
    DOCS --> GP["GitHub Pages<br/>docs site"]

4. External-data augmentation (hybrid: real + mock)

Four sources mapped to the six regions via official World Bank aggregate codes (EAS, NAC, MEA, LCN, EMU, SSF) and representative currencies/countries. Every client caches responses, retries with backoff, and trips a circuit breaker to a deterministic mock on failure — so the project runs fully offline yet three of the four sources are validated live against real APIs.

Source	Status	Real endpoint	Signal it adds
World Bank macro	🟢 real	inflation `FP.CPI.TOTL.ZG`, GDP/cap `NY.GDP.PCAP.CD`	regional purchasing power
FX rates	🟢 real	exchangerate.host	local-currency price normalisation
CO₂ emissions	🟢 real	World Bank CO₂/capita `EN.GHG.CO2.PC.CE.AR5`	the electrification transition
Fuel prices	🟡 mock-first	WB pump-price `EP.PMP.SGAS.CD` archived by WB (2024)	Petrol/Diesel vs electrified economics

Honesty applies to the data layer too: fuel stays mock-first because the World Bank archived its pump-price series — the real hook is kept and the provenance is reported as mock rather than faking it.

Details: ADR-0003.

5. The Scenario Simulator (where business value lives)

Because the data cannot forecast, decision value comes from an explicit what-if simulation — a constant-elasticity demand model with literature-grounded priors (own-price ε ≈ −0.6, income ε ≈ +1.5, fuel cross-elasticity, CO₂-regulation shift) and baselines seeded from the real macro APIs. Every driver's contribution is decomposed in a waterfall chart, and all assumptions are adjustable in the UI. It is never presented as a fit to the historical data.

6. Quickstart

# Install (dev includes linting, tests, torch for the DL benchmark)
make install-dev                 # or: pip install -r requirements-dev.txt

make eda                         # regenerate the Data Integrity Report
make pipeline                    # train & benchmark all models (writes reports/)
make test                        # full suite, offline & deterministic
make app                         # launch the dashboard → http://localhost:8501

Docker

docker compose up --build        # → http://localhost:8501

Or pull the published image from the GitHub Container Registry (built, scanned and pushed by CI on every main update):

docker run -p 8501:8501 ghcr.io/maxime2476/bmw-sales-analytics:latest

Managed deployment (Streamlit Community Cloud or Hugging Face Spaces): see DEPLOYMENT.md.

On Windows + Anaconda, KMP_DUPLICATE_LIB_OK=TRUE is set in-code to avoid the known OpenMP (libiomp5md.dll) clash when importing PyTorch.

7. Quality & engineering

Typed (PEP 484) and mypy-clean across the src/ package.
Formatted & linted: black, isort, flake8 — all clean; pre-commit hooks run the same gates locally.
Tested: a pytest suite behind a coverage gate of ≥ 62% (live status in the CI and Codecov badges above) — schema, leakage, mock determinism & circuit-breaker fallback, leakage-aware splits, signal audit, predictive capability, Monte-Carlo simulator, SQL layer, report builders; real-data checks marked integration. A guard test keeps this gate in sync between the README and CI.
Security: pip-audit dependency scan, Trivy image scan, and Dependabot updates (pip · actions · docker).
SQL analytics: decision queries in sql/queries/ executed by DuckDB directly over the CSV (window functions, quantiles, YoY) — make sql.
Advanced uncertainty & causality: conformal prediction (calibrated, distribution-free intervals that honestly widen to ~95% of range on the signal-free data — make conformal), a causal price→demand analysis via backdoor adjustment under an explicit DAG (make causal), and an optional Claude-powered scenario narrator with a deterministic offline fallback.
Experiment tracking: every benchmarked model is logged to MLflow (mlflow ui --backend-store-uri ./mlruns).
Docs site: MkDocs Material (ADRs + auto API reference) auto-deployed to GitHub Pages.
CI/CD: GitHub Actions — lint + type + test matrix (3.11/3.12) with a coverage gate → cached Docker build + Trivy scan. See ADR-0005, ADR-0007.

8. Business insights for decision-makers

This dataset cannot price or forecast. Any model claiming high accuracy on it is either leaking the target or overfitting noise — a useful red-flag heuristic for reviewing vendor models.
Pricing & go-to-market must lean on external signals (regional income, fuel economics, CO₂ regulation) — exactly what the Simulator operationalises.
The electrification transition is the real story: regulation stringency, not historical volume, should drive the Petrol→Electric portfolio mix.

9. Architecture Decision Records

ADR	Decision
0001	Architecture & stack
0002	Data-integrity finding & honest-modelling strategy
0003	Hybrid external-data augmentation
0004	DL tested, not assumed
0005	Containerisation & CI/CD
0006	Statistical signal audit & positive control
0007	SQL analytics & hardened quality gates
0008	Decision-making under uncertainty (Monte-Carlo)
0009	Experiment tracking & published docs site

10. Author

Maxime GOURGUECHON — maxime.gourguechon76@gmail.com

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
.github		.github
.streamlit		.streamlit
app		app
data		data
docs		docs
models		models
notebooks		notebooks
reports		reports
sql/queries		sql/queries
src/bmw_sales		src/bmw_sales
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DEPLOYMENT.md		DEPLOYMENT.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
docker-compose.yml		docker-compose.yml
mkdocs.yml		mkdocs.yml
packages.txt		packages.txt
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

◆ BMW LUXURY SALES ANALYTICS ◆

Production-grade analytics, econometrics & decision intelligence for the BMW luxury-car market

▶ Open the live dashboard

Dashboard preview

1. Overview

This project proves two things

2. Headline results (honest, reproducible)

3. Architecture

Pipeline (end-to-end)

Hybrid-API resilience (offline-safe by design)

Delivery — tests, CI/CD & deployment

4. External-data augmentation (hybrid: real + mock)

5. The Scenario Simulator (where business value lives)

6. Quickstart

Docker

7. Quality & engineering

8. Business insights for decision-makers

9. Architecture Decision Records

10. Author

License

About

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

◆ BMW LUXURY SALES ANALYTICS ◆

Production-grade analytics, econometrics & decision intelligence for the BMW luxury-car market

▶ Open the live dashboard

Dashboard preview

1. Overview

This project proves two things

2. Headline results (honest, reproducible)

3. Architecture

Pipeline (end-to-end)

Hybrid-API resilience (offline-safe by design)

Delivery — tests, CI/CD & deployment

4. External-data augmentation (hybrid: real + mock)

5. The Scenario Simulator (where business value lives)

6. Quickstart

Docker

7. Quality & engineering

8. Business insights for decision-makers

9. Architecture Decision Records

10. Author

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages