An end-to-end geospatial machine learning pipeline that predicts wildfire ignition risk across California at the level of a 4 km grid cell and weekly time step. The system fuses 7+ heterogeneous datasets — satellite weather grids, fuel rasters, census demographics, fire history, and national park visitation — into a unified modeling dataset of ~2.4 million rows, trains classification models, and renders predictions as interactive, zoomable risk heatmaps.
Raw Data (7 sources)         Grid (4 km cells)       ML Pipeline           Risk Heatmap
┌─────────────────┐          ┌───────────────┐       ┌──────────────┐      ┌──────────────┐
│ FPA-FOD fires   │──┐       │               │       │ 2.4M rows    │      │ Interactive  │
│ gridMET weather │──┤       │  ┌──┬──┬──┐   │       │ 6 features   │      │ folium map   │
│ LANDFIRE fuels  │──┤──────▶│  ├──┼──┼──┤   │──────▶│ LogReg + RF  │─────▶│ + fire pts   │
│ Census/ACS pop  │──┤       │  ├──┼──┼──┤   │       │ ROC-AUC eval │      │ + colorbar   │
│ CalFire perims  │──┤       │  └──┴──┴──┘   │       │ Feature imp. │      │ per week     │
│ NPS parks+visits│──┘       │               │       └──────────────┘      └──────────────┘
└─────────────────┘          └───────────────┘
- Features
- Data Pipeline
- Data Sources
- Tech Stack
- Repository Structure
- Getting Started
- Pipeline Walkthrough
- Model Performance
- Interactive Heatmap
- Future Work
- License
- Multi-source geospatial data fusion — integrates vector (shapefiles, GeoPackage), raster (GeoTIFF, NetCDF), and tabular (API, CSV) datasets into a single analysis grid
- 4 km spatial resolution — analysis grid derived from gridMET cell geometry, covering all of California (~30,000+ cells)
- Weekly temporal resolution — predictions at the (cell_id, week) level for fine-grained temporal risk tracking
- 6 engineered features — 7-day rolling mean temperature, population density, fuel type, years since last fire, distance to nearest park, log-transformed park visitor counts
- Class-imbalance-aware ML — logistic regression and random forest classifiers with balanced class weights to handle the ~99.5% / 0.5% imbalance
- Time-based train/test split — trains on 2019, tests on 2020 to prevent data leakage
- Comprehensive evaluation — ROC-AUC, PR-AUC, Brier score, confusion matrices, and feature importance analysis
- Interactive risk heatmaps — zoomable folium maps with risk intensity layers, actual fire point overlays, and color legends for any selected week
- Static EDA visualizations — distribution plots, feature vs. target comparisons, and geographic maps via matplotlib and seaborn
- Reproducible notebook — single Jupyter notebook walks from raw data download through final heatmap in 9 clearly labeled sections
Section 1: Data Acquisition
│ Load 7 datasets: FPA-FOD, gridMET, LANDFIRE, Census, CalFire, NPS parks, visitor stats
▼
Section 2: Grid Definition
│ Build ~30K+ polygons (4 km cells) from gridMET geometry → ca_grid_cells.gpkg
▼
Section 3: Fire Labels
│ Spatial join fires → cells, aggregate to (cell_id, week) → labels_cell_week_full.parquet
▼
Section 4: Weather Features
│ 7-day rolling mean tmmx, weekly resample → weather_features_cell_week.parquet
▼
Section 5: Human & Geographic Features
│ Population density, fuel mode, burn history, park distance, visitor counts
│ → ca_grid_data_done.gpkg
▼
Section 6: Modeling Dataset Assembly
│ Merge labels + weather + static features → modeling_dataset_2019_2020.parquet (2.4M rows)
▼
Section 7: Exploratory Data Analysis
│ Feature distributions, class balance, geographic visualizations
▼
Section 8: Model Training & Evaluation
│ Logistic Regression + Random Forest → metrics + feature importance
▼
Section 9: Risk Heatmap
Score all cells → risk_scores_cell_week.parquet → interactive folium maps
Note: Raw data is not checked into this repo due to size and licensing. The notebook documents where to download each dataset and how to store it under data/raw/.
| Dataset | Source | Format | Purpose |
|---|---|---|---|
| Wildfire Ignitions | FPA-FOD (U.S. Forest Service) | GeoPackage | Historical fire discovery locations and dates (CA, 2000–2020, ≥100 acres) |
| Fire Perimeters | CAL FIRE / FRAP | GeoPackage | Burn history polygons for years-since-last-fire feature |
| Weather (Temperature) | gridMET | NetCDF | Daily max temperature (tmmx) at 4 km resolution |
| Population Density | Census TIGER/Line + ACS API | Shapefile + API | Census tract geometries + population estimates |
| Land Cover / Fuels | LANDFIRE | GeoTIFF | Fire behavior fuel model raster (FBFM13) |
| Park Boundaries | NPS | GeoPackage | National Park unit boundaries for California |
| Park Visitor Counts | Melanie Walsh Dataset | CSV | Annual recreation visits per park (1979–2024) |
| Category | Libraries |
|---|---|
| Geospatial | geopandas, shapely, rasterio, rioxarray, pyproj, fiona |
| Raster & Gridded Data | xarray, rioxarray, rasterio, dask |
| Data Processing | pandas, numpy, pyarrow |
| Machine Learning | scikit-learn (LogisticRegression, RandomForestClassifier, StandardScaler) |
| Interactive Mapping | folium, branca (HeatMap plugin, colormaps) |
| Visualization | matplotlib, seaborn |
| Environment | Conda (geo environment), Python 3.11 |
wildfire-proj/
├── wildfire_proj.ipynb # Main notebook — full pipeline from raw data to heatmap
├── README.md # This file
├── data/
│ ├── raw/ # Downloaded datasets (not in version control)
│ │ ├── fpa_fod/ # FPA_FOD_20221014.gpkg
│ │ ├── perim/ # CalFire perimeter polygons
│ │ ├── ca_boundry/ # California state boundary shapefile
│ │ ├── tract_info/ # Census TIGER/Line tracts
│ │ ├── gridmet/ # gridMET NetCDF files
│ │ ├── parks/ # NPS boundary GeoPackage
│ │ └── landfire/ # LANDFIRE fuel model GeoTIFF
│ └── processed/ # Notebook outputs (not in version control)
│ ├── ca_grid_cells.gpkg
│ ├── labels_cell_week_full.parquet
│ ├── weather_features_cell_week.parquet
│ ├── ca_grid_data_done.gpkg
│ ├── modeling_dataset_2019_2020.parquet
│ └── risk_scores_cell_week.parquet
└── environment.yml # Conda environment export (optional)
- Conda (Miniconda or Anaconda)
- ~10 GB disk space for raw datasets
- Jupyter Notebook or JupyterLab
- Clone the repository

      git clone https://github.com/sohan-shingade/wildfire-proj.git
      cd wildfire-proj

- Create the conda environment

      conda create -n geo python=3.11
      conda activate geo
      conda install -c conda-forge \
        geopandas rasterio rioxarray xarray netcdf4 \
        shapely fiona pyproj scikit-learn \
        folium branca matplotlib dask pyarrow
      pip install seaborn

- Download the datasets

  Follow the links in the Data Sources table and place each dataset under data/raw/ in the folder structure shown above.

- Run the notebook

      jupyter notebook wildfire_proj.ipynb

  Execute cells sequentially — each section builds on the outputs of previous sections.
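Before running the notebook, it can help to confirm that the raw data folders are populated. The following is a minimal sketch, not part of the notebook: the helper name `missing_raw_dirs` is hypothetical, and the folder names are taken from the repository structure shown above. The visitor-count CSV is not checked because its location is not listed in that tree.

```python
from pathlib import Path

# Folder names as listed under data/raw/ in the repository structure.
RAW_SUBDIRS = [
    "fpa_fod", "perim", "ca_boundry", "tract_info",
    "gridmet", "parks", "landfire",
]

def missing_raw_dirs(root="data/raw"):
    """Return the expected raw-data folders that are absent or empty."""
    root = Path(root)
    missing = []
    for name in RAW_SUBDIRS:
        folder = root / name
        # A folder counts as present only if it exists and holds at least one entry.
        if not (folder.is_dir() and any(folder.iterdir())):
            missing.append(name)
    return missing

if __name__ == "__main__":
    for name in missing_raw_dirs():
        print(f"data/raw/{name} is missing or empty")
```

An empty return value means every listed folder contains at least one file; it does not verify that the files are the right ones.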
Builds a uniform 4 km analysis grid from gridMET's native cell geometry. Each cell gets a unique cell_id used as the join key throughout the pipeline. Saved as ca_grid_cells.gpkg.
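The idea of a cell_id join key can be illustrated with plain index arithmetic on a regular 1/24-degree grid, which is gridMET's nominal (~4 km) spacing. This is only a sketch of the indexing idea: the notebook builds cell polygons from gridMET's own geometry, and the origin, column count, and row-major numbering below are assumptions chosen for illustration.

```python
import math

# Assumed values for illustration; not read from gridMET.
CELLS_PER_DEGREE = 24              # 1/24 degree spacing (~4 km)
LON_MIN, LAT_MIN = -124.5, 32.5    # hypothetical south-west corner of the grid
N_COLS = 252                       # hypothetical number of columns

def cell_id_for(lon, lat):
    """Map a lon/lat point to a row-major integer cell_id."""
    col = math.floor((lon - LON_MIN) * CELLS_PER_DEGREE)
    row = math.floor((lat - LAT_MIN) * CELLS_PER_DEGREE)
    if not (0 <= col < N_COLS) or row < 0:
        raise ValueError("point lies outside the grid")
    return row * N_COLS + col

def cell_bounds(cell_id):
    """Return (west, south, east, north) in degrees for a cell_id."""
    row, col = divmod(cell_id, N_COLS)
    west = LON_MIN + col / CELLS_PER_DEGREE
    south = LAT_MIN + row / CELLS_PER_DEGREE
    step = 1 / CELLS_PER_DEGREE
    return (west, south, west + step, south + step)

cid = cell_id_for(-119.99, 37.01)
print(cid, cell_bounds(cid))
```

Because the id is derived from row and column, the same point always maps to the same cell, which is what makes it usable as a join key across datasets.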
Fire ignition points from FPA-FOD are spatially joined to grid cells (gpd.sjoin with the within predicate), then aggregated to (cell_id, week) pairs. The binary target fire_occurred is set to 1 if any fire ignited in that cell-week and 0 otherwise. The resulting dataset has ~2.4M rows with extreme class imbalance (~99.5% negative).
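The aggregation step that follows the spatial join can be sketched with pandas alone. The toy table below stands in for ignitions that already carry a cell_id; the column name `discovery_date` and the Monday week start are assumptions, not details confirmed by the notebook.

```python
import pandas as pd

# Toy stand-in for ignitions after the spatial join (column names assumed).
fires = pd.DataFrame({
    "cell_id": [10, 10, 11, 12],
    "discovery_date": pd.to_datetime(
        ["2020-08-17", "2020-08-19", "2020-08-25", "2020-09-01"]
    ),
})

# Map each ignition to the Monday that starts its week.
fires["week"] = fires["discovery_date"].dt.to_period("W-SUN").dt.start_time

# One row per (cell_id, week) that saw at least one ignition.
positives = (
    fires[["cell_id", "week"]].drop_duplicates().assign(fire_occurred=1)
)

# Full cell-by-week index so that fire-free cell-weeks appear as zeros.
cells = [10, 11, 12, 13]
weeks = pd.date_range("2020-08-17", "2020-08-31", freq="W-MON")
full = pd.MultiIndex.from_product(
    [cells, weeks], names=["cell_id", "week"]
).to_frame(index=False)

labels = full.merge(positives, on=["cell_id", "week"], how="left")
labels["fire_occurred"] = labels["fire_occurred"].fillna(0).astype(int)
print(labels)
```

Two ignitions in the same cell and week collapse to a single positive row, which is why the target is binary rather than a count.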
Daily gridMET maximum temperature is processed through a 7-day rolling mean, then resampled to weekly aggregates per grid cell. This captures antecedent heat conditions that drive fire risk.
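A rolling mean followed by a weekly aggregate might look like the sketch below. It uses a toy daily series for one cell rather than the gridMET NetCDF data. The choice of the mean as the weekly aggregate, the requirement of a full 7-day window, and the Monday week start are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily series for a single cell: three full weeks, Monday to Sunday.
daily = pd.DataFrame({
    "cell_id": 10,
    "date": pd.date_range("2020-08-03", periods=21, freq="D"),
    "tmmx": 300.0 + np.arange(21, dtype=float),  # toy temperatures
}).sort_values(["cell_id", "date"])

# 7-day trailing mean per cell; the first six days have no full window.
daily["tmmx_7day_mean"] = daily.groupby("cell_id")["tmmx"].transform(
    lambda s: s.rolling(window=7, min_periods=7).mean()
)

# Weekly aggregate per cell; bins end on Sunday and are labelled by that date.
weekly = (
    daily.groupby(["cell_id", pd.Grouper(key="date", freq="W-SUN")])
    ["tmmx_7day_mean"].mean()
    .reset_index()
)

# Re-label each bin by the Monday that starts it, to match a week-start key.
weekly["week"] = weekly["date"] - pd.Timedelta(days=6)
print(weekly[["cell_id", "week", "tmmx_7day_mean"]])
```

Note that the first week's value rests on a single day here, because the rolling window only fills on day seven. In practice the daily series would start before the first modeled week.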
Five static features are computed per grid cell:
| Feature | Method |
|---|---|
| pop_density | Census tract population / area, joined by cell centroid |
| fuel_mode | Dominant LANDFIRE fuel class via raster zonal statistics |
| years_since_last_fire | Most recent CalFire perimeter year, subtracted from reference |
| dist_to_park_m | Distance from cell centroid to nearest NPS park boundary |
| log_visits | Log-transformed annual visitor count of the nearest park |
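Three of these per-cell computations reduce to simple arithmetic once the spatial work is done. The sketch below shows that arithmetic on toy values. The function names mirror the feature names, but the bodies are assumptions: in particular, `log1p` is one common choice of log transform, and how the notebook treats cells with no recorded fire is not stated.

```python
import math
from collections import Counter

def fuel_mode(pixel_classes):
    """Most common fuel class among the raster pixels falling in one cell."""
    return Counter(pixel_classes).most_common(1)[0][0]

def years_since_last_fire(last_fire_year, reference_year):
    """Reference year minus the most recent burn year; None if never burned."""
    if last_fire_year is None:
        return None  # fill strategy for unburned cells is left open here
    return reference_year - last_fire_year

def log_visits(annual_visits):
    """log(1 + visits), which stays defined when a park reports zero visits."""
    return math.log1p(annual_visits)

print(fuel_mode([1, 2, 2, 10, 2, 1]))     # toy pixel classes for one cell
print(years_since_last_fire(2015, 2020))
print(log_visits(4_000_000))
```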
All features are merged into a flat modeling table. The pipeline uses a time-based split (train on 2019, test on 2020) to prevent leakage, applies StandardScaler, and fits two models:
- Logistic Regression — class_weight='balanced', max 1000 iterations
- Random Forest — 200 trees, class_weight='balanced_subsample', no max depth cap
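The split-scale-fit sequence can be sketched on synthetic data as follows. The model settings follow the description above; the data, the two-feature subset, the `random_state` values, and the use of scaled inputs for both models are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the modeling table (not the project's data).
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "year": np.repeat([2019, 2020], n // 2),
    "tmmx_7day_mean": rng.normal(300, 8, n),
    "pop_density": rng.lognormal(3, 1, n),
})
# Rare positive class whose probability rises with temperature.
p = 1 / (1 + np.exp(-(df["tmmx_7day_mean"] - 318) / 3))
df["fire_occurred"] = (rng.random(n) < p).astype(int)

features = ["tmmx_7day_mean", "pop_density"]
train, test = df[df["year"] == 2019], df[df["year"] == 2020]

# Fit the scaler on the training year only, then apply it to both years.
scaler = StandardScaler().fit(train[features])
X_train = scaler.transform(train[features])
X_test = scaler.transform(test[features])
y_train = train["fire_occurred"]

logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
logreg.fit(X_train, y_train)

forest = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced_subsample",
    max_depth=None,
    random_state=0,
)
forest.fit(X_train, y_train)

# Predicted ignition probability for every test row.
risk_lr = logreg.predict_proba(X_test)[:, 1]
risk_rf = forest.predict_proba(X_test)[:, 1]
```

Fitting the scaler on 2019 alone keeps statistics from the test year out of training, in the same spirit as the time-based split.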
The chosen model scores every (cell_id, week) pair with ignition probability. These risk scores power both static matplotlib maps (quantile-normalized) and interactive folium heatmaps with fire point overlays.
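One way to quantile-normalize scores for mapping is to rank them within each week, so the color scale spans the full range even when raw probabilities are all small. The notebook's exact normalization is not described here, so the percentile-rank approach and the column names below are assumptions.

```python
import pandas as pd

# Toy risk scores for four cells over two weeks (values are illustrative).
scores = pd.DataFrame({
    "cell_id": [10, 11, 12, 13, 10, 11, 12, 13],
    "week": pd.to_datetime(["2020-08-17"] * 4 + ["2020-08-24"] * 4),
    "risk": [0.001, 0.004, 0.020, 0.002, 0.003, 0.003, 0.050, 0.001],
})

# Percentile rank within each week: 1.0 marks that week's highest-risk cell.
scores["risk_quantile"] = scores.groupby("week")["risk"].rank(pct=True)

# Slice a single week for plotting.
week_slice = scores[scores["week"] == pd.Timestamp("2020-08-17")]
print(week_slice)
```

Ranking discards the absolute probability, so a map built this way shows relative risk within a week rather than how risky the week is overall.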
| Metric | Logistic Regression | Random Forest |
|---|---|---|
| ROC-AUC | Reasonable | Higher |
| PR-AUC | Low (expected) | Low (expected) |
| Brier Score | Computed | Computed |
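PR-AUC is inherently low under extreme class imbalance (~0.5% positive rate): a no-skill classifier scores a PR-AUC roughly equal to the positive rate (~0.005), so that is the baseline to compare against. The probability threshold is tuned down to 0.002, rather than the default 0.5, to favor recall for fire events at the cost of more false positives.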
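The effect of a very low threshold is easiest to see on a handful of toy scores. The sketch below computes precision, recall, and the Brier score in plain Python. The labels and probabilities are made up, and treating a score equal to the threshold as a positive is an assumption.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when scores >= threshold are flagged as fires."""
    flagged = [p >= threshold for p in y_prob]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy cell-week outcomes and scores; rare positives, small probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
y_prob = [0.0005, 0.001, 0.0015, 0.003, 0.001,
          0.0008, 0.004, 0.02, 0.0012, 0.0025]

print(precision_recall_at(y_true, y_prob, 0.002))  # low threshold
print(precision_recall_at(y_true, y_prob, 0.5))    # default threshold
print(brier_score(y_true, y_prob))
```

On these toy values the 0.002 threshold catches both fires while flagging two fire-free cell-weeks, whereas 0.5 flags nothing at all.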
Top features by Random Forest importance:
1. tmmx_7day_mean (temperature — dominant predictor)
2. pop_density
3. fuel_mode
4. years_since_last_fire
5. dist_to_park_m
6. log_visits
The final output is an interactive folium map for any selected week:
- Risk intensity layer — cell centroids colored by predicted ignition probability
- Fire point overlay — actual fires that ignited that week shown as markers
- Colorbar legend — low-to-high risk scale using the OrRd colormap
- Full interactivity — pan, zoom, hover, and export to HTML
# Generate a risk map for a specific week
make_risk_map_folium("2020-08-17")

- Add more weather variables (VPD, wind speed, precipitation deficit)
- Incorporate elevation and slope from DEM data
- Test gradient boosting models (XGBoost, LightGBM)
- Extend temporal range beyond 2019–2020
- Build a Streamlit or Dash dashboard for real-time risk exploration
- Add NDVI/EVI vegetation indices from MODIS or Sentinel-2
This project is available under the MIT License.
Built with geospatial Python for wildfire risk research