Wildfire Ignition Risk Heatmap for California

Python · Jupyter · scikit-learn · GeoPandas · Folium · License: MIT

An end-to-end geospatial machine learning pipeline that predicts wildfire ignition risk across California at the level of a 4 km grid cell and weekly time step. The system fuses 7+ heterogeneous datasets — satellite weather grids, fuel rasters, census demographics, fire history, and national park visitation — into a unified modeling dataset of ~2.4 million rows, trains classification models, and renders predictions as interactive, zoomable risk heatmaps.

Raw Data (7 sources)          Grid (4 km cells)         ML Pipeline           Risk Heatmap
┌─────────────────┐          ┌───────────────┐        ┌──────────────┐      ┌──────────────┐
│ FPA-FOD fires   │──┐       │               │        │ 2.4M rows    │      │  Interactive  │
│ gridMET weather │──┤       │  ┌──┬──┬──┐   │        │ 6 features   │      │  folium map   │
│ LANDFIRE fuels  │──┤──────▶│  ├──┼──┼──┤   │──────▶ │ LogReg + RF  │─────▶│  + fire pts   │
│ Census/ACS pop  │──┤       │  ├──┼──┼──┤   │        │ ROC-AUC eval │      │  + colorbar   │
│ CalFire perims  │──┤       │  └──┴──┴──┘   │        │ Feature imp. │      │  per week     │
│ NPS parks+visits│──┘       │               │        └──────────────┘      └──────────────┘
└─────────────────┘          └───────────────┘

Features

  • Multi-source geospatial data fusion — integrates vector (shapefiles, GeoPackage), raster (GeoTIFF, NetCDF), and tabular (API, CSV) datasets into a single analysis grid
  • 4 km spatial resolution — analysis grid derived from gridMET cell geometry, covering all of California (~30,000+ cells)
  • Weekly temporal resolution — predictions at the (cell_id, week) level for fine-grained temporal risk tracking
  • 6 engineered features — 7-day rolling mean temperature, population density, fuel type, years since last fire, distance to nearest park, log-transformed park visitor counts
  • Class-imbalance-aware ML — logistic regression and random forest classifiers with balanced class weights to handle the ~99.5% / 0.5% imbalance
  • Time-based train/test split — trains on 2019, tests on 2020 to prevent data leakage
  • Comprehensive evaluation — ROC-AUC, PR-AUC, Brier score, confusion matrices, and feature importance analysis
  • Interactive risk heatmaps — zoomable folium maps with risk intensity layers, actual fire point overlays, and color legends for any selected week
  • Static EDA visualizations — distribution plots, feature vs. target comparisons, and geographic maps via matplotlib and seaborn
  • Reproducible notebook — single Jupyter notebook walks from raw data download through final heatmap in 9 clearly labeled sections

Data Pipeline

Section 1: Data Acquisition
    │   Load 7 datasets: FPA-FOD, gridMET, LANDFIRE, Census, CalFire, NPS parks, visitor stats
    ▼
Section 2: Grid Definition
    │   Build ~30K+ polygons (4 km cells) from gridMET geometry → ca_grid_cells.gpkg
    ▼
Section 3: Fire Labels
    │   Spatial join fires → cells, aggregate to (cell_id, week) → labels_cell_week_full.parquet
    ▼
Section 4: Weather Features
    │   7-day rolling mean tmmx, weekly resample → weather_features_cell_week.parquet
    ▼
Section 5: Human & Geographic Features
    │   Population density, fuel mode, burn history, park distance, visitor counts
    │   → ca_grid_data_done.gpkg
    ▼
Section 6: Modeling Dataset Assembly
    │   Merge labels + weather + static features → modeling_dataset_2019_2020.parquet (2.4M rows)
    ▼
Section 7: Exploratory Data Analysis
    │   Feature distributions, class balance, geographic visualizations
    ▼
Section 8: Model Training & Evaluation
    │   Logistic Regression + Random Forest → metrics + feature importance
    ▼
Section 9: Risk Heatmap
        Score all cells → risk_scores_cell_week.parquet → interactive folium maps

Data Sources

Note: Raw data is not checked into this repo due to size and licensing. The notebook documents where to download each dataset and how to store it under data/raw/.

| Dataset | Source | Format | Purpose |
|---|---|---|---|
| Wildfire Ignitions | FPA-FOD (U.S. Forest Service) | GeoPackage | Historical fire discovery locations and dates (CA, 2000–2020, ≥100 acres) |
| Fire Perimeters | CAL FIRE / FRAP | GeoPackage | Burn history polygons for years-since-last-fire feature |
| Weather (Temperature) | gridMET | NetCDF | Daily max temperature (tmmx) at 4 km resolution |
| Population Density | Census TIGER/Line + ACS API | Shapefile + API | Census tract geometries + population estimates |
| Land Cover / Fuels | LANDFIRE | GeoTIFF | Fire behavior fuel model raster (FBFM13) |
| Park Boundaries | NPS | GeoPackage | National Park unit boundaries for California |
| Park Visitor Counts | Melanie Walsh Dataset | CSV | Annual recreation visits per park (1979–2024) |

Tech Stack

| Category | Libraries |
|---|---|
| Geospatial | geopandas, shapely, rasterio, rioxarray, pyproj, fiona |
| Raster & Gridded Data | xarray, rioxarray, rasterio, dask |
| Data Processing | pandas, numpy, pyarrow |
| Machine Learning | scikit-learn (LogisticRegression, RandomForestClassifier, StandardScaler) |
| Interactive Mapping | folium, branca (HeatMap plugin, colormaps) |
| Visualization | matplotlib, seaborn |
| Environment | Conda (geo environment), Python 3.11 |

Repository Structure

wildfire-proj/
├── wildfire_proj.ipynb           # Main notebook — full pipeline from raw data to heatmap
├── README.md                     # This file
├── data/
│   ├── raw/                      # Downloaded datasets (not in version control)
│   │   ├── fpa_fod/              #   FPA_FOD_20221014.gpkg
│   │   ├── perim/                #   CalFire perimeter polygons
│   │   ├── ca_boundry/           #   California state boundary shapefile
│   │   ├── tract_info/           #   Census TIGER/Line tracts
│   │   ├── gridmet/              #   gridMET NetCDF files
│   │   ├── parks/                #   NPS boundary GeoPackage
│   │   └── landfire/             #   LANDFIRE fuel model GeoTIFF
│   └── processed/                # Notebook outputs (not in version control)
│       ├── ca_grid_cells.gpkg
│       ├── labels_cell_week_full.parquet
│       ├── weather_features_cell_week.parquet
│       ├── ca_grid_data_done.gpkg
│       ├── modeling_dataset_2019_2020.parquet
│       └── risk_scores_cell_week.parquet
└── environment.yml               # Conda environment export (optional)

Getting Started

Prerequisites

  • Conda (Miniconda or Anaconda)
  • ~10 GB disk space for raw datasets
  • Jupyter Notebook or JupyterLab

Installation

  1. Clone the repository

    git clone https://github.com/sohan-shingade/wildfire-proj.git
    cd wildfire-proj
  2. Create the conda environment

    conda create -n geo python=3.11
    conda activate geo
    
    conda install -c conda-forge \
      geopandas rasterio rioxarray xarray netcdf4 \
      shapely fiona pyproj scikit-learn \
      folium branca matplotlib dask pyarrow
    
    pip install seaborn
  3. Download the datasets

    Follow the links in the Data Sources table and place each dataset under data/raw/ in the folder structure shown above.

  4. Run the notebook

    jupyter notebook wildfire_proj.ipynb

    Execute cells sequentially — each section builds on the outputs of previous sections.


Pipeline Walkthrough

1. Grid Definition

Builds a uniform 4 km analysis grid from gridMET's native cell geometry. Each cell gets a unique cell_id used as the join key throughout the pipeline. Saved as ca_grid_cells.gpkg.
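As a rough illustration of the grid step, the sketch below turns gridMET-style cell-center coordinates into one row per cell with a cell_id and bounding box. It assumes gridMET's nominal 1/24-degree spacing; the function name build_cell_table and the column names are hypothetical, and clipping to the California boundary is omitted.

```python
import numpy as np
import pandas as pd

GRIDMET_RES_DEG = 1.0 / 24.0  # gridMET's nominal ~4 km spacing in degrees


def build_cell_table(lons, lats, res=GRIDMET_RES_DEG):
    """Return one row per grid cell with a cell_id and its lon/lat bounds.

    lons and lats are 1-D arrays of cell-center coordinates (hypothetical
    inputs; the notebook derives them from the gridMET NetCDF files).
    """
    lon_grid, lat_grid = np.meshgrid(lons, lats)
    lon_c = lon_grid.ravel()
    lat_c = lat_grid.ravel()
    half = res / 2.0
    return pd.DataFrame(
        {
            "cell_id": np.arange(lon_c.size),
            "minx": lon_c - half,
            "miny": lat_c - half,
            "maxx": lon_c + half,
            "maxy": lat_c + half,
        }
    )


# Two neighboring cell centers on one row of the grid
cells = build_cell_table(
    lons=np.array([-120.0, -120.0 + GRIDMET_RES_DEG]),
    lats=np.array([37.0]),
)
print(cells)
```

Each row's bounds could then be turned into a polygon, for example with shapely.geometry.box(minx, miny, maxx, maxy), before writing the grid to a GeoPackage.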

2. Fire Labels

Fire ignition points from FPA-FOD are spatially joined to grid cells (gpd.sjoin with the within predicate), then aggregated to (cell_id, week) pairs. The binary target fire_occurred is 1 if at least one fire ignited in that cell-week and 0 otherwise. The resulting dataset has ~2.4M rows with extreme class imbalance (~99.5% negative).
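The aggregation half of this step can be sketched with pandas alone. The example assumes each fire record already carries the cell_id assigned by the spatial join; the function name build_cell_week_labels and the discovery_date column are hypothetical, and weeks are labeled by their Monday start date as an assumption.

```python
import pandas as pd


def build_cell_week_labels(fires, cell_ids, start, end):
    """Build a dense (cell_id, week) table with a binary fire_occurred target."""
    fires = fires.copy()
    # Label each fire with the Monday that starts its week
    fires["week"] = fires["discovery_date"].dt.to_period("W").dt.start_time
    positives = (
        fires[["cell_id", "week"]].drop_duplicates().assign(fire_occurred=1)
    )

    # Every cell gets a row for every week, including weeks with no fire
    weeks = pd.date_range(start, end, freq="W-MON")
    full = pd.MultiIndex.from_product(
        [cell_ids, weeks], names=["cell_id", "week"]
    ).to_frame(index=False)

    labels = full.merge(positives, on=["cell_id", "week"], how="left")
    labels["fire_occurred"] = labels["fire_occurred"].fillna(0).astype(int)
    return labels


fires = pd.DataFrame(
    {
        "cell_id": [1, 1, 2],
        "discovery_date": pd.to_datetime(
            ["2020-08-17", "2020-08-19", "2020-08-25"]
        ),
    }
)
labels = build_cell_week_labels(
    fires, cell_ids=[0, 1, 2], start="2020-08-17", end="2020-08-30"
)
print(labels)
```

Two fires in the same cell and week collapse into a single positive row, which is why the target is binary rather than a count.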

3. Weather Features

Daily gridMET maximum temperature is processed through a 7-day rolling mean, then resampled to weekly aggregates per grid cell. This captures antecedent heat conditions that drive fire risk.
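A minimal pandas sketch of this feature, working on a long table of daily values rather than the NetCDF grid. It assumes one row per cell per day with no missing days, and it averages the rolling values within each Monday-start week; the README does not say which weekly aggregate the notebook uses, so the mean is an assumption, as is the function name weekly_tmmx_features.

```python
import numpy as np
import pandas as pd


def weekly_tmmx_features(daily):
    """7-day rolling mean of tmmx per cell, then one value per (cell_id, week)."""
    daily = daily.sort_values(["cell_id", "date"]).copy()
    # Rolling mean over the previous 7 daily rows within each cell
    daily["tmmx_7day_mean"] = daily.groupby("cell_id")["tmmx"].transform(
        lambda s: s.rolling(window=7, min_periods=7).mean()
    )
    # Label each day with the Monday that starts its week
    daily["week"] = daily["date"].dt.to_period("W").dt.start_time
    # Weekly aggregate (assumed: mean of the available rolling values)
    return daily.groupby(["cell_id", "week"], as_index=False)[
        "tmmx_7day_mean"
    ].mean()


# Two weeks of synthetic daily values for a single cell
daily = pd.DataFrame(
    {
        "cell_id": 7,
        "date": pd.date_range("2020-08-03", periods=14, freq="D"),
        "tmmx": np.arange(14, dtype=float),
    }
)
weekly = weekly_tmmx_features(daily)
print(weekly)
```

The first six days of each cell's series have no full 7-day window, so they contribute nothing to the first week's aggregate.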

4. Human & Geographic Features

Five static features are computed per grid cell:

| Feature | Method |
|---|---|
| pop_density | Census tract population / area, joined by cell centroid |
| fuel_mode | Dominant LANDFIRE fuel class via raster zonal statistics |
| years_since_last_fire | Most recent CalFire perimeter year, subtracted from reference |
| dist_to_park_m | Distance from cell centroid to nearest NPS park boundary |
| log_visits | Log-transformed annual visitor count of the nearest park |
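The two purely tabular features in the table can be sketched as follows, once the spatial steps have attached a last fire year and a nearest-park visit count to each cell. The input columns last_fire_year and nearest_park_visits, the reference year, the fill value for cells with no recorded fire, and the use of log1p rather than a plain log are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_tabular_features(cells, reference_year=2020, never_burned_fill=100):
    """Derive years_since_last_fire and log_visits from per-cell attributes."""
    out = cells.copy()
    # Cells with no recorded perimeter get a large placeholder value (assumed)
    out["years_since_last_fire"] = (
        reference_year - out["last_fire_year"]
    ).fillna(never_burned_fill)
    # log1p keeps zero-visit parks finite (assumed choice of transform)
    out["log_visits"] = np.log1p(out["nearest_park_visits"])
    return out


cells = pd.DataFrame(
    {
        "cell_id": [0, 1],
        "last_fire_year": [2015, np.nan],
        "nearest_park_visits": [0, 4_000_000],
    }
)
features = add_tabular_features(cells)
print(features[["cell_id", "years_since_last_fire", "log_visits"]])
```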

5. Modeling

All features are merged into a flat modeling table. The pipeline uses a time-based split (train on 2019, test on 2020) to prevent leakage, applies StandardScaler, and fits two models:

  • Logistic Regression: class_weight='balanced', max 1000 iterations
  • Random Forest: 200 trees, class_weight='balanced_subsample', no max depth cap
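The split and model settings above can be sketched with scikit-learn on synthetic data. The hyperparameters come from the list above; the synthetic table, the two-feature subset, and the random seeds are illustrative assumptions and do not reproduce the notebook's data or results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the modeling table (not the real dataset)
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame(
    {
        "year": rng.choice([2019, 2020], size=n),
        "tmmx_7day_mean": rng.normal(300.0, 8.0, size=n),
        "pop_density": rng.lognormal(3.0, 1.0, size=n),
    }
)
# Hotter cell-weeks are made more likely to ignite
logit = -3.0 + 0.15 * (df["tmmx_7day_mean"] - 300.0)
df["fire_occurred"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

FEATURES = ["tmmx_7day_mean", "pop_density"]
train = df[df["year"] == 2019]  # time-based split: fit on the earlier year
test = df[df["year"] == 2020]   # evaluate on the later year

models = {
    "logreg": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "rf": make_pipeline(
        StandardScaler(),
        RandomForestClassifier(
            n_estimators=200,
            class_weight="balanced_subsample",
            max_depth=None,
            random_state=0,
        ),
    ),
}

auc = {}
for name, model in models.items():
    model.fit(train[FEATURES], train["fire_occurred"])
    proba = model.predict_proba(test[FEATURES])[:, 1]
    auc[name] = roc_auc_score(test["fire_occurred"], proba)

print(auc)
```

Wrapping the scaler in a pipeline means it is fitted on the 2019 rows only, so no statistics from the 2020 test year leak into training.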

6. Risk Scoring & Visualization

The chosen model scores every (cell_id, week) pair with ignition probability. These risk scores power both static matplotlib maps (quantile-normalized) and interactive folium heatmaps with fire point overlays.
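One common way to quantile-normalize scores for a static map is to replace each score with its percentile rank, so that the color scale is not dominated by a handful of very high-risk cells. The sketch below shows that approach as an assumption; the notebook's exact normalization is not described here, and the function name quantile_normalize is hypothetical.

```python
import pandas as pd


def quantile_normalize(scores):
    """Map raw risk scores to percentile ranks in (0, 1]."""
    return scores.rank(method="average", pct=True)


raw = pd.Series([0.001, 0.002, 0.5, 0.003])
normalized = quantile_normalize(raw)
print(normalized.tolist())
```

After the transform, the single extreme score of 0.5 sits at the top of the scale without compressing the remaining cells into one color.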


Model Performance

| Metric | Logistic Regression | Random Forest |
|---|---|---|
| ROC-AUC | Reasonable | Higher |
| PR-AUC | Low (expected) | Low (expected) |
| Brier Score | Computed | Computed |

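PR-AUC is expected to be low under extreme class imbalance (~0.5% positive rate): a random classifier scores a PR-AUC roughly equal to the positive rate, so that rate, not 0.5, is the baseline to beat. The probability threshold is lowered to 0.002 to favor recall for fire events, at the cost of more false positives.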
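The metrics in the table and the effect of the 0.002 threshold can be computed along these lines. This is a sketch on a toy example, using average_precision_score as the PR-AUC estimate; the function name evaluate and the returned keys are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    confusion_matrix,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.002):
    """Threshold-free metrics plus recall and false positives at a threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "pr_baseline": float(y_true.mean()),  # PR-AUC of a random classifier
        "brier": brier_score_loss(y_true, y_prob),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "false_positives": int(fp),
    }


# Toy example: three non-fire cell-weeks and one fire cell-week
report = evaluate(
    y_true=[0, 0, 0, 1],
    y_prob=[0.001, 0.003, 0.0005, 0.01],
)
print(report)
```

Reporting pr_baseline next to pr_auc makes it easier to judge whether a low PR-AUC still improves on chance.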

Top features by Random Forest importance:

  1. tmmx_7day_mean (temperature — dominant predictor)
  2. pop_density
  3. fuel_mode
  4. years_since_last_fire
  5. dist_to_park_m
  6. log_visits

Interactive Heatmap

The final output is an interactive folium map for any selected week:

  • Risk intensity layer — cell centroids colored by predicted ignition probability
  • Fire point overlay — actual fires that ignited that week shown as markers
  • Colorbar legend — low-to-high risk scale using the OrRd colormap
  • Full interactivity — pan, zoom, hover, and export to HTML

    # Generate a risk map for a specific week
    make_risk_map_folium("2020-08-17")

Future Work

  • Add more weather variables (VPD, wind speed, precipitation deficit)
  • Incorporate elevation and slope from DEM data
  • Test gradient boosting models (XGBoost, LightGBM)
  • Extend temporal range beyond 2019–2020
  • Build a Streamlit or Dash dashboard for real-time risk exploration
  • Add NDVI/EVI vegetation indices from MODIS or Sentinel-2

License

This project is available under the MIT License.


Built with geospatial Python for wildfire risk research
