Wildfire Analysis & Prediction for California

Course: IS-517: Methods of Data Science

Team: Apurva Malpure · Prithvish Taukari · Sai Kiran Billa · Yashvi Bhatt

Overview

Wildfires in California are a growing threat to life, property, and the environment. This project applies machine learning to two primary California wildfire datasets to:

Predict daily wildfire ignition risk from weather and seasonal signals.
Classify unknown fire causes (≈31 % of records) as human-made or natural.
Predict fire duration using weather and temporal features.
Identify geographic wildfire hotspots and risk zones across the state.

Motivation

Roughly 31 % of CALFIRE incident records carry unknown or undocumented causes. Environmental factors such as temperature, wind speed, precipitation, and season interact non-linearly, which makes rule-based approaches inadequate. By combining supervised classification, regression, and spatial clustering, we aim to turn raw incident logs into actionable risk intelligence.

Required Datasets

Three datasets are required to reproduce this project. They are not tracked in this repository due to file size. Place each file in the data/ folder before running any script.

File	Required by	Where to obtain
`fires_with_max_weather_corrected.csv`	All RQ2 scripts, RQ3, RQ4, Rmd	Not publicly available. Team-engineered from CALFIRE perimeters + weather APIs. Contact the course instructor or project team to obtain this file.
`CA_Weather_Fire_Dataset_1984-2025.csv`	`research-question-4.R`, Rmd (RQ1)	Publicly available on Zenodo. https://zenodo.org/records/14712845
`CA_Weather_Fire_Dataset_2008_2023_with_duration.csv`	Rmd (RQ3 SVM duration analysis)	Not publicly available. Team-engineered from CALFIRE perimeters with duration calculations applied. Contact the course instructor or project team.

Engineering steps already applied to fires_with_max_weather_corrected.csv:

FIRE_DURATION calculated from alarm and containment dates.
Latitude / Longitude extracted from polygon centroids.
Temperature and wind data integrated via external APIs.
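
The duration step can be illustrated with a short base-R sketch. This is not the team's actual pipeline: the column names `ALARM_DATE` and `CONT_DATE`, the toy dates, and the choice of whole days as the unit are all assumptions made for illustration.

```r
# Toy incident records; the column names ALARM_DATE / CONT_DATE are assumed.
fires <- data.frame(
  ALARM_DATE = c("2020-08-16", "2021-07-13", "2019-10-23"),
  CONT_DATE  = c("2020-08-20", "2021-07-14", "2019-10-23"),
  stringsAsFactors = FALSE
)

# Parse ISO dates; subtracting two Date vectors yields a difference in days.
alarm <- as.Date(fires$ALARM_DATE, format = "%Y-%m-%d")
cont  <- as.Date(fires$CONT_DATE,  format = "%Y-%m-%d")
fires$FIRE_DURATION <- as.numeric(cont - alarm)

# Flag records whose containment date is missing or precedes the alarm date.
fires$DURATION_VALID <- !is.na(fires$FIRE_DURATION) & fires$FIRE_DURATION >= 0
```

Flagging invalid rows rather than dropping them keeps the decision about how to handle bad dates visible to whoever runs the downstream models.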

Pre-generated Outputs (Committed to Repository)

The following files are the outputs of slow training runs and are committed so the report can be re-knitted without re-running the full training pipeline.

File	Generated by	Purpose
`models/5_fold_rf.rds`	`Method1_Final_model.R`	RQ2 Method 1 — trained Random Forest
`models/xgb_model.rds`	`Method1_Final_model.R`	RQ2 Method 1 — trained XGBoost
`models/svm_model_weight1.5.rds`	`Method2_Complete.R`	RQ2 Method 2 — trained SVM (RBF, weight 1.5)
`data/cause14_dual_model_predictions_5fold.csv`	`Method1_Final_model.R`	RQ2 Method 1 — per-record raw predictions
`data/cause14_final_ensemble_classification.csv`	`Method1_Final_model.R`	RQ2 Method 1 — final ensemble classifications
`data/cause14_binary_predictions_with_confidence.csv`	`Method2_Complete.R`	RQ2 Method 2 — binary predictions with probabilities

Repository Structure

project-root/
│
├── data/                                          # Input datasets + intermediate outputs
│   ├── fires_with_max_weather_corrected.csv       # ⚠ not tracked — add manually
│   ├── CA_Weather_Fire_Dataset_1984-2025.csv      # ⚠ not tracked — add manually
│   ├── CA_Weather_Fire_Dataset_2008_2023_with_duration.csv  # ⚠ not tracked — add manually
│   ├── cause14_dual_model_predictions_5fold.csv   # pre-generated, committed
│   ├── cause14_final_ensemble_classification.csv  # pre-generated, committed
│   └── cause14_binary_predictions_with_confidence.csv       # pre-generated, committed
│
├── src/                                           # All analysis scripts
│   ├── Seasonal_trends.R                          # Monthly / seasonal trend plots
│   ├── NaturalvsHumanMade_Piechart.R              # RQ2 cause-type pie chart
│   ├── Pie_Chart_Classification_by_class.R        # RQ2 cause distribution donut
│   ├── Method1_Final_model.R                      # RQ2 Method 1 — RF + XGBoost ensemble
│   ├── Method1_Pie_Chart.R                        # RQ2 Method 1 — classification outcome pie
│   ├── Method2_Complete.R                         # RQ2 Method 2 — binary SVM
│   ├── Method-2_Plots.R                           # RQ2 Method 2 — additional SVM plots
│   ├── research-question-4.R                      # RQ1 — weather-based RF fire-day predictor
│   ├── Fire_duration_Plot.R                       # RQ3 — fire duration distribution plot
│   └── Spearman_Plot.R                            # EDA — Spearman correlation heatmap
│
├── models/                                        # Saved model objects (.rds)
│   ├── 5_fold_rf.rds
│   ├── xgb_model.rds
│   └── svm_model_weight1.5.rds
│
├── figures/                                       # Static images referenced by the report
│   ├── figure1_spearman.png
│   ├── figure2_Seasonal.png
│   ├── figure3_duration.png
│   ├── figure4_cause_pie.png
│   ├── figure5_flowchart.png
│   ├── figure6_method1_pie.png
│   ├── figure7_natural_human.png
│   ├── figure8_weight_accuracy.png
│   ├── figure9_auc_roc.png
│   ├── figure10_svm_grouping.png
│   ├── figure11_new_seasonal.png
│   ├── figure12_svm_example.png
│   ├── RQ3_methodology.png
│   └── Smart_flowchart.png
│
├── reports/                                       # R Markdown report and supporting files
│   ├── Group4_Project_report_Final.Rmd
│   └── float-setup.tex                            # LaTeX header for figure placement
│
├── packages.R                                     # Quick-setup entry point (also in src/)
├── .gitignore
└── README.md

Research Questions & Methods

RQ1 — Can weather variables predict wildfire occurrence?

Model: Tuned Random Forest (ranger) via caret with repeated 5-fold cross-validation and down-sampling to address class imbalance.
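
The idea behind down-sampling can be shown in base R. In the project this is handled by caret during resampling; the sketch below only illustrates the principle on synthetic data, and the column name `FIRE_DAY` and the class sizes are assumptions.

```r
set.seed(517)  # sampling is random, so fix the seed for repeatability

# Synthetic, imbalanced day-level data: 20 fire days vs. 80 non-fire days.
days <- data.frame(
  FIRE_DAY = factor(c(rep("yes", 20), rep("no", 80)), levels = c("no", "yes")),
  MAX_TEMP = c(rnorm(20, mean = 90, sd = 5), rnorm(80, mean = 70, sd = 8))
)

# Size of the smallest class.
n_min <- min(table(days$FIRE_DAY))

# Draw n_min row indices from each class, without replacement.
rows_by_class <- split(seq_len(nrow(days)), days$FIRE_DAY)
keep <- unlist(lapply(rows_by_class, function(i) sample(i, n_min)))

balanced <- days[keep, ]
```

When down-sampling is combined with cross-validation, it is normally applied inside each training fold only, so the held-out fold keeps the original class balance.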

Features: MAX_TEMP, MIN_TEMP, PRECIPITATION, AVG_WIND_SPEED, TEMP_RANGE, WIND_TEMP_RATIO, lagged precipitation, lagged wind speed, SEASON, MONTH.

Split: Train 1984–2015 · Test 2016–2023.
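
A minimal base-R sketch of the derived features and the year-based split follows. The source names the features but not their formulas, so the definitions used here (range as max minus min, ratio as wind speed over max temperature, a one-day lag) are assumptions, as are the `DATE` column and the toy values.

```r
# Toy daily weather table; values are illustrative, not project data.
weather <- data.frame(
  DATE           = as.Date(c("2015-12-30", "2015-12-31", "2016-01-01", "2016-01-02")),
  MAX_TEMP       = c(60, 62, 58, 65),
  MIN_TEMP       = c(40, 45, 39, 44),
  PRECIPITATION  = c(0.0, 0.2, 0.0, 0.1),
  AVG_WIND_SPEED = c(5, 8, 6, 10)
)
weather <- weather[order(weather$DATE), ]

# Assumed definitions of the derived features.
weather$TEMP_RANGE      <- weather$MAX_TEMP - weather$MIN_TEMP
weather$WIND_TEMP_RATIO <- weather$AVG_WIND_SPEED / weather$MAX_TEMP

# One-day lag: shift the series down one row and pad the first day with NA.
lag1 <- function(x) c(NA, head(x, -1))
weather$LAGGED_PRECIPITATION  <- lag1(weather$PRECIPITATION)
weather$LAGGED_AVG_WIND_SPEED <- lag1(weather$AVG_WIND_SPEED)

# Year-based split matching the stated train and test ranges.
yr    <- as.integer(format(weather$DATE, "%Y"))
train <- weather[yr >= 1984 & yr <= 2015, ]
test  <- weather[yr >= 2016 & yr <= 2023, ]
```

Splitting by year rather than at random keeps every test observation later in time than the training data, which avoids leaking future weather into the model.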

Key findings: Temperature and month are the strongest predictors. The model generalises well on unseen data (2016–2023).

Figure 1 — Count of fire days by season.

RQ2 — Can unknown fire causes be classified?

Approximately 31 % of records carry cause code 14 (unknown). Two complementary methods:

Method 1 — Multiclass ensemble (RF + XGBoost)

Model	5-Fold CV Accuracy	Top Features
Random Forest	39.39 %	Max Wind Speed, Max Temp, Wind Speed, Area
XGBoost	41.24 %	Wind Speed, Duration, Temp, Area

A record is treated as a confirmed classification only when both models agree on its class. About 51.9 % of unknown records were classified this way.
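
The agreement rule can be sketched in a few lines of base R. The prediction vectors and cause labels below are made up for illustration; the authoritative logic lives in `src/Method1_Final_model.R`.

```r
# Toy predictions for five cause-14 records; labels are illustrative only.
rf_pred  <- c("Lightning", "Arson", "Equipment Use", "Arson",  "Debris")
xgb_pred <- c("Lightning", "Arson", "Arson",         "Debris", "Debris")

# Keep a label only when both models agree; otherwise leave it unclassified.
agree       <- rf_pred == xgb_pred
final_class <- ifelse(agree, rf_pred, NA_character_)

# Share of records the ensemble was able to classify.
classified_share <- mean(agree)
```

Requiring agreement trades coverage for confidence: records on which the two models disagree stay unknown rather than receiving a low-confidence label.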

Figure 2 — Ensemble classification outcome for Cause-14 fires.

Method 2 — Binary SVM (RBF kernel)

Causes collapsed to Human-made vs. Natural before re-training.

Metric	Value
Accuracy	84.99 %
Sensitivity (Natural)	61.95 %
Specificity (Human-made)	94.20 %
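
For readers unfamiliar with these metrics, the sketch below computes them from raw labels in base R, treating Natural as the positive class as the table does. The label vectors are toy data and do not reproduce the reported values.

```r
# Toy labels; "Natural" is the positive class, "Human" the negative class.
actual    <- c("Natural", "Natural", "Human", "Human", "Human",   "Human", "Natural", "Human")
predicted <- c("Natural", "Human",   "Human", "Human", "Natural", "Human", "Natural", "Human")

# Cells of the 2 x 2 confusion matrix.
tp <- sum(actual == "Natural" & predicted == "Natural")
fn <- sum(actual == "Natural" & predicted == "Human")
tn <- sum(actual == "Human"   & predicted == "Human")
fp <- sum(actual == "Human"   & predicted == "Natural")

accuracy    <- (tp + tn) / length(actual)  # share of all records labelled correctly
sensitivity <- tp / (tp + fn)              # share of natural fires recovered
specificity <- tn / (tn + fp)              # share of human-made fires recovered
```

A large gap between sensitivity and specificity, as in the table, usually indicates that the minority class is harder for the model to recover than the majority class.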

Figure 3 — Cause breakdown | Figure 4 — SVM ROC Curve.

RQ3 — What weather factors drive fire duration?

Model: RBF-SVM with decision boundary visualisations across feature pairs.

Key findings:

Longer fires are strongly associated with higher minimum temperatures.
Peak fire durations cluster mid-year (May–July).
Wind speed shows limited independent influence on duration once temperature is controlled for.

Figure 5 — Distribution of fire duration across the dataset.

RQ4 — Where are California's wildfire hotspots?

Methods: DBSCAN / HDBSCAN for geographic clustering; K-means (k = 4) for fire-behaviour clustering on duration × temperature × wind speed; composite risk score mapped geographically.
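
The fire-behaviour clustering step can be sketched with base R's `kmeans`. The data here are synthetic, the column names `FIRE_DURATION`, `MAX_TEMP`, and `MAX_WIND_SPEED` are assumptions, and standardising the features first is a common choice rather than something the source confirms.

```r
set.seed(517)  # k-means starts from random centres, so fix the seed

# Synthetic stand-in for the fire table; values are not project data.
fires <- data.frame(
  FIRE_DURATION  = rexp(200, rate = 0.2),
  MAX_TEMP       = rnorm(200, mean = 85, sd = 10),
  MAX_WIND_SPEED = rnorm(200, mean = 12, sd = 4)
)

# Standardise so that no single feature dominates the Euclidean distance.
features <- scale(fires[, c("FIRE_DURATION", "MAX_TEMP", "MAX_WIND_SPEED")])

# k = 4 as stated above; nstart restarts reduce sensitivity to initialisation.
km <- kmeans(features, centers = 4, nstart = 25)
fires$CLUSTER <- km$cluster

# Profile each cluster in the original units.
cluster_means <- aggregate(. ~ CLUSTER, data = fires, FUN = mean)
```

Inspecting `cluster_means` is one way to give each cluster an interpretable description, for example long and hot versus short and windy.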

High-risk zones identified: Southern California · Central Valley · Sacramento.

Figure 6 — Spearman correlation matrix of numeric features.

Note on Reproducibility

This repository is substantially reproducible subject to the following conditions:

Path updates required. All .R scripts contain hard-coded absolute paths (local to the original authors' machines) that must be changed to relative paths before running on any other machine. The path-replacement table is in the section below.
Re-training not required for PDF rendering. Pre-generated model objects (5_fold_rf.rds, xgb_model.rds, svm_model_weight1.5.rds) and intermediate data files are committed to the repository so the Rmd can be re-knitted immediately after supplying the three raw datasets.
Re-training is slow. Reproducing the RF (RQ1) and RF + XGBoost ensemble (RQ2) from scratch requires repeated 5-fold cross-validation and may take several hours on a standard laptop.
RQ2 Method 1 pipeline. The full RF + XGBoost training logic is contained in src/Method1_Final_model.R. The same logic is embedded as {r} code chunks inside the R Markdown report. The standalone script is the authoritative source.
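
One way to replace an absolute path is to build it from the working directory with `file.path`. The sketch below assumes the script is run with `src/` as the working directory and checks that the three raw datasets are in place; the variable names are hypothetical.

```r
# Build dataset paths relative to src/ instead of hard-coding absolute paths.
data_dir <- file.path("..", "data")

required <- c(
  "fires_with_max_weather_corrected.csv",
  "CA_Weather_Fire_Dataset_1984-2025.csv",
  "CA_Weather_Fire_Dataset_2008_2023_with_duration.csv"
)
paths <- file.path(data_dir, required)

# Report any file that still needs to be added before running the scripts.
absent <- required[!file.exists(paths)]
if (length(absent) > 0) {
  message("Missing from data/: ", paste(absent, collapse = ", "))
}
```

A script would then read a dataset with, for example, `read.csv(paths[1])` rather than a machine-specific path.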

How to Run With the Current Files

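# ── Step 1: Install TinyTeX for PDF rendering (run once, if not installed) ───
# The tinytex R package must be present before its installer can be called.
if (!requireNamespace("tinytex", quietly = TRUE)) install.packages("tinytex")
tinytex::install_tinytex()   # skip if you already have a LaTeX distribution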

# ── Step 2: Install all R packages (run once) ────────────────────────────────
source("packages.R")

# ── Step 3: Place raw datasets in data/ ──────────────────────────────────────
# data/fires_with_max_weather_corrected.csv             (contact project team)
# data/CA_Weather_Fire_Dataset_1984-2025.csv            (download from Zenodo)
# data/CA_Weather_Fire_Dataset_2008_2023_with_duration.csv  (contact project team)

# ── Step 4 (optional): Re-run individual analysis scripts ────────────────────
# All scripts in src/ use paths relative to src/ as the working directory.
# Open each script in RStudio and run it from there, OR set your working
# directory to src/ before sourcing:
setwd("src")
source("Seasonal_trends.R")
source("NaturalvsHumanMade_Piechart.R")
source("Pie_Chart_Classification_by_class.R")
source("Method1_Final_model.R")    # slow — re-trains RF + XGBoost
source("Method2_Complete.R")       # re-trains SVM, writes prediction CSV
source("research-question-4.R")   # slow — re-trains tuned Ranger RF
setwd("..")                        # return to repo root when done

# ── Step 5: Render the full report ───────────────────────────────────────────
# Run from the repo root. knitr resolves all paths relative to reports/.
rmarkdown::render(
  "reports/Group4_Project_report_Final.Rmd",
  output_format = "pdf_document",
  output_dir    = "reports/"
)

Required Packages

Install everything in one step: source("packages.R")

Package	Purpose
`ggplot2`	All visualisations
`dplyr` · `tidyr` · `tidyverse` · `readr`	Data manipulation
`lubridate`	Date/time parsing
`scales`	Axis formatting
`gridExtra` · `corrplot`	Multi-panel plots & correlation matrix
`maps`	California state outline
`sf` · `ggmap`	Spatial data handling
`caret`	Model training, cross-validation, metrics
`e1071`	SVM (RBF kernel)
`ranger` · `randomForest`	Random Forest (fast training + importance)
`xgboost`	Gradient boosting (RQ2 Method 1 ensemble)
`pROC`	ROC / AUC computation
`dbscan`	DBSCAN / HDBSCAN geographic clustering
`knitr` · `tinytex`	Report rendering

Authors

Name	Role
Apurva Malpure	Data engineering, RQ2 multiclass classification (Method 1)
Prithvish Taukari	Visualisations, RQ2 binary SVM classification (Method 2)
Sai Kiran Billa	Weather-based fire prediction, Random Forest (RQ1)
Yashvi Bhatt	Hotspot identification, DBSCAN / K-means clustering (RQ4)

IS-517: Methods of Data Science
