HazardMapper is an open-source tool for analyzing, processing, and modeling hazards based on geospatial conditioning-factor datasets. It includes components for data preprocessing, partitioning, model training and evaluation, hyperparameter sweeps, map generation, and SHAP explanations, making it easy to generate hazard maps for various regions.
- `HazardMapper/` — Contains the main source code, including modules for analysis (`analysis.py`), architecture (`architecture.py`), dataset management (`dataset.py`), modeling (`model.py`), partitioning (`partition.py`), preprocessing (`preprocess.py`), and various utility functions (`utils.py`).
- `Experiments/` — Includes shell scripts for running experiments (e.g. `run.sh` and `preprocess.sh`).
1. Clone the repository:

   ```bash
   git clone https://github.com/your_username/HazardMapper.git
   cd HazardMapper
   ```

2. Set up the environment. If using conda, run:

   ```bash
   conda env create -f environment.yml
   conda activate hazardmapper
   ```

3. Install the package locally:

   ```bash
   pip install -e .
   ```
- **Data:**
  To use this package, a data directory needs to be configured in the `dataset.py` script, containing the `.npy` arrays for modeling. By default it will look in `Input/Europe/`.
- **Partition Map:**
  For the `partition.sh` script to run, it needs a `partition_map/sub_countries_rasterized.npy` in the data folder.
To enable experiment tracking and logging with Weights & Biases, you need to log in with your API key. Follow these steps:

1. **Obtain Your API Key:**
   - Sign up or log in at Weights & Biases.
   - Navigate to your account settings and copy your API key.
2. **Log In via the Command Line:**
   Run the following command in your terminal:

   ```bash
   wandb login YOUR_API_KEY_HERE
   ```

   Replace `YOUR_API_KEY_HERE` with your actual API key.
3. **Verify Login:**
   Once logged in, your experiments will automatically sync with your wandb account.

Logging in ensures that metrics, model checkpoints, and other experiment details are stored and visualized online.
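Once logged in, runs started by the HazardMapper scripts sync automatically. For orientation, this is roughly what the underlying calls look like (a minimal sketch; the project and metric names here are assumptions, not HazardMapper's actual configuration):

```python
import wandb

# Hypothetical standalone example; HazardMapper's scripts invoke wandb internally.
run = wandb.init(project="HazardMapper", name="smoke-test")
wandb.log({"epoch": 1, "val_loss": 0.42})  # metrics appear in the wandb dashboard
run.finish()
```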
1. **Data Preprocessing:**
   Run the preprocessing script to prepare your datasets:

   ```bash
   bash Experiments/preprocess.sh
   ```

2. **Run Experiments:**
   Start an experiment run with:

   ```bash
   bash Experiments/run.sh
   ```
- **Downscaling:** For easier testing and development, downscaling the data is suggested. The `utils.py` script performs this downscaling and should be run after preprocessing and partitioning.
- **Snellius:** For usage on Snellius, clone the git repository or transfer the `HazardMapper/` and `Experiments/` directories into the `Suceptibility/` directory. Create the environment using the instructions above and run the experiments with `sbatch Experiments/example.sh`.
The `dataset.py` module defines a custom dataset and helper functions to load hazard-specific features and labels as image patches for model training and evaluation. Key features include:
- **Path Configuration:**
  All file paths for raw inputs, preprocessed variables, hazard maps, and partition maps are defined in structured dictionaries (e.g., `raw_paths`, `var_paths`, `hazard_map_paths`, `label_paths`, and their downscaled versions). This ensures consistency across the pipeline and makes it easy to update data locations.
- **Custom Dataset Class (`HazardDataset`):**
  This class extends PyTorch's `Dataset` to:
  - Load multiple continuous and categorical features as channels.
  - Handle patch extraction from large geospatial arrays by applying appropriate padding (sketched below).
  - Binarize hazard labels (except for multi-hazard cases).
  - Validate input by ensuring that the specified hazard and variable names exist in the defined paths.
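The padding-based patch extraction can be pictured with the following minimal sketch (a hypothetical helper, not the exact `HazardDataset` implementation):

```python
import numpy as np

def extract_patch(array: np.ndarray, row: int, col: int, patch_size: int) -> np.ndarray:
    """Return a patch_size x patch_size window centered on (row, col).

    Assumes a float array; edges are padded with NaN so patches near the
    array border keep a fixed shape.
    """
    half = patch_size // 2
    padded = np.pad(array, half, mode="constant", constant_values=np.nan)
    # (row, col) in the original array maps to (row + half, col + half) in the
    # padded array, so the centered window starts at (row, col) in padded space.
    return padded[row:row + patch_size, col:col + patch_size]
```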
The `preprocess.py` module provides a pipeline for cleaning and transforming raw geospatial data stored in `.npy` files. The main tasks performed by this module include:
- **Cleaning Maps:**
  The `clean_map` function masks water bodies and out-of-bounds areas by setting their values to NaN based on landcover and elevation data.
- **Normalization:**
  The `normalize` function uses `MinMaxScaler` to scale data values to the range [0, 1] (see the sketch after this list). This is applied to most input variables to prepare them for subsequent modeling stages.
- **Label Encoding:**
  The `label_encode` function converts categorical data (e.g., landcover types) into numerical labels while preserving NaN values for missing data.
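A NaN-aware version of this scaling can be sketched as follows (an illustrative reimplementation; the actual `normalize` function may differ in detail):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def normalize(arr: np.ndarray) -> np.ndarray:
    """Scale the valid (non-NaN) values to [0, 1], leaving NaN cells untouched."""
    out = np.full(arr.shape, np.nan)
    mask = ~np.isnan(arr)
    valid = arr[mask].reshape(-1, 1)  # MinMaxScaler expects a 2D array
    out[mask] = MinMaxScaler().fit_transform(valid).ravel()
    return out
```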
When executed, the module:

- Loads raw data files for various environmental variables from predefined paths.
- Applies a log transformation to selected variables (e.g., elevation, slope) to handle skewed distributions.
- Normalizes all variables and, if needed, applies label encoding for categorical variables.
- Cleans each variable by masking water bodies and out-of-bounds areas based on the landcover and elevation data.
- Saves the processed data back as `.npy` files to specified output paths.
To run the entire preprocessing pipeline, simply execute:

```bash
python HazardMapper/preprocess.py
```

This script will iterate through the predefined list of variables, apply the necessary transformations, and save the processed files for later use in model training and evaluation.
The `partition.py` module handles the creation and management of partition maps for hazard data in Europe. It enables you to:
- Filter Hazard Occurrences: Only include regions with hazard data.
- Erode Partition Borders: Use binary erosion (with a configurable kernel size) to remove border cells and reduce data leakage during patch sampling (see the sketch below).
- Balance Partitions: Downsample non-hazard cells to balance the dataset within each split (train, validation, test).
- Sample the Partition Map: Randomly select a subset of partition samples to match a desired sample size.
The module uses Python's `argparse` to define command-line arguments that let you customize the partition mapping process. For example:

- `-z/--hazard` specifies the hazard type (e.g., flood, wildfire, landslide) for which the partition map is generated.
- `-n/--n_samples` sets the number of samples to downsample the partition map, with a default of 1,000,000.
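A typical invocation might look like `python HazardMapper/partition.py -z flood -n 1000000`. The border-erosion step referenced above can be pictured with this sketch (built on `scipy.ndimage`; the kernel handling in `partition.py` itself may differ):

```python
import numpy as np
from scipy.ndimage import binary_erosion

def erode_partition(partition_map: np.ndarray, partition_id: int,
                    kernel_size: int = 3) -> np.ndarray:
    """Shrink one partition's footprint so patches sampled inside it cannot
    straddle the border with a neighboring split (reducing data leakage)."""
    region = partition_map == partition_id
    kernel = np.ones((kernel_size, kernel_size), dtype=bool)
    return binary_erosion(region, structure=kernel)
```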
The `model.py` module is the core of HazardMapper's modeling functionality. It provides the classes and functions needed to build, train, evaluate, and interpret both traditional and deep learning hazard susceptibility models.
- **Argument Parsing:**
  The module uses Python's `argparse` to define command-line arguments that let you customize the model configuration and training process. For example:
  - `-n/--name`: Sets the experiment name. Default: `"HazardMapper"`.
  - `-z/--hazard`: Specifies the hazard type, e.g. `"landslide"`, `"wildfire"`, or `"flood"`. Default: `"landslide"`.
  - `-b/--batch_size`: Defines the batch size for training, i.e. the number of samples processed simultaneously. Default: `1024`.
  - `-p/--patch_size`: Determines the patch size for the model input, i.e. the dimensions of the input data patches provided to the model. Default: `5`.
  - `-a/--architecture`: Selects the model architecture. Baseline models: `"LR"`, `"RF"`, `"MLP"`; deep learning architectures: `"CNN"`, `"SimpleCNN"`, `"SpatialAttentionCNN"`, `"CNN_GAP"`, `"CNN_GAPatt"`. Default: `"CNN"`.
  - `-e/--epochs`: Specifies the number of training epochs, i.e. the number of full passes through the training dataset. Default: `5`.
  - `--sweep`: A boolean flag that enables hyperparameter optimization (a sweep) using tools like Weights & Biases. Include this flag if you wish to perform a hyperparameter sweep. Default: `False`.
  - `--map`: A flag that triggers the creation of a hazard map after training, automating map generation with the trained model. Default: `False`.
  - `--explain`: A flag that computes SHAP (SHapley Additive exPlanations) values for model explainability, helping to interpret model predictions. Default: `False`.
- **Model Classes:**
  - **`Baseline`:**
    Implements traditional machine learning models (Logistic Regression, Random Forest, or MLP) for pixel-wise classification. Note that for baseline models, only a patch size of 1 is supported.
  - **`HazardModel`:**
    Implements deep learning models using PyTorch. It supports several architectures: MLP, CNN, SimpleCNN, SpatialAttentionCNN, CNN_GAP, and CNN_GAPatt.
- **Training and Evaluation:**
  The module implements a complete training pipeline:
  - Loading and partitioning datasets using associated data loader classes.
  - Defining the model architecture and training loops.
  - Monitoring training with early stopping and logging metrics (see the sketch below).
  - Saving the best model and exporting it to ONNX format.
  - Evaluating model performance with metrics such as accuracy, precision, recall, F1 score, AUROC, average precision, and MAE.
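The early-stopping pattern can be summarized as follows (a generic sketch; `train_step` and `val_loss_fn` are assumed callables standing in for HazardMapper's actual training functions):

```python
import torch

def train_with_early_stopping(model, train_step, val_loss_fn,
                              max_epochs=50, patience=5):
    """Run train_step() each epoch; stop once validation loss stops improving."""
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = val_loss_fn()
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_val
```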
- **Advanced Features:**
  - **Hyperparameter Optimization (Sweep):**
    You can enable a hyperparameter sweep (using Weights & Biases) with the `--sweep` flag. This is supported only for PyTorch-based architectures.
  - **Hazard Map Generation:**
    By specifying the `--map` flag, the module creates a hazard susceptibility map for the region using model predictions.
  - **Model Explanation:**
    For deep learning models, the `--explain` flag computes SHAP values to provide model explainability.
- **Model Manager:**
  The `ModelMgr` class is responsible for:
  - Configuring and coordinating the different model types.
  - Managing output directories, logging, and folder structure for saving results.
  - Integrating evaluation, model saving, and logging of results (both locally and to Weights & Biases).
An example command to run a training instance with the desired configuration looks like:

```bash
python HazardMapper/model.py -n "MyExperiment" -z "landslide" -b 1024 -p 5 -a "SimpleCNN" -e 5 --explain
```

The `architecture.py` module provides various architectures for hazard susceptibility modeling. These include both deep learning models built with PyTorch and traditional baseline models implemented with scikit-learn. The available architectures are:
- **Baseline Models (scikit-learn):**
  - **Logistic Regression (LR):**
    Implements a logistic regression model for pixel-wise classification. It is simple and interpretable, but only supports a patch size of 1.
  - **Random Forest (RF):**
    Uses an ensemble of decision trees for robust classification. Like LR, it only supports pixel-wise classification with a patch size of 1.
- **Deep Learning Models (PyTorch):**
  - **MLP:**
    A fully connected neural network designed to process 1D feature vectors. Useful as a baseline for non-spatial inputs.
  - **CNN:**
    A basic convolutional neural network that applies convolutional filters to capture spatial patterns in input patches. Note: this architecture is also among the most stable options.
  - **CNNatt:**
    A variant that applies spatial attention after the convolutional layers for improved feature emphasis.
  - **SimpleCNN:**
    A lightweight CNN that balances complexity and performance, designed for patch-based spatial data.
  - **SpatialAttentionCNN:**
    Incorporates a spatial attention mechanism to focus on important areas in the input data.
  - **CNN_GAP:**
    A CNN with Global Average Pooling (GAP) to create robust, patch-size-agnostic representations (illustrated below). Note: this architecture is one of the most stable models in the system.
  - **CNN_GAPatt:**
    Combines convolutional feature extraction with an attention mechanism and GAP.
Depending on your experimental requirements and the nature of your input data, you can select the appropriate architecture. For robust and stable deep learning performance, CNN_GAP and CNN are recommended, while LR and RF serve as quick and interpretable baselines using scikit-learn.
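To make the GAP idea concrete: averaging each feature map over its spatial dimensions yields a fixed-size vector regardless of the input patch size. A minimal, illustrative PyTorch model (not the repo's actual `CNN_GAP` class) looks like this:

```python
import torch
import torch.nn as nn

class TinyCNNGAP(nn.Module):
    """Illustrative CNN with Global Average Pooling for binary susceptibility."""

    def __init__(self, in_channels: int, hidden: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                 # (B, hidden, H, W)
        x = x.mean(dim=(2, 3))               # global average pooling -> (B, hidden)
        return torch.sigmoid(self.head(x))   # susceptibility score in [0, 1]
```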
The `utils.py` module provides a collection of utility functions for handling, processing, and visualizing geospatial data. These functions support various preprocessing and plotting tasks needed throughout the HazardMapper pipeline. Key functionalities include:
- **Downscaling Maps:**
  The `downscale_map(path)` function downsamples a raster map (stored as a NumPy array) by a fixed factor (default is 10) and saves the new downscaled map as a `.npy` file. This is useful for faster processing and visualization (see the sketch after this list).
- **Converting Raster Files:**
  The `tif_to_npy(tif_file, npy_file)` function converts a `.tif` file into a NumPy array using the `rasterio` library. This conversion allows you to work with standard array operations and integrate the data into the processing pipeline.
- **Water Mask Creation:**
  The `make_water_mask(downsample_factor=1)` function generates a binary water mask from the landcover raster. It marks pixels corresponding to water (identified by the landcover value of 210) and saves this mask for reuse.
- **Plotting Individual Maps:**
  The `plot_npy_arrays(...)` function allows for detailed visualization of a NumPy array over a map. It supports:
  - Downsampling to manage very large arrays.
  - Logarithmic transformation of data.
  - Debugging support (by distinguishing NaNs).
  - Overlaying additional features such as water masks.
  - Automatic inference of plot titles, names, and types based on the input file.
- **Grid Plotting of Conditioning Factors:**
  The `plot_maps_grid(...)` function generates a multi-panel (e.g., 4×4) grid of maps, each displaying different conditioning factors. This function handles:
  - Consistent downsampling and layout of subplots.
  - Customizable color maps and grid configuration.
  - Shared axis labels and a common colorbar across all panels.
- **Data Normalization:**
  The `normalize_label(hazard_map, threshold=0.99)` function normalizes hazard maps to the [0, 1] range based on a specified percentile threshold. This normalization is essential for comparing or visualizing hazard intensities across different maps.
- **Helper Functions:**
  - `infer_name_from_path(data)` – infers a human-readable name from a file path.
  - `infer_title_from_name(name, type)` – constructs a default title for a plot.
  - `infer_type(data, name)` – deduces the type of data (continuous, partition, bins, or categorical) from the array values.
  - `infer_downscaled(path)` – determines whether a file is downscaled based on its filename.
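The conversion and downscaling utilities roughly correspond to the following sketch (simplified assumptions: only the first raster band is read, and downscaling is plain stride-based subsampling):

```python
import numpy as np
import rasterio

def tif_to_npy_sketch(tif_file: str, npy_file: str) -> None:
    """Read the first band of a GeoTIFF and save it as a .npy array."""
    with rasterio.open(tif_file) as src:
        np.save(npy_file, src.read(1))

def downscale_sketch(arr: np.ndarray, factor: int = 10) -> np.ndarray:
    """Keep every `factor`-th pixel along each axis."""
    return arr[::factor, ::factor]
```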
The module also supports command-line execution for common tasks. For example:

- **Downscaling Raster Maps:**

  ```bash
  python HazardMapper/utils.py --downscale
  ```

  This command downscales all maps defined in your variable and label paths.

- **Plotting a Grid of Maps:**

  ```bash
  python HazardMapper/utils.py --plot_grid
  ```

  This command generates and saves a grid of maps that display various environmental conditioning factors.

- **Plotting a Single Map:**

  ```bash
  python HazardMapper/utils.py --plot path/to/your_file.npy
  ```

  This command displays the plot for a specific `.npy` file, applying any inferred or specified visualization settings.
HazardMapper is open source and available under the GNU General Public License v3.0.