GeoHG: Space-aware Socioeconomic Indicator Inference with Heterogeneous Graphs

A modular, pip-installable Python toolkit that uses heterogeneous graph neural networks to infer socioeconomic indicators (Carbon, GDP, Population, PM2.5, Night Light) from spatial data. Published at ACM SIGSPATIAL 2025.

TL;DR — Turn your geographic raster/vector data into a heterogeneous graph, then let message-passing GNNs learn multi-scale spatial relationships that traditional geostatistics cannot capture. Three lines of Python to go from scatter points to grid predictions.

from geohg import GeoHGInterpolator
interp, result = GeoHGInterpolator.from_dataframe(df, coord_columns=["lon","lat"], target_column="value", resolution=0.01)
# result["predictions"] -> (n_lat, n_lon) grid array

@inproceedings{zou2024space,
  title={Space-aware Socioeconomic Indicator Inference with Heterogeneous Graphs},
  author={Zou, Xingchen and Huang, Jiani and Hao, Xixuan and Yang, Yuhao and Wen, Haomin and Yan, Yibo and Huang, Chao and Chen, Chao and Liang, Yuxuan},
  booktitle={The 33rd ACM International Conference on Advances in Geographic Information Systems},
  year={2025}
}

Why Graphs for Geography?

"Everything is related to everything else, but near things are more related than distant things." — Tobler's First Law of Geography

Traditional spatial methods (IDW, Kriging, spatial regression) operationalize Tobler's Law through distance-based weights — the closer two locations are, the more influence they have on each other. This works well but misses a crucial insight:

Spatial relationships are richer than just proximity.

  • Two forests 50 km apart may behave more similarly than a forest and a factory next door
  • Commercial zones across a city share economic patterns regardless of distance
  • POI distributions (restaurants, offices, parks) reveal functional urban structure that pure distance ignores

GeoHG captures all these relationships in a single heterogeneous graph:

           ┌─────────────────────────────────────────────────────┐
           │         The GeoHG Heterogeneous Graph               │
           │                                                     │
           │    [Forest]         [Urban]         [Water]         │
           │    Entity           Entity          Entity          │
           │   ╱  │  ╲          ╱  │  ╲         ╱    ╲          │
           │  ╱   │   ╲        ╱   │   ╲       ╱      ╲         │
           │ a₁   a₅   a₈    a₂   a₄   a₇   a₃      a₆        │
           │  ╲   │   ╱ ╲    ╱    │   ╱                         │
           │   ╲  │  ╱   ╲  ╱     │  ╱    ← spatial adjacency   │
           │    ╲ │ ╱     ╲╱      │ ╱       (8-neighbor grid)   │
           │     a₉──────a₁₀─────a₁₁                           │
           │                                                     │
           │  ▪ area nodes   = grid cells (land cover features) │
           │  ▪ entity nodes = land cover categories (hypernode) │
           │  ▪ poi nodes    = POI categories (hypernode)        │
           │  ▪ edges        = near / locate / rev_locate        │
           └─────────────────────────────────────────────────────┘

Message passing on this graph lets information flow through both spatial proximity AND semantic similarity — something traditional geostatistics cannot do. Each GNN layer aggregates features from neighbors, progressively building richer representations that encode:

  1. Local context — what does this cell's neighborhood look like? (spatial adjacency)
  2. Global patterns — how do all areas of the same land cover type behave? (entity hypernodes)
  3. Functional similarity — what urban functions are present? (POI hypernodes)

GeoHG vs. Traditional Spatial Methods

| Method | Spatial Relationships | Feature Utilization | Scalability | Multi-task |
|---|---|---|---|---|
| IDW | Distance-only | None | High | No |
| Kriging | Variogram (distance) | Limited (co-kriging) | Medium | No |
| Spatial Regression (GWR) | Distance-weighted | Linear | Medium | No |
| Random Forest | None (tabular) | All features | High | No |
| GeoHG (Ours) | Multi-relational graph | All features + structure | High | Yes |

Highlights

  • Spatial Interpolation: Kriging-style interpolation via GeoHGInterpolator, from scatter points to grid predictions in 3 lines
  • Flexible Data Input: bring your own CSV/DataFrame, ESA WorldCover TIF, or use built-in sample data for 4 cities x 5 indicators
  • Heterogeneous GNN: dynamic node type inference; entity/POI counts are detected from data, not hardcoded
  • SSL Pretraining: contrastive pretraining with graph-structure and feature-similarity neighbors

Benchmark Results

Results on built-in Guangzhou dataset (8,540 grid cells, 70% masked as test set, seed=0):

| Task | Metric | Train | Validation | Note |
|---|---|---|---|---|
| Carbon | | 0.865 | 0.860 | 2000 epochs, early-stop at ~170 |
| GDP | | | | Run geohg train --task GDP |
| Population | | | | Run geohg train --task Population |
| PM2.5 | | | | Run geohg train --task PM25 |
| Night Light | | | | Run geohg train --task Light |

Reproduce with: geohg train --config configs/examples/guangzhou_carbon.yaml

All 4 cities (GZ/BJ/SH/SZ) x 5 indicators are available. Run your own benchmarks!
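
The evaluation setup above (70% of cells masked, fixed seed) can be pictured with a short NumPy sketch. It is an illustration only, not the package's own split logic: the helper names masked_split and r2_score and the use of numpy.random.default_rng are assumptions, so the exact cells selected will differ from what geohg produces.

```python
import numpy as np

def masked_split(n_cells, masked_ratio=0.7, seed=0):
    """Return boolean masks (observed, masked) over n_cells grid cells."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cells)
    n_masked = int(round(n_cells * masked_ratio))
    masked = np.zeros(n_cells, dtype=bool)
    masked[order[:n_masked]] = True
    return ~masked, masked

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical split over the 8,540 Guangzhou cells.
observed, masked = masked_split(8540, masked_ratio=0.7, seed=0)
print(observed.sum(), masked.sum())  # 2562 5978
```

With this split, labels of the masked cells are hidden during training and used only for scoring; predicting the mean of the targets everywhere gives an R² of 0, which is the baseline any model should beat.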



1. Installation

conda create -n geohg python=3.10 -y
conda activate geohg

git clone <your-repo-url>
cd GeoHG

# Standard install
pip install -e .

# With TIF/rasterio support (for building graphs from ESA WorldCover)
pip install -e ".[geo]"

# With development tools
pip install -e ".[dev]"

Verify installation:

geohg --version
python -c "from geohg import GeoHGPipeline; print('OK')"

2. Quick Start

30-Second Demo

from geohg import GeoHGPipeline

# One line: load built-in Guangzhou data, train, evaluate
results = GeoHGPipeline.quick_start(city="GZ", task="Carbon")
print(f"Test R²: {results['r2']:.4f}")

Spatial Interpolation (3 lines)

from geohg import GeoHGInterpolator

interp, result = GeoHGInterpolator.from_dataframe(
    df, coord_columns=["lon", "lat"],
    target_column="value", resolution=0.01
)
# result["predictions"] is a (n_lat, n_lon) grid array

See Section 3 for details and notebooks/quickstart_interpolation.ipynb for a full tutorial.

CLI (recommended for experiments)

# Train a model
geohg train --config configs/examples/guangzhou_carbon.yaml

# End-to-end: build graph + train + evaluate
geohg run --config configs/examples/guangzhou_carbon.yaml

Python API

from geohg import GeoHGPipeline

pipeline = GeoHGPipeline.from_yaml("configs/examples/guangzhou_carbon.yaml")
results = pipeline.run()

3. Spatial Interpolation

GeoHG can perform spatial interpolation similar to Kriging, but using heterogeneous graph neural networks on discrete grids.

Key difference from Kriging: GeoHG divides space into regular grid cells at a user-specified resolution (e.g., 0.01° ≈ 1 km), constructs a heterogeneous graph over the grid, and predicts a value for each cell rather than interpolating over continuous space.

Scatter points (N, 2)          GeoHG Pipeline
--------------------------     ---------------------------
coords + features + values     1. Grid discretization (resolution)
      |                        2. 8-neighbor adjacency + entity hyper-nodes
      +-->  UserDataSource --> 3. HeteroGraphBuilder -> HeteroData
                               4. Train GeoHGModel (observed -> train/val)
                               5. Predict all grid cells -> (n_lat, n_lon) array

Python API

from geohg import GeoHGInterpolator
import numpy as np

coords = np.column_stack([lons, lats])   # (N, 2)
features = np.column_stack([f1, f2, f3]) # (N, F)

# One-liner from DataFrame
interp, result = GeoHGInterpolator.from_dataframe(
    df, coord_columns=["lon", "lat"],
    target_column="value", resolution=0.01
)

# Or step by step
interp = GeoHGInterpolator(resolution=0.01, epochs=1000)
interp.fit(coords, features, values)
result = interp.predict()  # {"predictions": (n_lat, n_lon), "lons": ..., "lats": ...}
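
If per-cell rows are more convenient than a 2-D array, the grid can be flattened. The following is a minimal sketch, assuming result["lons"] and result["lats"] are 1-D arrays of cell coordinates with lengths n_lon and n_lat (the keys appear above, but their shapes are not documented here). The result dictionary below is a hand-built stand-in, not real GeoHG output, and grid_to_rows is a hypothetical helper.

```python
import numpy as np

# Stand-in for the dictionary returned by interp.predict(); values are made up.
result = {
    "predictions": np.arange(6, dtype=float).reshape(2, 3),  # (n_lat, n_lon)
    "lons": np.array([113.10, 113.11, 113.12]),
    "lats": np.array([22.40, 22.41]),
}

def grid_to_rows(result):
    """Flatten a (n_lat, n_lon) prediction grid into (lon, lat, value) rows."""
    # meshgrid with default "xy" indexing yields arrays of shape (n_lat, n_lon),
    # matching the layout of the prediction grid.
    lon_grid, lat_grid = np.meshgrid(result["lons"], result["lats"])
    return np.column_stack([
        lon_grid.ravel(), lat_grid.ravel(), result["predictions"].ravel()
    ])

rows = grid_to_rows(result)
print(rows.shape)  # (6, 3)
```

Each row then holds one grid cell, which is a convenient shape for saving to CSV or joining with other per-cell tables.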

CLI

geohg interpolate \
  --data my_data.csv \
  --coord-columns lon,lat \
  --target-column target \
  --resolution 0.01 \
  --output predictions.csv \
  --epochs 1000

Notebook Tutorial

See notebooks/quickstart_interpolation.ipynb for a complete tutorial including:

  • Synthetic data interpolation
  • Real-world case: Guangzhou carbon emission interpolation with 8,540 grid cells

4. Core Concepts: From Geographic Space to Graph

This section explains the key ideas behind GeoHG for GIS practitioners.

Step 1: Grid Discretization

Like rasterization in GIS, GeoHG divides the study area into regular grid cells. Each cell becomes an area node in the graph, with land cover ratios as its feature vector.

Continuous geographic space          Discrete grid (area nodes)
┌────────────────────┐               ┌───┬───┬───┬───┐
│  ~  forest  ~      │               │ a₀│ a₁│ a₂│ a₃│  Each cell:
│    ┌──urban──┐     │   ────────>   ├───┼───┼───┼───┤  - land cover ratios
│    │ ■■■■■■■ │     │   grid at     │ a₄│ a₅│ a₆│ a₇│  - position encoding
│    └─────────┘     │   0.01° res   ├───┼───┼───┼───┤  - (optional POI counts)
│  ~ water ~~~~~     │               │ a₈│ a₉│a₁₀│a₁₁│
└────────────────────┘               └───┴───┴───┴───┘
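
As a rough illustration of the discretization step, the sketch below maps scatter points to row-major cell ids on a regular lon/lat grid. It is not the code in geohg/data/builders/grid.py: the function name assign_cells and the choice of the minimum lon/lat as grid origin are assumptions made for this example.

```python
import numpy as np

def assign_cells(lons, lats, resolution):
    """Map points to cell ids of a regular lon/lat grid.

    The grid origin is the minimum lon/lat of the input, an assumption
    made for this sketch.
    """
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    cols = np.floor((lons - lons.min()) / resolution).astype(int)
    rows = np.floor((lats - lats.min()) / resolution).astype(int)
    n_lat, n_lon = int(rows.max()) + 1, int(cols.max()) + 1
    cell_id = rows * n_lon + cols  # row-major id, one per area node
    return cell_id, (n_lat, n_lon)

# Four made-up points on a coarse 0.5-degree grid.
cell_id, shape = assign_cells(
    lons=[10.0, 10.25, 10.75, 11.5],
    lats=[20.0, 20.6, 20.1, 21.2],
    resolution=0.5,
)
print(cell_id.tolist(), shape)  # [0, 4, 1, 11] (3, 4)
```

Points that fall into the same cell share an id, so their features and targets can be averaged per cell before the graph is built.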

Step 2: Spatial Adjacency (Tobler's Law)

Each grid cell is connected to its 8 neighbors (queen contiguity), forming (area, near, area) edges. This encodes Tobler's First Law — near things are more related.

┌───┬───┬───┐
│ ↖ │ ↑ │ ↗ │    8-neighbor connectivity
├───┼───┼───┤    = Queen contiguity in GIS
│ ← │ ● │ → │    = (area, near, area) edges
├───┼───┼───┤
│ ↙ │ ↓ │ ↘ │
└───┴───┴───┘
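
Queen contiguity on a full rectangular grid can be sketched as follows. The function name queen_edges is hypothetical and this is not the implementation in geohg/data/builders/adjacency.py. The output is laid out as a (2, n_edges) array, the same layout PyG uses for edge_index.

```python
import numpy as np

def queen_edges(n_rows, n_cols):
    """Directed (src, dst) pairs linking each cell to its up-to-8 neighbors."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    src, dst = [], []
    for r in range(n_rows):
        for c in range(n_cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                # Skip neighbors that fall outside the grid.
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    src.append(r * n_cols + c)
                    dst.append(rr * n_cols + cc)
    return np.array([src, dst])

edge_index = queen_edges(3, 4)
print(edge_index.shape)  # (2, 58)
```

Only interior cells have all 8 neighbors; boundary cells have 3 or 5. This is consistent with the Guangzhou graph having 67,172 (area, near, area) edges, slightly below the 8 × 8,540 = 68,320 upper bound.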

Step 3: Entity Hypernodes (Beyond Distance)

Here is where GeoHG goes beyond traditional spatial methods. Each land cover category (forest, urban, water, ...) becomes an entity hypernode that connects to all grid cells containing that category. This creates shortcuts in the graph:

                [Forest Entity]
               ╱       │       ╲
         a₀(70%)   a₅(40%)   a₁₁(90%)    ← cells with forest cover
              \        |        /
               ╲       │       ╱
                [Urban Entity]
               ╱       │       ╲
         a₃(80%)   a₆(55%)   a₇(60%)     ← cells with urban cover

Why this matters: A forest cell in the north and a forest cell in the south — even if far apart — can exchange information through the shared Forest entity node. This captures landscape-level patterns that distance-based methods miss entirely.
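
A minimal sketch of how entity edges could be derived from land cover ratios. The helper entity_area_edges is hypothetical, and treating the threshold as a strict "ratio > thresh" filter is an assumption about how the entity_thresh config key behaves; the tuple order follows the entity_id, area_id, proportion columns of entity_area.csv.

```python
import numpy as np

def entity_area_edges(ratios, thresh=0.0):
    """List (entity_id, area_id, proportion) for every ratio above thresh.

    ratios: (n_areas, n_entities) land cover proportions per grid cell.
    """
    area_ids, entity_ids = np.nonzero(ratios > thresh)
    return [(int(e), int(a), float(ratios[a, e]))
            for a, e in zip(area_ids, entity_ids)]

# Three cells, two categories (0 = forest, 1 = urban); numbers are made up.
ratios = np.array([[0.7, 0.3],
                   [0.0, 1.0],
                   [0.9, 0.0]])
edges = entity_area_edges(ratios)
print(edges)  # [(0, 0, 0.7), (1, 0, 0.3), (1, 1, 1.0), (0, 2, 0.9)]
```

Cells 0 and 2 are not spatial neighbors in this toy example, yet both attach to entity 0, which gives them a two-hop path through the forest hypernode.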

Step 4: POI Hypernodes (Urban Function)

Similarly, POI categories (restaurants, offices, parks, hospitals, ...) become hypernode types. Grid cells are connected to POI nodes based on their POI distributions, capturing functional urban structure.

Step 5: Heterogeneous Message Passing

The complete graph has 3 node types and 5 edge types:

| Node Type | Count (Guangzhou) | Features |
|---|---|---|
| area | 8,540 | Land cover ratios + position encoding |
| entity | 9 | Identity (one-hot) |
| poi | 14 | Identity (one-hot) |

| Edge Type | Count (Guangzhou) | Meaning |
|---|---|---|
| (area, near, area) | 67,172 | Spatial adjacency |
| (entity, locate, area) | 24,902 | Entity covers area |
| (area, rev_locate, entity) | 24,902 | Reverse of above |
| (poi, locate, area) | 9,957 | POI exists in area |
| (area, rev_locate, poi) | 9,957 | Reverse of above |

A GNN with to_hetero() conversion learns separate message functions for each edge type, then aggregates them — like having specialized spatial analysis for each type of geographic relationship, all learned end-to-end.
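
The idea of "one message function per edge type, then aggregate" can be shown without PyG. The NumPy sketch below is a conceptual stand-in, not GeoHGModel: the function names, the mean aggregation inside each relation, the sum across relations, and the identity weight matrices are all assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

def mean_aggregate(edge_index, x_src, n_dst):
    """Average source features over the incoming edges of each destination."""
    src, dst = edge_index
    summed = np.zeros((n_dst, x_src.shape[1]))
    np.add.at(summed, dst, x_src[src])
    counts = np.bincount(dst, minlength=n_dst).reshape(-1, 1)
    return summed / np.maximum(counts, 1)

def hetero_layer(x_area, x_entity, near, locate, weights):
    """Update area nodes: a separate linear map per edge type, then a sum."""
    msg_near = mean_aggregate(near, x_area, len(x_area)) @ weights["near"]
    msg_locate = mean_aggregate(locate, x_entity, len(x_area)) @ weights["locate"]
    return np.maximum(msg_near + msg_locate, 0.0)  # ReLU

x_area = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 area nodes
x_entity = np.eye(2)                                     # 2 one-hot entity nodes
near = np.array([[0, 1, 1, 2], [1, 0, 2, 1]])            # (area, near, area)
locate = np.array([[0, 0, 1], [0, 2, 1]])                # (entity, locate, area)
weights = {"near": np.eye(2), "locate": np.eye(2)}       # learned in practice

out = hetero_layer(x_area, x_entity, near, locate, weights)
print(out)  # rows: [1, 1], [1, 1.5], [1, 1]
```

In a trained model the two weight matrices differ, so the network can weigh what a cell hears from its spatial neighbors differently from what it hears from its land cover hypernode.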


5. Application Scenarios

GeoHG is designed for any task where you need to infer a spatially distributed socioeconomic indicator from land cover and/or POI data:

| Scenario | Target Variable | Input Features | Example |
|---|---|---|---|
| Carbon Emission Mapping | CO₂ emissions per grid cell | Land cover ratios, POI density | Urban carbon inventory |
| Economic Activity Estimation | GDP per grid cell | Land cover, commercial POI | Regional economic assessment |
| Population Distribution | Population density | Built-up area ratio, residential POI | Census disaggregation |
| Air Quality Prediction | PM2.5 concentration | Vegetation ratio, industrial land | Environmental monitoring |
| Night Light Estimation | Luminosity index | Urban land ratio, commercial POI | Urbanization tracking |
| Custom Indicator | Your own target | Your CSV/DataFrame features | GeoHGInterpolator.from_dataframe() |

When to use GeoHG over traditional methods:

  • You have multi-dimensional features (not just coordinates)
  • Your study area has heterogeneous land cover or diverse urban functions
  • You want to leverage structural similarity between distant but functionally similar areas
  • You need predictions at grid-cell resolution across the entire study area

6. Features

  • Spatial interpolation — Kriging-style interpolation via GeoHGInterpolator (scatter points -> grid predictions)
  • End-to-end pipeline — from raw data to trained model in one command or a few lines of Python
  • Modular architecture — configurable GNN encoder, MLP head, and training components
  • Dynamic graph construction — entity/POI node types inferred from data, not hardcoded
  • Multiple data sources — custom CSV/DataFrame, ESA WorldCover TIF, or built-in sample data
  • POI optional — works with or without POI data (hypernode: entity or mono)
  • Self-supervised pretraining — contrastive GNN pretraining with neighbor sampling
  • YAML configuration — three-level priority: defaults < user config < CLI overrides
  • CPU/GPU support — automatic fallback to CPU when CUDA is unavailable

7. Project Structure

GeoHG/
├── pyproject.toml                    # Package definition & dependencies
├── configs/
│   ├── default.yaml                  # Global default configuration
│   └── examples/
│       ├── guangzhou_carbon.yaml     # Guangzhou carbon emission example
│       ├── no_poi.yaml              # Training without POI data
│       └── ssl_pretrain.yaml        # Self-supervised pretraining
├── geohg/                            # Main package
│   ├── config/                       # Dataclass config schema + YAML loader
│   ├── data/
│   │   ├── sources/                  # Data source abstractions
│   │   │   ├── base.py              # LandCoverSource ABC + GraphRawData
│   │   │   ├── esa_worldcover.py    # ESA TIF processing (optional GDAL)
│   │   │   ├── custom_landcover.py  # User-provided CSV
│   │   │   └── poi.py              # Optional POI augmentation
│   │   ├── builders/                 # Graph construction
│   │   │   ├── graph.py             # HeteroGraphBuilder -> PyG HeteroData
│   │   │   ├── adjacency.py         # 8-neighbor grid adjacency
│   │   │   ├── grid.py              # TIF grid utilities
│   │   │   └── features.py          # Pixel-to-ratio feature extraction
│   │   ├── legacy.py                # Backward-compatible loader
│   │   └── transforms.py            # TargetNormalizer + train/val/test split
│   ├── models/
│   │   ├── gnn.py                   # GNNEncoder (configurable layers/type)
│   │   ├── heads.py                 # MLPHead (configurable dims)
│   │   ├── geohg.py                 # GeoHGModel = GNN + to_hetero + Head
│   │   └── ssl.py                   # Contrastive loss + SSLNeighborDataset
│   ├── training/
│   │   ├── trainer.py               # Supervised training loop
│   │   ├── ssl_trainer.py           # SSL pretraining loop
│   │   ├── evaluator.py             # R2/RMSE/MAE metrics
│   │   └── callbacks.py             # EarlyStopping + ModelCheckpoint
│   ├── pipeline.py                   # GeoHGPipeline (end-to-end orchestration)
│   ├── interpolator.py               # GeoHGInterpolator (spatial interpolation)
│   ├── cli/                          # Click CLI commands
│   │   ├── main.py                  # geohg entry point
│   │   ├── train.py                 # geohg train
│   │   ├── pretrain.py              # geohg pretrain
│   │   ├── evaluate.py              # geohg evaluate
│   │   ├── run.py                   # geohg run (end-to-end)
│   │   ├── build.py                 # geohg build-graph
│   │   └── interpolate.py           # geohg interpolate
│   ├── visualization/                # Training & prediction plots
│   └── utils/                        # Seed, device, logging
├── data/                             # Sample data (4 cities)
│   ├── Hyper_Graph/{GZ,BJ,SH,SZ}/  # Graph & feature files
│   └── downstream_tasks/{city}/     # Task label files
├── notebooks/                        # Jupyter tutorials
│   └── quickstart_interpolation.ipynb
└── tests/                            # Test suite

8. Data Format

Custom CSV (recommended)

The easiest way to use GeoHG with your own data — provide three CSV files:

from geohg.data.sources.custom_landcover import CustomLandCoverSource

source = CustomLandCoverSource(
    feature_csv="my_features.csv",   # area_id, type1_ratio, type2_ratio, ...
    coord_csv="my_coords.csv",       # area_id, lon, lat
    adjacency_csv="my_edges.csv",    # src_id, dst_id
)
raw = source.load()

Or use GeoHGInterpolator.from_dataframe(df, ...) to feed a DataFrame directly (see Section 3).
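
For reference, the three files could be produced with the standard library as sketched below. The column names follow the comments in the snippet above; everything else (the write_csv helper, the toy values, and listing each adjacency edge in both directions) is an assumption for illustration.

```python
import csv
import io

def write_csv(header, rows):
    """Serialize rows to CSV text.

    Swap io.StringIO for open(path, "w", newline="") to write a real file.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Two adjacent cells with two land cover ratios each; all values are made up.
features_csv = write_csv(["area_id", "type1_ratio", "type2_ratio"],
                         [[0, 0.7, 0.3], [1, 0.2, 0.8]])
coords_csv = write_csv(["area_id", "lon", "lat"],
                       [[0, 113.10, 22.40], [1, 113.11, 22.40]])
edges_csv = write_csv(["src_id", "dst_id"], [[0, 1], [1, 0]])
print(features_csv)
```

The area_id column is what ties the three files together, so the same ids must appear in the feature, coordinate, and edge files.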

ESA WorldCover TIF

Build graphs from GeoTIFF with automatic grid and adjacency construction (requires [geo] extra):

pip install -e ".[geo]"
geohg build-graph --tif-path ESA_WorldCover.tif --bbox 113.09 113.69 22.40 23.42 --output-dir data/my_city

Built-in sample data

The repository ships with sample data for 4 cities (GZ/BJ/SH/SZ) x 5 indicators under data/, all in CSV format. Use them directly with GeoHGPipeline.quick_start(city="GZ", task="Carbon").

Graph files under data/Hyper_Graph/{city}/:

| File | Columns | Description |
|---|---|---|
| adjacency.csv | src_id, dst_id | Area adjacency edges |
| entity_area.csv | entity_id, area_id, proportion | Entity-locate-area relations |
| poi_area.csv | poi_id, area_id, proportion | POI-locate-area relations |
| pos_encode.csv | area_id, x, y | Grid position encodings |
| TIF_feature.csv | File, Coordinates, Area, Value_*_Ratio | Land cover feature ratios |
| POI_feature.csv | TIF, POI_0, ..., POI_13, Count | POI feature ratios |

Task labels under data/downstream_tasks/{city}/:

| File | Columns | Description |
|---|---|---|
| Carbon.csv | area_id, value | Carbon emission |
| GDP.csv | area_id, value | GDP |
| Population.csv | area_id, value | Population |
| PM25.csv | area_id, value | PM2.5 |
| Light.csv | area_id, value | Night light |

Entity/POI type counts are automatically detected from the data files.


9. CLI Reference

All commands accept --config <path> for YAML configuration. CLI flags override config values.

| Command | Description |
|---|---|
| geohg train | Supervised training |
| geohg pretrain | Self-supervised contrastive pretraining |
| geohg evaluate --model-path <path> | Evaluate a saved model |
| geohg run | End-to-end pipeline (build + train + evaluate) |
| geohg build-graph | Build graph from TIF (requires [geo] extra) |
| geohg interpolate | Spatial interpolation from CSV scatter data |

Common options:

geohg train --config configs/examples/guangzhou_carbon.yaml \
  --city GZ \
  --task Carbon \
  --epochs 2000 \
  --lr 0.01 \
  --masked-ratio 0.7 \
  --metric r2 \
  --gpu 0

10. Configuration

Configuration uses a three-level priority system: configs/default.yaml < user YAML < CLI flags.

# configs/examples/guangzhou_carbon.yaml
data:
  city: GZ
  task: Carbon
  prebuilt_dir: data/Hyper_Graph
  downstream_dir: data/downstream_tasks
  pos_embedding: true
  hypernode: all            # all | entity | poi | mono
  entity_thresh: 0.0
  poi_thresh: 0.0

model:
  hidden_channels: 64
  num_gnn_layers: 3
  gnn_type: GraphConv       # GraphConv | SAGEConv
  head_dims: [32, 16]
  dropout: 0.5

training:
  epochs: 2000
  lr: 0.01
  metric: r2                # r2 | rmse | mae
  patience: 100
  masked_ratio: 0.7
  seed: 0

output:
  log_dir: outputs/logs
  model_dir: outputs/models
  plot_dir: outputs/plots

See configs/examples/ for more configuration templates.
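
The three-level priority can be pictured as a recursive dictionary merge in which later sources win. This is a sketch of the idea, not the loader in geohg/config/: merge_config is a hypothetical helper, and the user and CLI values below are made up.

```python
from copy import deepcopy

def merge_config(base, override):
    """Recursively merge override into a copy of base; override wins."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"training": {"epochs": 2000, "lr": 0.01, "seed": 0}}
user_yaml = {"training": {"lr": 0.005}}    # e.g. parsed from a user YAML file
cli_flags = {"training": {"epochs": 50}}   # e.g. from --epochs 50

config = merge_config(merge_config(defaults, user_yaml), cli_flags)
print(config)  # {'training': {'epochs': 50, 'lr': 0.005, 'seed': 0}}
```

Keys that no later source mentions (seed here) keep their default, which is why a user YAML only needs to list the values it changes.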


11. Self-Supervised Pretraining

Contrastive pretraining of the GNN encoder using graph-structure and feature-similarity neighbors:

# Pretrain
geohg pretrain --config configs/examples/ssl_pretrain.yaml --city GZ

# Fine-tune with pretrained encoder
geohg train --config configs/examples/guangzhou_carbon.yaml \
  --pretrained-gnn outputs/models/ssl_GZ.pth \
  --freeze-gnn

Or via Python:

from geohg import GeoHGPipeline

pipeline = GeoHGPipeline.from_yaml("configs/examples/ssl_pretrain.yaml")
results = pipeline.run(skip_pretrain=False)

Why SSL? When labeled data is scarce (e.g., only a few ground-truth monitoring stations), self-supervised pretraining learns useful spatial representations from the graph structure alone. Fine-tuning on the small labeled set then often yields better results than training from scratch.
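
To make "contrastive" concrete, here is a generic InfoNCE-style loss for a single anchor node. It is a textbook formulation offered as an assumption; the loss actually implemented in geohg/models/ssl.py may differ, and the function name, cosine similarity, and temperature value are illustrative choices.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor embedding, using cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

anchor = np.array([1.0, 0.0])
near_neighbor = np.array([0.9, 0.1])   # e.g. a graph or feature neighbor
far_nodes = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]

loss_good = info_nce(anchor, near_neighbor, far_nodes)
loss_bad = info_nce(anchor, far_nodes[1], [near_neighbor, far_nodes[0]])
print(loss_good < loss_bad)  # True
```

Minimizing this loss pulls the embeddings of a node and its chosen neighbors together while pushing unrelated nodes apart, which requires no target labels.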


12. Building Graphs from TIF

GeoHG can build heterogeneous graphs directly from ESA WorldCover GeoTIFF files:

pip install -e ".[geo]"

geohg build-graph \
  --tif-path path/to/ESA_WorldCover.tif \
  --bbox 113.09 113.69 22.40 23.42 \
  --output-dir data/my_city

The output files (feature CSV, adjacency, position encodings) can then be used with geohg train.

For custom data without GDAL, provide CSVs directly — see Data Format.

Supported land cover sources:

  • ESA WorldCover 10m — global land cover at 10m resolution
  • Custom GeoTIFF — any categorical raster with land cover classes
  • Custom CSV — pre-computed land cover ratios per grid cell

13. FAQ

Q: How is GeoHG different from Kriging / IDW?

Kriging and IDW are distance-based interpolation methods: they predict values at unknown locations as weighted averages of nearby observations, where the weights depend solely on distance (and, in Kriging's case, on the fitted variogram).

GeoHG uses a fundamentally different approach: it constructs a heterogeneous graph that encodes multiple types of spatial relationships (proximity, land cover similarity, urban function), then uses graph neural networks to learn non-linear prediction functions. This means:

  1. GeoHG can leverage multi-dimensional features (land cover ratios, POI distributions), not just coordinates
  2. GeoHG captures non-distance relationships (two forests far apart share an entity hypernode)
  3. GeoHG outputs predictions at grid-cell resolution rather than continuous points

Trade-off: GeoHG requires training data and computation; Kriging is a closed-form solution. For small datasets with only coordinates, Kriging may be simpler and sufficient.
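
For comparison, the distance-only baseline fits in a few lines. The sketch below is plain inverse distance weighting in NumPy, using Euclidean distance on raw coordinates (itself a simplification for lon/lat data); the function name and the power parameter default are illustrative.

```python
import numpy as np

def idw(known_xy, known_values, query_xy, power=2.0, eps=1e-12):
    """Inverse distance weighting: weights are 1 / distance**power."""
    known_xy = np.asarray(known_xy, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    known_values = np.asarray(known_values, dtype=float)
    # Pairwise distances, shape (n_query, n_known).
    dist = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    weights = 1.0 / np.maximum(dist, eps) ** power
    return (weights * known_values).sum(axis=1) / weights.sum(axis=1)

# Two observations; one query halfway between them, one on top of the first.
pred = idw([[0, 0], [2, 0]], [10.0, 20.0], [[1, 0], [0, 0]])
print(pred)  # approximately [15. 10.]
```

Note that nothing but coordinates enters the prediction: two cells with identical land cover contribute to each other only if they happen to be close.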

Q: Do I need a GPU?

No. GeoHG automatically falls back to CPU when CUDA is unavailable. Training on CPU is slower but fully functional. For the built-in Guangzhou dataset (~8,500 nodes), CPU training completes in a few minutes.

Q: Can I use my own data without TIF files?

Yes. You have three options:

  1. DataFrame: GeoHGInterpolator.from_dataframe(df, ...) (easiest)
  2. CSV files: provide feature CSV + coordinate CSV + adjacency CSV via CustomLandCoverSource
  3. TIF: use geohg build-graph to automatically extract features (requires [geo] extra)

Option 1 is the simplest: just pass a pandas DataFrame with coordinate columns, a target column, and any additional feature columns.

Q: What grid resolution should I use?

This depends on your study area and data density:

  • 0.01° (~1 km) — good default for city-scale analysis
  • 0.005° (~500 m) — finer resolution, requires denser observation points
  • 0.02° (~2 km) — coarser, works with sparser data
  • The built-in data uses approximately 0.006° resolution for Guangzhou

Rule of thumb: ensure you have at least 3-5 observations per grid cell on average for reliable training.
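
The kilometre figures above are approximations that hold for the north-south direction; the east-west size of a cell shrinks with latitude. A small sketch of the conversion, assuming a spherical Earth with about 111.32 km per degree (cell_size_km is a hypothetical helper):

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def cell_size_km(resolution_deg, latitude_deg):
    """Approximate (east-west, north-south) size of a grid cell in km."""
    north_south = resolution_deg * KM_PER_DEGREE
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south

ew, ns = cell_size_km(0.01, 23.0)  # 23°N is roughly Guangzhou's latitude
print(f"{ew:.2f} km x {ns:.2f} km")  # 1.02 km x 1.11 km
```

At higher latitudes the difference grows, so a fixed resolution in degrees gives noticeably narrower cells than the nominal value suggests.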

Q: How does the entity hypernode work exactly?

Each land cover category (e.g., "Tree cover", "Built-up", "Water body") becomes a hypernode. An edge connects the entity hypernode to every grid cell that contains that land cover type, weighted by the proportion of that land cover in the cell. During message passing, the entity node aggregates information from all connected grid cells, then broadcasts it back — effectively letting all "forest cells" share information regardless of distance.
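
The aggregate-then-broadcast behaviour can be imitated with two matrix products. This is a weight-free simplification for intuition only: a real GNN layer applies learned transforms at each hop, and the function name entity_round_trip and the proportion-weighted mean are assumptions.

```python
import numpy as np

def entity_round_trip(x_area, proportions):
    """Aggregate cell features into entity nodes, then broadcast them back.

    x_area:      (n_areas, n_features) cell features
    proportions: (n_areas, n_entities) land cover share per cell
    """
    # Entity state = proportion-weighted mean of the cells it touches.
    totals = proportions.sum(axis=0, keepdims=True)
    entity_state = (proportions / np.maximum(totals, 1e-12)).T @ x_area
    # Each cell receives its entities' states, mixed by its own proportions.
    return proportions @ entity_state, entity_state

x_area = np.array([[1.0], [3.0], [10.0]])   # one feature per cell
proportions = np.array([[1.0, 0.0],         # cell 0: all forest
                        [1.0, 0.0],         # cell 1: all forest
                        [0.0, 1.0]])        # cell 2: all urban
back, entity_state = entity_round_trip(x_area, proportions)
print(entity_state.ravel(), back.ravel())  # [ 2. 10.] [ 2.  2. 10.]
```

After the round trip both forest cells carry the forest-wide average, regardless of how far apart they are on the grid.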

Q: Can I add my own node/edge types?

The current version supports area, entity, and POI node types. To add custom node types, you would extend GraphRawData in geohg/data/sources/base.py and update HeteroGraphBuilder in geohg/data/builders/graph.py. The GNN encoder automatically adapts via PyG's to_hetero().


14. Contributing

Contributions are welcome! Here's how to get started:

# Clone and install in development mode
git clone <your-repo-url>
cd GeoHG
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run a quick training to verify
geohg train --config configs/examples/guangzhou_carbon.yaml --epochs 50

Areas where contributions are particularly welcome:

  • New data sources (satellite imagery, census data, mobility data)
  • Additional GNN architectures (GAT, GIN, etc.)
  • Visualization improvements
  • New city datasets
  • Documentation and tutorials

License

MIT License. See pyproject.toml for details.

Acknowledgments

GeoHG builds on:

  • PyTorch Geometric — the heterogeneous graph framework
  • ESA WorldCover — global land cover data at 10m resolution
  • The urban computing and GeoAI research community

If you use GeoHG in your research, please cite our SIGSPATIAL 2025 paper (see top of this README).
