English | 中文
A modular, pip-installable Python toolkit that uses heterogeneous graph neural networks to infer socioeconomic indicators (Carbon, GDP, Population, PM2.5, Night Light) from spatial data. Published at ACM SIGSPATIAL 2025.
TL;DR — Turn your geographic raster/vector data into a heterogeneous graph, then let message-passing GNNs learn multi-scale spatial relationships that traditional geostatistics cannot capture. Three lines of Python to go from scatter points to grid predictions.
```python
from geohg import GeoHGInterpolator

interp, result = GeoHGInterpolator.from_dataframe(
    df, coord_columns=["lon", "lat"], target_column="value", resolution=0.01
)
# result["predictions"] -> (n_lat, n_lon) grid array
```

```bibtex
@inproceedings{zou2024space,
  title={Space-aware Socioeconomic Indicator Inference with Heterogeneous Graphs},
  author={Zou, Xingchen and Huang, Jiani and Hao, Xixuan and Yang, Yuhao and Wen, Haomin and Yan, Yibo and Huang, Chao and Chen, Chao and Liang, Yuxuan},
  booktitle={The 33rd ACM International Conference on Advances in Geographic Information Systems},
  year={2025}
}
```

> "Everything is related to everything else, but near things are more related than distant things." — Tobler's First Law of Geography
Traditional spatial methods (IDW, Kriging, spatial regression) operationalize Tobler's Law through distance-based weights — the closer two locations are, the more influence they have on each other. This works well but misses a crucial insight:
Spatial relationships are richer than just proximity.
- Two forests 50 km apart may behave more similarly than a forest and a factory next door
- Commercial zones across a city share economic patterns regardless of distance
- POI distributions (restaurants, offices, parks) reveal functional urban structure that pure distance ignores
GeoHG captures all these relationships in a single heterogeneous graph:
```
┌─────────────────────────────────────────────────────┐
│           The GeoHG Heterogeneous Graph             │
│                                                     │
│    [Forest]         [Urban]          [Water]        │
│     Entity           Entity           Entity        │
│    ╱  │  ╲          ╱  │  ╲           ╱  ╲          │
│   a₁  a₅  a₈       a₂  a₄  a₇        a₃  a₆         │
│    ╲  │  ╱          ╲  │  ╱           │  ╱          │
│     ╲ │ ╱            ╲ │ ╱     ← spatial adjacency  │
│   a₉──────a₁₀─────a₁₁            (8-neighbor grid)  │
│                                                     │
│ ▪ area nodes   = grid cells (land cover features)   │
│ ▪ entity nodes = land cover categories (hypernode)  │
│ ▪ poi nodes    = POI categories (hypernode)         │
│ ▪ edges        = near / locate / rev_locate         │
└─────────────────────────────────────────────────────┘
```
Message passing on this graph lets information flow through both spatial proximity AND semantic similarity — something traditional geostatistics cannot do. Each GNN layer aggregates features from neighbors, progressively building richer representations that encode:
- Local context — what does this cell's neighborhood look like? (spatial adjacency)
- Global patterns — how do all areas of the same land cover type behave? (entity hypernodes)
- Functional similarity — what urban functions are present? (POI hypernodes)
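These three channels can be sketched with one round of mean-aggregation message passing in plain NumPy. This is a toy illustration, not the library's actual propagation rule — the features, edges, and single "forest" hypernode below are all made up:

```python
import numpy as np

# Toy features for 4 grid cells (rows) with 2-dim features (hypothetical values).
area_x = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]])
entity_x = np.zeros((1, 2))          # one "forest" hypernode, initialized to zero

near = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]   # spatial adjacency (src, dst)
locate = [0, 1, 3]                    # cells containing forest cover

# The entity node first aggregates from all forest cells (global pattern)...
entity_x[0] = area_x[locate].mean(axis=0)

# ...then each cell averages its spatial neighbors plus the entity message.
new_area_x = np.zeros_like(area_x)
for i in range(len(area_x)):
    msgs = [area_x[j] for (j, k) in near if k == i]       # local context
    if i in locate:
        msgs.append(entity_x[0])                           # global pattern
    new_area_x[i] = np.mean(msgs, axis=0)
print(new_area_x.round(2))
```

Cell 0 and cell 3 never touch spatially, yet both receive the same forest-entity message — the "shortcut" that distance-only methods lack.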
| Method | Spatial Relationships | Feature Utilization | Scalability | Multi-task |
|---|---|---|---|---|
| IDW | Distance-only | None | High | No |
| Kriging | Variogram (distance) | Limited (co-kriging) | Medium | No |
| Spatial Regression (GWR) | Distance-weighted | Linear | Medium | No |
| Random Forest | None (tabular) | All features | High | No |
| GeoHG (Ours) | Multi-relational graph | All features + structure | High | Yes |
| Spatial Interpolation | Flexible Data Input | Heterogeneous GNN | SSL Pretraining |
|---|---|---|---|
| Kriging-style interpolation via `GeoHGInterpolator` — from scatter points to grid predictions in 3 lines | Bring your own CSV/DataFrame, ESA WorldCover TIF, or use built-in sample data for 4 cities × 5 indicators | Dynamic node type inference — entity/POI counts detected from data, not hardcoded | Contrastive pretraining with graph-structure and feature-similarity neighbors |
Results on built-in Guangzhou dataset (8,540 grid cells, 70% masked as test set, seed=0):
| Task | Metric | Train | Validation | Note |
|---|---|---|---|---|
| Carbon | R² | 0.865 | 0.860 | 2000 epochs, early stop at ~170 |
| GDP | R² | — | — | Run `geohg train --task GDP` |
| Population | R² | — | — | Run `geohg train --task Population` |
| PM2.5 | R² | — | — | Run `geohg train --task PM25` |
| Night Light | R² | — | — | Run `geohg train --task Light` |
Reproduce with:
```bash
geohg train --config configs/examples/guangzhou_carbon.yaml
```

All 4 cities (GZ/BJ/SH/SZ) × 5 indicators are available. Run your own benchmarks!
- 1. Installation
- 2. Quick Start
- 3. Spatial Interpolation
- 4. Core Concepts: From Geographic Space to Graph
- 5. Application Scenarios
- 6. Features
- 7. Project Structure
- 8. Data Format
- 9. CLI Reference
- 10. Configuration
- 11. Self-Supervised Pretraining
- 12. Building Graphs from TIF
- 13. FAQ
- 14. Contributing
```bash
conda create -n geohg python=3.10 -y
conda activate geohg

git clone <your-repo-url>
cd GeoHG

# Standard install
pip install -e .

# With TIF/rasterio support (for building graphs from ESA WorldCover)
pip install -e ".[geo]"

# With development tools
pip install -e ".[dev]"
```

Verify the installation:
```bash
geohg --version
python -c "from geohg import GeoHGPipeline; print('OK')"
```

```python
from geohg import GeoHGPipeline

# One line: load built-in Guangzhou data, train, evaluate
results = GeoHGPipeline.quick_start(city="GZ", task="Carbon")
print(f"Test R²: {results['r2']:.4f}")
```

```python
from geohg import GeoHGInterpolator

interp, result = GeoHGInterpolator.from_dataframe(
    df, coord_columns=["lon", "lat"],
    target_column="value", resolution=0.01,
)
# result["predictions"] is a (n_lat, n_lon) grid array
```

See Section 3 for details and `notebooks/quickstart_interpolation.ipynb` for a full tutorial.
```bash
# Train a model
geohg train --config configs/examples/guangzhou_carbon.yaml

# End-to-end: build graph + train + evaluate
geohg run --config configs/examples/guangzhou_carbon.yaml
```

```python
from geohg import GeoHGPipeline

pipeline = GeoHGPipeline.from_yaml("configs/examples/guangzhou_carbon.yaml")
results = pipeline.run()
```

GeoHG can perform spatial interpolation similar to Kriging, but using heterogeneous graph neural networks on discrete grids.
Key difference from Kriging: GeoHG divides space into regular grid cells at a user-specified resolution (e.g., 0.01° ≈ 1 km), constructs a heterogeneous graph over the grid, and predicts a value for each cell — rather than interpolating a continuous surface.
```
Scatter points (N, 2)            GeoHG Pipeline
--------------------------       ---------------------------
coords + features + values       1. Grid discretization (resolution)
     |                           2. 8-neighbor adjacency + entity hyper-nodes
     +--> UserDataSource -->     3. HeteroGraphBuilder -> HeteroData
                                 4. Train GeoHGModel (observed -> train/val)
                                 5. Predict all grid cells -> (n_lat, n_lon) array
```
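Step 1 (grid discretization) amounts to binning each scatter point into a cell index at the chosen resolution. A minimal NumPy sketch — variable names and coordinates are illustrative, not the library's internals:

```python
import numpy as np

resolution = 0.01                        # ~1 km at mid latitudes
lons = np.array([113.12, 113.125, 113.31])
lats = np.array([22.51, 22.512, 22.74])

lon0, lat0 = lons.min(), lats.min()      # grid origin
col = np.floor((lons - lon0) / resolution).astype(int)
row = np.floor((lats - lat0) / resolution).astype(int)

# Points sharing a (row, col) pair fall into the same area node.
print(list(zip(row, col)))
```

Here the first two points land in the same cell, so their observations are averaged into one area node's target value.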
```python
import numpy as np
from geohg import GeoHGInterpolator

coords = np.column_stack([lons, lats])      # (N, 2)
features = np.column_stack([f1, f2, f3])    # (N, F)

# One-liner from DataFrame
interp, result = GeoHGInterpolator.from_dataframe(
    df, coord_columns=["lon", "lat"],
    target_column="value", resolution=0.01,
)

# Or step by step
interp = GeoHGInterpolator(resolution=0.01, epochs=1000)
interp.fit(coords, features, values)        # values: (N,) observed targets
result = interp.predict()  # {"predictions": (n_lat, n_lon), "lons": ..., "lats": ...}
```

```bash
geohg interpolate \
    --data my_data.csv \
    --coord-columns lon,lat \
    --target-column target \
    --resolution 0.01 \
    --output predictions.csv \
    --epochs 1000
```

See `notebooks/quickstart_interpolation.ipynb` for a complete tutorial including:
- Synthetic data interpolation
- Real-world case: Guangzhou carbon emission interpolation with 8,540 grid cells
This section explains the key ideas behind GeoHG for GIS practitioners.
Like rasterization in GIS, GeoHG divides the study area into regular grid cells. Each cell becomes an area node in the graph, with land cover ratios as its feature vector.
```
Continuous geographic space             Discrete grid (area nodes)
┌────────────────────┐                  ┌───┬───┬───┬───┐
│   ~ forest ~       │                  │ a₀│ a₁│ a₂│ a₃│   Each cell:
│   ┌──urban──┐      │   ──────────>    ├───┼───┼───┼───┤   - land cover ratios
│   │ ■■■■■■■ │      │    grid at       │ a₄│ a₅│ a₆│ a₇│   - position encoding
│   └─────────┘      │    0.01° res     ├───┼───┼───┼───┤   - (optional POI counts)
│   ~ water ~~~~~    │                  │ a₈│ a₉│a₁₀│a₁₁│
└────────────────────┘                  └───┴───┴───┴───┘
```
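The per-cell feature vector (land cover ratios) can be derived from a categorical raster by counting class pixels inside each cell. A minimal NumPy sketch — the class codes and 2×2 block size are made up for illustration:

```python
import numpy as np

# 4x4 categorical raster; classes: 0 = forest, 1 = urban, 2 = water (hypothetical)
raster = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 0, 1],
                   [2, 2, 0, 0]])
cell = 2                                  # each area node covers a 2x2 pixel block
n_classes = 3

ratios = []
for i in range(0, raster.shape[0], cell):
    for j in range(0, raster.shape[1], cell):
        block = raster[i:i + cell, j:j + cell].ravel()
        counts = np.bincount(block, minlength=n_classes)
        ratios.append(counts / counts.sum())  # land cover ratios for one cell
ratios = np.array(ratios)                     # (n_cells, n_classes)
print(ratios)
```

The bottom-right cell mixes forest and urban pixels, so its feature vector is fractional — exactly the "ratio" features the area nodes carry.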
Each grid cell is connected to its 8 neighbors (queen contiguity), forming (area, near, area) edges. This encodes Tobler's First Law — near things are more related.
```
┌───┬───┬───┐
│ ↖ │ ↑ │ ↗ │   8-neighbor connectivity
├───┼───┼───┤   = Queen contiguity in GIS
│ ← │ ● │ → │   = (area, near, area) edges
├───┼───┼───┤
│ ↙ │ ↓ │ ↘ │
└───┴───┴───┘
```
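Queen contiguity on a regular grid can be enumerated directly. A short pure-Python sketch of what an adjacency builder might do — illustrative, not the actual code in `adjacency.py`:

```python
def queen_edges(n_rows, n_cols):
    """Return directed (src, dst) pairs linking each cell to its 8 neighbors."""
    idx = lambda r, c: r * n_cols + c
    edges = []
    for r in range(n_rows):
        for c in range(n_cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == dc == 0:
                        continue                      # skip the cell itself
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        edges.append((idx(r, c), idx(rr, cc)))
    return edges

edges = queen_edges(3, 3)
print(len(edges))  # 40: 4 corners x 3 + 4 edge cells x 5 + 1 center x 8
```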
Here is where GeoHG goes beyond traditional spatial methods. Each land cover category (forest, urban, water, ...) becomes an entity hypernode that connects to all grid cells containing that category. This creates shortcuts in the graph:
```
        [Forest Entity]
        ╱      │      ╲
   a₀(70%)  a₅(40%)  a₁₁(90%)   ← cells with forest cover

        [Urban Entity]
        ╱      │      ╲
   a₃(80%)  a₆(55%)  a₇(60%)    ← cells with urban cover
```
Why this matters: A forest cell in the north and a forest cell in the south — even if far apart — can exchange information through the shared Forest entity node. This captures landscape-level patterns that distance-based methods miss entirely.
Similarly, POI categories (restaurants, offices, parks, hospitals, ...) become hypernode types. Grid cells are connected to POI nodes based on their POI distributions, capturing functional urban structure.
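Constructing the `(entity, locate, area)` edges from a cell-by-class ratio matrix is essentially a thresholded sparse lookup. A NumPy sketch — the `entity_thresh` knob mirrors the config option of the same name, everything else is illustrative:

```python
import numpy as np

# ratios[i, k] = share of land cover class k in grid cell i (hypothetical values)
ratios = np.array([[0.7, 0.3, 0.0],
                   [0.0, 0.4, 0.6],
                   [0.9, 0.0, 0.1]])
entity_thresh = 0.05                     # drop negligible coverage

# One edge per (class, cell) pair whose coverage exceeds the threshold.
entity_id, area_id = np.nonzero(ratios.T > entity_thresh)
weight = ratios.T[entity_id, area_id]    # edge weight = coverage proportion

for e, a, w in zip(entity_id, area_id, weight):
    print(f"entity {e} -locate-> area {a} (proportion {w:.1f})")
```

POI edges are built the same way from the cell-by-POI-category matrix; the reverse `rev_locate` edges simply swap the two index arrays.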
The complete graph has 3 node types and 5 edge types:
| Node Type | Count (Guangzhou) | Features |
|---|---|---|
| `area` | 8,540 | Land cover ratios + position encoding |
| `entity` | 9 | Identity (one-hot) |
| `poi` | 14 | Identity (one-hot) |
| Edge Type | Count (Guangzhou) | Meaning |
|---|---|---|
| `(area, near, area)` | 67,172 | Spatial adjacency |
| `(entity, locate, area)` | 24,902 | Entity covers area |
| `(area, rev_locate, entity)` | 24,902 | Reverse of above |
| `(poi, locate, area)` | 9,957 | POI exists in area |
| `(area, rev_locate, poi)` | 9,957 | Reverse of above |
A GNN converted with PyG's `to_hetero()` learns a separate message function for each edge type, then aggregates them — like running a specialized spatial analysis for each type of geographic relationship, all learned end-to-end.
GeoHG is designed for any task where you need to infer a spatially distributed socioeconomic indicator from land cover and/or POI data:
| Scenario | Target Variable | Input Features | Example |
|---|---|---|---|
| Carbon Emission Mapping | CO₂ emissions per grid cell | Land cover ratios, POI density | Urban carbon inventory |
| Economic Activity Estimation | GDP per grid cell | Land cover, commercial POI | Regional economic assessment |
| Population Distribution | Population density | Built-up area ratio, residential POI | Census disaggregation |
| Air Quality Prediction | PM2.5 concentration | Vegetation ratio, industrial land | Environmental monitoring |
| Night Light Estimation | Luminosity index | Urban land ratio, commercial POI | Urbanization tracking |
| Custom Indicator | Your own target | Your CSV/DataFrame features | GeoHGInterpolator.from_dataframe() |
When to use GeoHG over traditional methods:
- You have multi-dimensional features (not just coordinates)
- Your study area has heterogeneous land cover or diverse urban functions
- You want to leverage structural similarity between distant but functionally similar areas
- You need predictions at grid-cell resolution across the entire study area
- Spatial interpolation — Kriging-style interpolation via `GeoHGInterpolator` (scatter points → grid predictions)
- End-to-end pipeline — from raw data to trained model in one command or a few lines of Python
- Modular architecture — configurable GNN encoder, MLP head, and training components
- Dynamic graph construction — entity/POI node types inferred from data, not hardcoded
- Multiple data sources — custom CSV/DataFrame, ESA WorldCover TIF, or built-in sample data
- POI optional — works with or without POI data (`hypernode: entity` or `mono`)
- Self-supervised pretraining — contrastive GNN pretraining with neighbor sampling
- YAML configuration — three-level priority: defaults < user config < CLI overrides
- CPU/GPU support — automatic fallback to CPU when CUDA is unavailable
```
GeoHG/
├── pyproject.toml                  # Package definition & dependencies
├── configs/
│   ├── default.yaml                # Global default configuration
│   └── examples/
│       ├── guangzhou_carbon.yaml   # Guangzhou carbon emission example
│       ├── no_poi.yaml             # Training without POI data
│       └── ssl_pretrain.yaml       # Self-supervised pretraining
├── geohg/                          # Main package
│   ├── config/                     # Dataclass config schema + YAML loader
│   ├── data/
│   │   ├── sources/                # Data source abstractions
│   │   │   ├── base.py             # LandCoverSource ABC + GraphRawData
│   │   │   ├── esa_worldcover.py   # ESA TIF processing (optional GDAL)
│   │   │   ├── custom_landcover.py # User-provided CSV
│   │   │   └── poi.py              # Optional POI augmentation
│   │   ├── builders/               # Graph construction
│   │   │   ├── graph.py            # HeteroGraphBuilder -> PyG HeteroData
│   │   │   ├── adjacency.py        # 8-neighbor grid adjacency
│   │   │   ├── grid.py             # TIF grid utilities
│   │   │   └── features.py         # Pixel-to-ratio feature extraction
│   │   ├── legacy.py               # Backward-compatible loader
│   │   └── transforms.py           # TargetNormalizer + train/val/test split
│   ├── models/
│   │   ├── gnn.py                  # GNNEncoder (configurable layers/type)
│   │   ├── heads.py                # MLPHead (configurable dims)
│   │   ├── geohg.py                # GeoHGModel = GNN + to_hetero + Head
│   │   └── ssl.py                  # Contrastive loss + SSLNeighborDataset
│   ├── training/
│   │   ├── trainer.py              # Supervised training loop
│   │   ├── ssl_trainer.py          # SSL pretraining loop
│   │   ├── evaluator.py            # R2/RMSE/MAE metrics
│   │   └── callbacks.py            # EarlyStopping + ModelCheckpoint
│   ├── pipeline.py                 # GeoHGPipeline (end-to-end orchestration)
│   ├── interpolator.py             # GeoHGInterpolator (spatial interpolation)
│   ├── cli/                        # Click CLI commands
│   │   ├── main.py                 # geohg entry point
│   │   ├── train.py                # geohg train
│   │   ├── pretrain.py             # geohg pretrain
│   │   ├── evaluate.py             # geohg evaluate
│   │   ├── run.py                  # geohg run (end-to-end)
│   │   ├── build.py                # geohg build-graph
│   │   └── interpolate.py          # geohg interpolate
│   ├── visualization/              # Training & prediction plots
│   └── utils/                      # Seed, device, logging
├── data/                           # Sample data (4 cities)
│   ├── Hyper_Graph/{GZ,BJ,SH,SZ}/  # Graph & feature files
│   └── downstream_tasks/{city}/    # Task label files
├── notebooks/                      # Jupyter tutorials
│   └── quickstart_interpolation.ipynb
└── tests/                          # Test suite
```
The easiest way to use GeoHG with your own data — provide three CSV files:
```python
from geohg.data.sources.custom_landcover import CustomLandCoverSource

source = CustomLandCoverSource(
    feature_csv="my_features.csv",   # area_id, type1_ratio, type2_ratio, ...
    coord_csv="my_coords.csv",       # area_id, lon, lat
    adjacency_csv="my_edges.csv",    # src_id, dst_id
)
raw = source.load()
```

Or use `GeoHGInterpolator.from_dataframe(df, ...)` to feed a DataFrame directly (see Section 3).
Build graphs from GeoTIFF with automatic grid and adjacency construction (requires the `[geo]` extra):

```bash
pip install -e ".[geo]"
geohg build-graph --tif-path ESA_WorldCover.tif --bbox 113.09 113.69 22.40 23.42 --output-dir data/my_city
```

The repository ships with sample data for 4 cities (GZ/BJ/SH/SZ) × 5 indicators under `data/`, all in CSV format. Use them directly with `GeoHGPipeline.quick_start(city="GZ", task="Carbon")`.
Graph files under `data/Hyper_Graph/{city}/`:

| File | Columns | Description |
|---|---|---|
| `adjacency.csv` | `src_id, dst_id` | Area adjacency edges |
| `entity_area.csv` | `entity_id, area_id, proportion` | Entity-locate-area relations |
| `poi_area.csv` | `poi_id, area_id, proportion` | POI-locate-area relations |
| `pos_encode.csv` | `area_id, x, y` | Grid position encodings |
| `TIF_feature.csv` | `File, Coordinates, Area, Value_*_Ratio` | Land cover feature ratios |
| `POI_feature.csv` | `TIF, POI_0, ..., POI_13, Count` | POI feature ratios |
Task labels under `data/downstream_tasks/{city}/`:

| File | Columns | Description |
|---|---|---|
| `Carbon.csv` | `area_id, value` | Carbon emission |
| `GDP.csv` | `area_id, value` | GDP |
| `Population.csv` | `area_id, value` | Population |
| `PM25.csv` | `area_id, value` | PM2.5 concentration |
| `Light.csv` | `area_id, value` | Night light |
Entity/POI type counts are automatically detected from the data files.
All commands accept `--config <path>` for YAML configuration. CLI flags override config values.
| Command | Description |
|---|---|
| `geohg train` | Supervised training |
| `geohg pretrain` | Self-supervised contrastive pretraining |
| `geohg evaluate --model-path <path>` | Evaluate a saved model |
| `geohg run` | End-to-end pipeline (build + train + evaluate) |
| `geohg build-graph` | Build graph from TIF (requires `[geo]` extra) |
| `geohg interpolate` | Spatial interpolation from CSV scatter data |
Common options:
```bash
geohg train --config configs/examples/guangzhou_carbon.yaml \
    --city GZ \
    --task Carbon \
    --epochs 2000 \
    --lr 0.01 \
    --masked-ratio 0.7 \
    --metric r2 \
    --gpu 0
```

Configuration uses a three-level priority system: `configs/default.yaml` < user YAML < CLI flags.
```yaml
# configs/examples/guangzhou_carbon.yaml
data:
  city: GZ
  task: Carbon
  prebuilt_dir: data/Hyper_Graph
  downstream_dir: data/downstream_tasks
  pos_embedding: true
  hypernode: all          # all | entity | poi | mono
  entity_thresh: 0.0
  poi_thresh: 0.0

model:
  hidden_channels: 64
  num_gnn_layers: 3
  gnn_type: GraphConv     # GraphConv | SAGEConv
  head_dims: [32, 16]
  dropout: 0.5

training:
  epochs: 2000
  lr: 0.01
  metric: r2              # r2 | rmse | mae
  patience: 100
  masked_ratio: 0.7
  seed: 0

output:
  log_dir: outputs/logs
  model_dir: outputs/models
  plot_dir: outputs/plots
```

See `configs/examples/` for more configuration templates.
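The three-level priority (defaults < user YAML < CLI flags) can be pictured as successive dictionary merges where later sources win. A minimal sketch with hypothetical values — not the package's actual loader:

```python
defaults = {"training": {"epochs": 2000, "lr": 0.01}}
user_yaml = {"training": {"lr": 0.005}}       # user overrides the learning rate
cli_flags = {"training": {"epochs": 500}}     # CLI overrides the epoch count

def deep_merge(base, override):
    """Recursively merge override into base; override wins on conflicts."""
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

cfg = deep_merge(deep_merge(defaults, user_yaml), cli_flags)
print(cfg)  # {'training': {'epochs': 500, 'lr': 0.005}}
```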
Contrastive pretraining of the GNN encoder using graph-structure and feature-similarity neighbors:
```bash
# Pretrain
geohg pretrain --config configs/examples/ssl_pretrain.yaml --city GZ

# Fine-tune with the pretrained encoder
geohg train --config configs/examples/guangzhou_carbon.yaml \
    --pretrained-gnn outputs/models/ssl_GZ.pth \
    --freeze-gnn
```

Or via Python:

```python
from geohg import GeoHGPipeline

pipeline = GeoHGPipeline.from_yaml("configs/examples/ssl_pretrain.yaml")
results = pipeline.run(skip_pretrain=False)
```

Why SSL? When labeled data is scarce (e.g., only a few ground-truth monitoring stations), self-supervised pretraining learns useful spatial representations from the graph structure alone; fine-tuning on the small labeled set then often yields better results than training from scratch.
GeoHG can build heterogeneous graphs directly from ESA WorldCover GeoTIFF files:
```bash
pip install -e ".[geo]"
geohg build-graph \
    --tif-path path/to/ESA_WorldCover.tif \
    --bbox 113.09 113.69 22.40 23.42 \
    --output-dir data/my_city
```

The output files (feature CSV, adjacency, position encodings) can then be used with `geohg train`.
For custom data without GDAL, provide CSVs directly — see Data Format.
Supported land cover sources:
- ESA WorldCover 10m — global land cover at 10m resolution
- Custom GeoTIFF — any categorical raster with land cover classes
- Custom CSV — pre-computed land cover ratios per grid cell
Q: How is GeoHG different from Kriging / IDW?
Kriging and IDW are distance-based interpolation methods — they predict values at unknown locations using weighted averages of nearby observations, where weights depend solely on distance (and variogram in Kriging's case).
GeoHG uses a fundamentally different approach: it constructs a heterogeneous graph that encodes multiple types of spatial relationships (proximity, land cover similarity, urban function), then uses graph neural networks to learn non-linear prediction functions. This means:
- GeoHG can leverage multi-dimensional features (land cover ratios, POI distributions), not just coordinates
- GeoHG captures non-distance relationships (two forests far apart share an entity hypernode)
- GeoHG outputs predictions at grid-cell resolution rather than continuous points
Trade-off: GeoHG requires training data and computation; Kriging is a closed-form solution. For small datasets with only coordinates, Kriging may be simpler and sufficient.
Q: Do I need a GPU?
No. GeoHG automatically falls back to CPU when CUDA is unavailable. Training on CPU is slower but fully functional. For the built-in Guangzhou dataset (~8,500 nodes), CPU training completes in a few minutes.
Q: Can I use my own data without TIF files?
Yes. You have three options:
1. DataFrame — `GeoHGInterpolator.from_dataframe(df, ...)` (easiest)
2. CSV files — provide feature CSV + coordinate CSV + adjacency CSV via `CustomLandCoverSource`
3. TIF — use `geohg build-graph` to extract features automatically (requires the `[geo]` extra)
Option 1 is the simplest: just pass a pandas DataFrame with coordinate columns, a target column, and any additional feature columns.
Q: What grid resolution should I use?
This depends on your study area and data density:
- 0.01° (~1 km) — good default for city-scale analysis
- 0.005° (~500 m) — finer resolution, requires denser observation points
- 0.02° (~2 km) — coarser, works with sparser data
- The built-in data uses approximately 0.006° resolution for Guangzhou
Rule of thumb: ensure you have at least 3-5 observations per grid cell on average for reliable training.
Q: How does the entity hypernode work exactly?
Each land cover category (e.g., "Tree cover", "Built-up", "Water body") becomes a hypernode. An edge connects the entity hypernode to every grid cell that contains that land cover type, weighted by the proportion of that land cover in the cell. During message passing, the entity node aggregates information from all connected grid cells, then broadcasts it back — effectively letting all "forest cells" share information regardless of distance.
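The aggregate-then-broadcast behaviour described above can be sketched in two NumPy steps. The weights and features below are made up; this illustrates the mechanism, not the model's learned functions:

```python
import numpy as np

# Three forest cells with 2-dim features and forest-cover proportions (hypothetical).
cell_x = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
w = np.array([0.9, 0.4, 0.7])            # proportion of forest in each cell

# Step 1: the entity node aggregates a coverage-weighted mean of its cells.
forest = (w[:, None] * cell_x).sum(axis=0) / w.sum()

# Step 2: the entity message is broadcast back to every connected cell,
# so even distant forest cells receive the same landscape-level signal.
cell_x_new = cell_x + w[:, None] * forest
print(forest.round(3))
```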
Q: Can I add my own node/edge types?
The current version supports `area`, `entity`, and `poi` node types. To add custom node types, you would extend `GraphRawData` in `geohg/data/sources/base.py` and update `HeteroGraphBuilder` in `geohg/data/builders/graph.py`. The GNN encoder automatically adapts via PyG's `to_hetero()`.
Contributions are welcome! Here's how to get started:
```bash
# Clone and install in development mode
git clone <your-repo-url>
cd GeoHG
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run a quick training to verify
geohg train --config configs/examples/guangzhou_carbon.yaml --epochs 50
```

Areas where contributions are particularly welcome:
- New data sources (satellite imagery, census data, mobility data)
- Additional GNN architectures (GAT, GIN, etc.)
- Visualization improvements
- New city datasets
- Documentation and tutorials
MIT License. See pyproject.toml for details.
GeoHG builds on:
- PyTorch Geometric — the heterogeneous graph framework
- ESA WorldCover — global land cover data at 10m resolution
- The urban computing and GeoAI research community
If you use GeoHG in your research, please cite our SIGSPATIAL 2025 paper (see top of this README).