Spatially-Constrained Regionalization for Inference of Broadband Equity
SCRIBE is the research codebase for the paper "Beyond Data Points: Regionalizing Crowdsourced Latency Measurements" (ACM SIGMETRICS 2025). It turns sparse, unevenly distributed crowdsourced broadband measurements into coherent geographic regions that summarize latency performance — enabling policymakers and researchers to reason about internet equity at a regional rather than point-measurement level.
Note on data sources: The original research was conducted using Ookla speed test data, which is available only under a Data Use Agreement and cannot be redistributed. This repository demonstrates the same approach using M-Lab NDT data, which is openly available via BigQuery.
Large-scale crowdsourced measurement platforms (e.g., M-Lab, Ookla) generate millions of broadband performance measurements, but those measurements are spatially uneven: dense in urban cores, sparse in suburban and rural areas. Naive spatial aggregation over administrative boundaries (ZIP codes, census tracts) conflates areas with fundamentally different performance. SCRIBE addresses this by:
- Interpolating raw measurements to a continuous spatial field, filling coverage gaps.
- Tessellating the field into uniform H3 hexagonal cells to remove administrative-boundary bias.
- Clustering the tessellated cells into contiguous regions of statistically similar latency using the SKATER algorithm — producing data-driven broadband equity regions.
Raw measurements → Interpolate to grid → Overlay hexagons → Aggregate per cell → Cluster
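To make the interpolation stage concrete, here is a minimal sketch of inverse distance weighting (IDW), the default method. It is not the repository's implementation: the function name `idw_interpolate`, the `power` parameter, and the use of planar Euclidean distance on already-projected coordinates are all assumptions made for illustration.

```python
import math


def idw_interpolate(points, query, power=2.0):
    """Estimate a value at `query` from (x, y, value) samples via IDW.

    Hypothetical helper: assumes planar coordinates, so geographic
    lat/lon input would need to be projected first.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in points:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            # The query coincides with a sample: return the measurement itself.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one sample is required")
    return weighted_sum / weight_total


# Made-up latency samples: (x, y, MinRTT in ms).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]
midpoint_estimate = idw_interpolate(samples, (1.0, 0.0))
```

Evaluating such a function at every node of a regular grid yields the continuous field that the hexagon overlay then aggregates. Nearer samples dominate the estimate, and a larger `power` sharpens that effect.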
make fetch # pull M-Lab NDT MinRTT data from BigQuery
make interpolate # interpolate point measurements to a regular grid (default: IDW)
make aggregate # overlay H3 hexagons; compute per-cell latency distribution stats
make cluster # SKATER spatial clustering on hex aggregates
make evaluate # pairwise Adjusted Rand Index stability score across time periods
make all # interpolate + aggregate + cluster + evaluate

All targets operate at the level of a <City, Date Range> pair:
make all \
CITY=chicago \
START_DATE=2024-01-01 \
END_DATE=2024-03-31 \
GRANULARITY=week \
METHOD=idw

Requires Python ≥ 3.11 and uv.

make setup # installs all dependencies via uv sync

BigQuery auth uses a service account key file:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

Before running the pipeline for a new city, seed its boundary polygon from OSM:
uv run python src/seed_cities.py

| Variable | Default | Description |
|---|---|---|
| CITY | chicago | City name (must be in data/cities.geojson) |
| START_DATE / END_DATE | 2024-01-01 / 2024-12-31 | Date range |
| GRANULARITY | week | Sub-period size for stability analysis: day, week, month |
| METHOD | idw | Interpolation algorithm: idw, loess, kde |
| RESOLUTION | 8 | H3 hexagon resolution |
| N_CLUSTERS | auto | Cluster count, or auto to detect via silhouette |
| DISTANCE | Euclidean | SKATER dissimilarity metric |
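As an illustration of how GRANULARITY could divide the date range into sub-periods for stability analysis, the sketch below splits an inclusive range into days, weeks, or calendar months. The helper `split_periods` is hypothetical, and anchoring weeks on START_DATE (rather than on ISO week boundaries) is an assumption; the pipeline itself may split differently.

```python
from datetime import date, timedelta


def split_periods(start, end, granularity="week"):
    """Split the inclusive range [start, end] into consecutive sub-periods.

    Hypothetical helper: weeks are 7-day windows anchored on `start`;
    months follow calendar boundaries.
    """
    if granularity not in ("day", "week", "month"):
        raise ValueError(f"unsupported granularity: {granularity}")
    periods = []
    cursor = start
    while cursor <= end:
        if granularity == "day":
            next_start = cursor + timedelta(days=1)
        elif granularity == "week":
            next_start = cursor + timedelta(days=7)
        else:
            # Jump to the first day of the next calendar month.
            year = cursor.year + cursor.month // 12
            month = cursor.month % 12 + 1
            next_start = date(year, month, 1)
        # Clip the final period so it never runs past `end`.
        periods.append((cursor, min(next_start - timedelta(days=1), end)))
        cursor = next_start
    return periods


weeks = split_periods(date(2024, 1, 1), date(2024, 3, 31), "week")
months = split_periods(date(2024, 1, 1), date(2024, 3, 31), "month")
```

Each resulting (period_start, period_end) pair corresponds to one set of per-period intermediate files.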
All intermediates and outputs are written to /data/taveesh/scribe/:
raw/           {city}_{start}_{end}.parquet
interpolated/  {city}_{period_start}_{period_end}_{method}.parquet
aggregated/    {city}_{period_start}_{period_end}_{method}_res{N}.parquet
output/        {city}_{period_start}_{period_end}_{method}_res{N}_clusters.geojson
               {city}_{start}_{end}_{granularity}_{method}_stability.json
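The stability output is based on pairwise Adjusted Rand Index (ARI) scores across sub-periods. The sketch below shows one way to compute ARI between two cluster labelings and to summarize all pairs. It is illustrative only: the function names are hypothetical, the labelings are assumed to cover the same hex cells in the same order, and averaging the pairwise scores is an assumption about how they might be summarized.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same items (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    # Count item pairs that share a cluster in both, in A only, and in B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(len(labels_a), 2)
    expected = pairs_a * pairs_b / total_pairs if total_pairs else 0.0
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def pairwise_stability(labelings):
    """Mean ARI over all pairs of per-period labelings (assumed summary)."""
    if len(labelings) < 2:
        raise ValueError("need at least two labelings to compare")
    scores = [adjusted_rand_index(a, b) for a, b in combinations(labelings, 2)]
    return sum(scores) / len(scores)


# Made-up cluster labels for four hex cells in three sub-periods.
period_labels = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
stability = pairwise_stability(period_labels)
```

ARI ignores how clusters are numbered, so relabeled but otherwise identical partitions still score 1.0, which matters because cluster IDs need not be consistent from one period to the next.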
If you use this code, please cite:
@article{sharma2025beyond,
  title={Beyond data points: Regionalizing crowdsourced latency measurements},
  author={Sharma, Taveesh and Schmitt, Paul and Bronzino, Francesco and Feamster, Nick and Marwell, Nicole P},
  journal={Proceedings of the ACM on Measurement and Analysis of Computing Systems},
  volume={8},
  number={3},
  pages={1--24},
  year={2024},
  publisher={ACM New York, NY, USA}
}