CryoSentinel

A foundation-model segmenter for glacial lakes from Sentinel-1 SAR, Sentinel-2 optical, and Copernicus DEM imagery, fine-tuned and evaluated under a spatial-block-split protocol over the Tien Shan, Zhetysu Alatau, and Ile Alatau ranges of Central Asia.

Weights · Method · Benchmarks · Label-noise audit · Limitations · Rights · Operational system (GLOFcast)

Figure 1. Sentinel-2 L2A true-colour panorama of the Ile Alatau range, summer 2023. Almaty (Kazakhstan, pop. 2.0 M) sits at the foot of the range in the upper left; the eastern shore of Lake Issyk-Kul (Kyrgyzstan) is visible in the lower right. The glacial-lake belt at 3,400–4,000 m elevation runs along the spine of the range and is the operational region targeted by CryoSentinel.

Figure 2. CryoSentinel finds glacial lakes that the public reference inventory misses. Held-out test chip at (42.99° N, 76.71° E), 2023. Left to right: Sentinel-2 RGB (two lakes clearly visible), MNDWI water index (independent physics confirmation, both lakes light blue), Kumar & Vijay 2026 ground truth (only one lake polygon — the model gets penalised for the second), CryoSentinel prediction (both lakes correctly segmented). The model is right; the supervision label is wrong. Full audit of seven such chips in docs/LABEL_NOISE_AUDIT.md.

What this is

CryoSentinel is the segmentation backbone behind GLOFcast — an open-source screening layer for glacial lake outburst flood (GLOF) hazard. The Tien Shan inventory of glacial lakes grew from roughly 1,800 lakes in 1990 to over 4,500 in 2023 (Bao et al., 2024). Existing operational early-warning services are physical (sensors deployed on the moraine dam) and cover the order of 25 priority lakes per pilot site. The remaining several thousand lakes can only be screened by satellite. CryoSentinel addresses that screening problem.

The model takes a 224 × 224 chip at 10 m/pixel containing twelve Sentinel-2 L2A bands, two Sentinel-1 GRD polarisations (VV and VH), and one Copernicus 30 m DEM band, and returns a binary water mask. The chip extent is roughly 2.2 × 2.2 km on the ground.

This repository releases the training and inference code, the configs that produced the headline numbers, two production checkpoints on Hugging Face, and the multimodal training dataset.

How it works, in plain language

For non-machine-learning readers, the model essentially answers one question per pixel: given what the satellite saw at this point on this date, is this water, or is it something else (rock, ice, debris, vegetation, snow)? It uses three independent kinds of information in parallel:

Optical imagery (Sentinel-2) — what the eye would see if you flew over the glacier on a cloudless day. Water absorbs near-infrared light strongly, so it shows up dark in those bands.
Radar imagery (Sentinel-1) — pulses of microwave energy that work through clouds and at night. Water surfaces reflect almost nothing back to the satellite, so they show up dark in radar too.
Topography (Copernicus DEM) — the elevation map. Water collects in flat-bottomed depressions; the model learns this geometry as a third independent constraint.

The model was pretrained on 9 million multimodal samples by IBM Research and the European Space Agency (the TerraMind foundation model, Apache 2.0), then fine-tuned by us on roughly 5,600 hand-labelled chips from the Tien Shan, Zhetysu Alatau, and Ile Alatau ranges. When all three input channels agree on "water", the model is highly confident; when they disagree, the model has to weigh the evidence — and on a small fraction of chips (~ 1 %) where the public reference labels are wrong, it can still get the answer right because two of three independent physical signals agree with each other.

The sections below are written for ML researchers and engineers. For geographers, glaciologists, and risk managers, the Method and Limitations documents are the more accessible entry points.

Affiliation disclosure

CryoSentinel is a research-stage prototype developed independently. It is conceived as a complement to existing public-agency early-warning systems — in particular «Казселезащита» (Kazakhstan) and the UNESCO GLOFCA programme — but is not currently affiliated with, endorsed by, or operationally integrated with these or any other agency. References to either organisation in this documentation describe institutional context, not partnership.

Headline result

Metric	Value	Notes
Validation IoU @ thr=0.5, TTA	0.9557	Same train/val protocol as Adhikari & Regmi (2025); they report 0.9130
Validation IoU @ best threshold (0.70)	0.9596
Held-out test IoU @ thr=0.5, TTA (n = 665)	0.8918	Adhikari & Regmi (2025) do not provide a held-out test split
Held-out test IoU after label-noise audit (n = 658)	0.9082	After dropping 7 chips with documented Kumar & Vijay (2026) polygon mislabelling
Per-region test IoU — Zhetysu Alatau	0.9312	best out-of-domain transfer
Per-region test IoU — Tien Shan (full)	0.9027
Per-region test IoU — Ile Alatau (label-corrected)	0.9285	from 0.7664 raw, +16.21 pp after noise audit
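For readers less familiar with the metric, the following is a minimal sketch of "IoU @ thr", not the repository's evaluation code. It assumes IoU is computed on binarised probabilities; whether the headline figures are averaged per chip or pooled over all pixels is not stated here, and the name `iou_at_threshold` and the flat-list layout are hypothetical.

```python
def iou_at_threshold(probs, labels, thr=0.5):
    """Intersection-over-union of a thresholded probability map against a binary mask.

    probs and labels are flat sequences of equal length (hypothetical layout).
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    inter = union = 0
    for p, y in zip(probs, labels):
        pred = p >= thr
        truth = bool(y)
        inter += pred and truth
        union += pred or truth
    # Convention assumed here: an empty prediction on an empty label scores 1.0.
    return inter / union if union else 1.0


# Toy example: raising the threshold removes one false-positive pixel.
probs = [0.9, 0.8, 0.6, 0.4, 0.1]
labels = [1, 1, 0, 0, 0]
iou_050 = iou_at_threshold(probs, labels, thr=0.5)  # 3 pixels predicted, 2 correct
iou_070 = iou_at_threshold(probs, labels, thr=0.7)  # 2 pixels predicted, 2 correct
```

Sweeping `thr` over a grid on the validation split and keeping the best value is one plausible way to arrive at a "best threshold" row like the one in the table.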

The model was trained on Sentinel-1 + Sentinel-2 + DEM jointly, in contrast to the SAR-only baseline of Adhikari & Regmi (2025). Train, validation, and test chips are spatially disjoint under a SHA-1-hashed 0.15° × 0.15° block split (blocks are about 17 km on a side north to south) with a 0.02° (~ 2.2 km) buffer around each block boundary (src/cryosentinel/data/block_split.py). The split is year-invariant across the 2017, 2021, 2022, and 2023 acquisitions, so a location cannot appear in the training split in one year and in the test split in another.

We are not aware of a published model that reports a higher validation IoU on a comparable spatial-block-split, multi-modal protocol for glacial lake segmentation in High Mountain Asia. If you find one, please open an issue.

What's in this release

The training and evaluation code (src/cryosentinel/, scripts/train_terramind.py, scripts/train_terramind_modal.py, scripts/eval_stage4b_diagnostics.py).
Two production configs (configs/terramind_v3_pretrain_v2.yaml, configs/terramind_v3_finetune_almaty_v2.yaml) with inline comments documenting the eleven engineering changes between the v1 and v2 finetune stages — see docs/METHOD.md §3 for the full diff.
Two production checkpoints on Hugging Face under abzal-glw/cryosentinel-terramind-v3:
- terramind_v3_finetune_almaty_v2/checkpoints/soup.ckpt — the production checkpoint, a uniform weight-space average of five SWA snapshots collected from epoch 12 to epoch 30 (Wortsman et al., 2022 / Izmailov et al., 2018). This is what produces every headline number.
- terramind_v3_finetune_almaty_v2/checkpoints/step001605-iou0.952.ckpt — the single best checkpoint, kept for ablation against the soup.
Per-chip diagnostics for the validation and test splits as Parquet files at abzal-glw/cryosentinel-terramind-v3/terramind_v3_finetune_almaty_v2__soup_tta1/diagnostics/.
scripts/reproduce_benchmarks.sh — the single command that reproduces the headline table from a fresh clone.

The training dataset (multimodal chips v3, 5,614 chips for the finetune split, ~ 30 GiB total) is publicly available at abzal-glw/cryosentinel-glof-v3 on Hugging Face Datasets. All upstream data sources are open (Copernicus open-access for Sentinel-1/2 + COPDEM30; PANGAEA CC-BY for the Kumar & Vijay 2026 inventory). See docs/DATA.md for the per-band normalisation statistics and download instructions.

Quickstart

Setup

Use Python 3.11 in a fresh virtual environment:

git clone https://github.com/abzalabdrash/cryosentinel.git
cd cryosentinel
python -m venv .venv && source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -e ".[dev]"

Inference on released chips

modal run scripts/predict_modal.py::predict \
    --hf-data-repo abzal-glw/cryosentinel-glof-v3 \
    --hf-ckpt-repo abzal-glw/cryosentinel-terramind-v3 \
    --ckpt-uri hf://abzal-glw/cryosentinel-terramind-v3/terramind_v3_finetune_almaty_v2/checkpoints/soup.ckpt \
    --config configs/terramind_v3_finetune_almaty_v2.yaml \
    --region ile_alatau --year 2023 --n-chips 30 \
    --gpu L4

Reproduce the headline table

On Modal H100 (~$3, ~25 minutes):

modal run scripts/eval_stage4b_diagnostics.py::diag \
    --config configs/terramind_v3_finetune_almaty_v2.yaml \
    --ckpt-uri hf://abzal-glw/cryosentinel-terramind-v3/terramind_v3_finetune_almaty_v2/checkpoints/soup.ckpt \
    --run-name reproduce_headline

The full training-from-scratch sequence (Stage 4a pretrain on twelve HMA sub-regions, then Stage 4b finetune on the three Almaty-corridor regions) takes about twelve hours on H100 and costs roughly $50 in cloud credits. See docs/REPRODUCING.md for the exact commands and local-GPU notes.

Method

The architecture is straightforward. The encoder is TerraMind 1.0 Large (1.1 B parameters, dual-scale transformer encoder-decoder pretrained on 9 M spatiotemporally-aligned multimodal samples from the TerraMesh dataset; Jakubik et al., 2025). The decoder is a UperNet head with a 256 → 128 → 64 → 32 channel sequence and a LearnedInterpolateToPyramidal neck. The loss is a weighted combination of binary cross-entropy with pos_weight = 100, a flat Dice term, the per-image Lovász softmax (Berman et al., 2018), and an asymmetric Tversky term (α = 0.25, β = 0.75); we apply OHEM hard-negative mining with keep_ratio = 0.5 and min_kept = 4096. The optimiser is AdamW in two parameter groups, with a backbone learning rate of 5e-6 and a decoder learning rate of 5e-4, both with cosine warm-up. We collect SWA snapshots from 40% of the schedule (epoch 12) through epoch 30 and average them in weight space. EMA (decay 0.999, CPU shadow) is applied at validation and test time. Test-time augmentation is flip-only — we found rotation TTA degrades TerraMind because of how its positional encoders interact with rotated inputs.
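The flip-only test-time augmentation mentioned above can be illustrated with a small sketch. This is not the repository's implementation: it assumes the augmented views are identity, horizontal flip, and vertical flip, treats a chip as a single-channel 2-D grid rather than a 15-channel tensor, and the names `flip_tta` and `left_biased` are hypothetical.

```python
def hflip(grid):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Reverse the row order (up-down flip)."""
    return grid[::-1]


def flip_tta(predict, chip):
    """Average predictions over identity, horizontal-flip and vertical-flip views.

    Each prediction is flipped back to the original orientation before averaging.
    """
    views = [
        (lambda g: g, lambda g: g),
        (hflip, hflip),
        (vflip, vflip),
    ]
    outputs = [undo(predict(apply(chip))) for apply, undo in views]
    h, w = len(chip), len(chip[0])
    return [[sum(o[r][c] for o in outputs) / len(outputs) for c in range(w)]
            for r in range(h)]


def left_biased(grid):
    # Toy "model" that always marks the left-most column as water.
    return [[1.0 if c == 0 else 0.0 for c in range(len(row))] for row in grid]


chip = [[0.0, 0.0], [0.0, 0.0]]
averaged = flip_tta(left_biased, chip)
```

The toy predictor has a positional bias, so the horizontally flipped view disagrees with the other two and the average pulls the left column below 1.0. Rotations are deliberately absent from `views`, in line with the observation above that rotation TTA hurt.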

The complete hyperparameter table, the eleven v1 → v2 engineering fixes, and a discussion of why each one mattered are in docs/METHOD.md. The total training cost across the v1 mistake, the v2 pretrain, and the v2 finetune was approximately $90 of Modal H100 cloud credits. A robust checkpoint-resume mechanism that lets long-running jobs survive cloud-side interruptions is documented in scripts/train_terramind.py.
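The weight-space averaging of SWA snapshots described in the Method paragraph reduces to an element-wise mean over checkpoints. The sketch below shows that idea only; it uses plain dictionaries of flat lists in place of real tensors, and the function name `uniform_soup` and the parameter names are hypothetical rather than taken from the repository.

```python
def uniform_soup(state_dicts):
    """Uniform weight-space average of snapshots with identical parameter names and sizes."""
    if not state_dicts:
        raise ValueError("need at least one snapshot")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("snapshots must share the same parameter names")
    n = len(state_dicts)
    # Average each parameter element by element across all snapshots.
    return {
        k: [sum(vals) / n for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }


# Two toy snapshots standing in for the SWA checkpoints.
snapshots = [
    {"decoder.weight": [1.0, 2.0], "decoder.bias": [0.0]},
    {"decoder.weight": [3.0, 4.0], "decoder.bias": [1.0]},
]
soup = uniform_soup(snapshots)
```

With real checkpoints the same loop would run over tensors, and any normalisation statistics would need separate handling, which this sketch does not attempt.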

Block split

The split is hash-based and year-invariant. Each chip's (latitude, longitude) is binned into a 0.15° × 0.15° block (≈ 17 × 12 km at 43° N), then the block is hashed:

key = f"{salt}|{lat_idx}|{lon_idx}".encode("utf-8")
bucket = int(hashlib.sha1(key).hexdigest()[:8], 16) % 100
split = "train" if bucket < 80 else "val" if bucket < 90 else "test"

with salt = "cryosentinel-blocks-v1". A 0.02° (~ 2.2 km) buffer drops the 4,922 chips that fall within the buffer of a foreign block. The result on the Stage 4b finetune dataset is:

Split	Chips	Share
Train	4,283	76 %
Val	666	12 %
Test	665	12 %

The hash is independent of acquisition year, so the same geographic block stays in the same split across 2017, 2021, 2022, and 2023. There is no path through which a train chip and a test chip can come from the same lake. The implementation is in src/cryosentinel/data/block_split.py and is covered by tests/test_block_split.py.
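A self-contained version of the hashing step makes the year-invariance easy to check. This is a sketch, not a copy of block_split.py: the flooring used to turn coordinates into block indices is an assumption, the helper names are hypothetical, and the 0.02° buffer step is omitted.

```python
import hashlib
import math

SALT = "cryosentinel-blocks-v1"  # salt quoted in the text
BLOCK_DEG = 0.15                 # block size quoted in the text


def block_index(lat, lon, block_deg=BLOCK_DEG):
    # Assumption: blocks are indexed by flooring the coordinate;
    # the repository may bin differently.
    return math.floor(lat / block_deg), math.floor(lon / block_deg)


def assign_split(lat, lon, salt=SALT):
    """Map a chip centre to train/val/test using the 80/10/10 buckets from the text."""
    lat_idx, lon_idx = block_index(lat, lon)
    key = f"{salt}|{lat_idx}|{lon_idx}".encode("utf-8")
    bucket = int(hashlib.sha1(key).hexdigest()[:8], 16) % 100
    return "train" if bucket < 80 else "val" if bucket < 90 else "test"


# The acquisition year never enters the key, so a location keeps its split across years.
split_by_year = {year: assign_split(42.99, 76.71) for year in (2017, 2021, 2022, 2023)}
```

Two chips whose centres fall in the same block always share a split, whatever their acquisition dates.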

The model finds lakes that the training labels miss

While auditing the held-out test set, we noticed that the chips with the lowest IoU were not model failures. They were chips where the Kumar & Vijay (2026) ground-truth polygon undersized or completely omitted a lake that the imagery clearly shows.

Seven test chips drove the underperformance on the Ile Alatau region. They are concentrated at two coordinates near 43° N (42.99° N, 76.71° E and 42.92° N, 76.73° E) across 2021, 2022, and 2023 acquisitions. We computed MNDWI = (B03 − B11) / (B03 + B11) directly from the Sentinel-2 raw bands as an independent water indicator. The MNDWI water fraction was 1.6 to 3.6 times larger than the Kumar polygon for all seven chips. Visual inspection (four-panel renders at docs/LABEL_NOISE_AUDIT.md) shows that one of the two glacial lakes visible in the Sentinel-2 RGB is missing entirely from the Kumar polygon at (42.99° N, 76.71° E) for all three years; the model finds both lakes correctly across all three years. Dropping these seven chips (1.05 % of test) raises the global test IoU from 0.8918 to 0.9082 and the Ile Alatau region's IoU from 0.7664 to 0.9285.
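The MNDWI check is simple enough to sketch in a few lines. The example below is illustrative rather than the audit script: the water threshold of 0.0 is an assumed default, the reflectance values and label are toy data, and the function names are hypothetical.

```python
def mndwi(green, swir):
    """Modified Normalised Difference Water Index for one pixel (B03 green, B11 SWIR)."""
    total = green + swir
    return (green - swir) / total if total else 0.0


def water_fraction(b03, b11, thr=0.0):
    # Assumption: pixels with MNDWI above `thr` count as water; 0.0 is a common
    # default, not necessarily the threshold used in the audit.
    flags = [mndwi(g, s) > thr for g, s in zip(b03, b11)]
    return sum(flags) / len(flags)


# Toy 6-pixel chip: three water-like pixels (green > SWIR) and three rock-like pixels.
b03 = [0.10, 0.12, 0.09, 0.20, 0.22, 0.18]
b11 = [0.02, 0.03, 0.02, 0.30, 0.28, 0.25]
label = [1, 0, 0, 0, 0, 0]  # hypothetical polygon covering a single pixel

index_fraction = water_fraction(b03, b11)  # 3 of 6 pixels
label_fraction = sum(label) / len(label)   # 1 of 6 pixels
ratio = index_fraction / label_fraction
```

A ratio well above 1, as in this toy case, is the kind of signal that flags a chip for visual inspection: the index sees more water than the polygon does.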

This is not a defect. The multi-modal input gives the model three independent physical signals, backscatter (S1), reflectance (S2), and topography (DEM), that can overrule a single noisy label channel. The literature on label-noise robustness (Frénay & Verleysen, 2014; Northcutt et al., 2021) describes this pattern: a model trained on K cleaner sources can learn a representation that survives a small fraction of label noise on the (K + 1)th, noisy source. We treat it as a small contribution rather than a problem to hide. The full audit is in docs/LABEL_NOISE_AUDIT.md.

Engineering provenance

The first finetune attempt (v1) plateaued at validation IoU 0.825. The encoder was effectively frozen because three independent multipliers compounded:

freeze_backbone_layers: 6     →  blocks 0-5 hard-frozen
backbone_lr_mult:       0.05  →  ×0.05 on top of LLRD for all encoder blocks
llrd_decay:             0.85  →  block_6 sees 0.85^18 = 0.054 multiplier

⇒ block_6 effective learning rate = 5e-5 × 0.05 × 0.054 = 1.35e-7
⇒ 3,700 × smaller than the decoder learning rate

Block 6 was seeing 0.27 % of the base learning rate (0.05 × 0.054), which is about 0.027 % of the decoder learning rate. The cumulative-LR plot from metrics.csv confirmed it. We unfroze (freeze=0), raised the backbone multiplier to 0.25, and softened the LLRD decay to 0.9. Block 6 now sees about 3.8 % of the base learning rate (0.25 × 0.9^18 ≈ 0.038). Combined with re-enabling TTA on the validation metric, dropping label smoothing for the binary-imbalanced setting, switching from generalised Dice to flat Dice, fixing the SWA-never-activates bug (early-stopping was killing the run before the SWA window opened), and tightening the spatial block size from 0.25° to 0.15° to reduce val/test variance, validation IoU moved from 0.825 to 0.9557: a 13 percentage-point absolute improvement, all from configuration corrections, none from architectural changes.
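The compounding of the three settings can be reproduced with a few lines of arithmetic. This sketch assumes layer-wise decay of the form decay ** (num_blocks - block) with 24 encoder blocks, which is what the exponent 18 for block 6 implies; the function name is hypothetical and the constants are the ones quoted in this README.

```python
def block_lr_multiplier(block, num_blocks, backbone_lr_mult, llrd_decay):
    """Multiplier applied to the base learning rate for one encoder block.

    Assumes layer-wise LR decay of the form decay ** (num_blocks - block).
    """
    return backbone_lr_mult * llrd_decay ** (num_blocks - block)


BASE_LR = 5e-5     # base rate used in the v1 calculation above
DECODER_LR = 5e-4  # decoder rate quoted in the Method section

v1_mult = block_lr_multiplier(6, 24, backbone_lr_mult=0.05, llrd_decay=0.85)
v2_mult = block_lr_multiplier(6, 24, backbone_lr_mult=0.25, llrd_decay=0.90)

v1_lr = BASE_LR * v1_mult           # about 1.3e-7
decoder_ratio = DECODER_LR / v1_lr  # roughly 3,700
```

Under these assumptions `v1_mult` comes out near 0.0027 and `v2_mult` near 0.0375, a roughly fourteen-fold increase in the rate reaching block 6 before any change to the base rate itself.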

Eleven such fixes are individually documented as inline comments in configs/terramind_v3_finetune_almaty_v2.yaml and tabulated with their effect sizes in docs/METHOD.md §3. We list them because they are the kind of small, easy-to-miss corrections that deserve to be public when one publishes a model — every team eventually runs into one of these.

Dataset

Property	Value
Source labels	Kumar, R. & Vijay, S. (2026), doi:10.1594/PANGAEA.983845: inventory of 31,698 lakes across HMA for the years 2016 and 2022
Imagery	Sentinel-2 L2A (12 bands), Sentinel-1 GRD (VV, VH), Copernicus DEM 30 m
Years (finetune)	2017, 2021, 2022, 2023
Regions (finetune)	Tien Shan (full), Ile Alatau, Zhetysu Alatau
Chips total	5,614 (finetune); 42,237 (full HMA pretrain)
Chip size	224 × 224 px at 10 m/pixel
Pre-filter	We drop raw chips whose MNDWI/Kumar agreement is below IoU = 0.20, which removes ~ 70 % of them. Of the dropped chips, 82 % have agreement below IoU = 0.05, which is the empirical signature of a misregistered or misdated ground-truth polygon
Augmentation	Spectral jitter (`p = 0.4`, `eps_S2 = 0.03`, `eps_S1 = 0.07`, `eps_DEM = 0.02`), multi-scale [0.80, 1.20] (`p = 0.4`), copy-paste (`p = 0.2`, minimum 32 donor water pixels), hard-negative sampler (positive : negative ratio 3 : 1 with replacement)

Per-band normalisation means and standard deviations were computed via Welford's algorithm on the train split and stored in dataset_stats_v3.json. Full description in docs/DATA.md.
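For reference, Welford's algorithm updates a running mean and a running sum of squared deviations one value at a time, so the statistics never require the whole band in memory. The class below is a generic single-band sketch, not the repository's code; the name `RunningStats` is hypothetical, and the choice of population (divide by n) rather than sample standard deviation is an assumption.

```python
import math


class RunningStats:
    """Welford's online algorithm for the mean and variance of one band."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the already-updated mean, which is what keeps the recurrence stable.
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; whether the repository divides by
        # n or n - 1 is not stated.
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

In practice one such accumulator per band would be fed every training pixel, and the resulting mean and standard deviation written out once at the end.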

Limitations

CryoSentinel is research-grade. We list what it is not:

It does not predict GLOF events. It segments water bodies on a single satellite snapshot. Breach prediction is a separate problem and lives in the GLOFcast operational system, which scores hazard from a composite of static factors (lake morphometry, dam geometry, downstream slope, glacier proximity), dynamic factors (lake-area trend, glacier velocity, surface temperature trend), and triggers (heatwaves, precipitation, seismicity). Those scores are validated separately.
It is not validated outside High Mountain Asia. Train and test data are from the Tien Shan, Zhetysu Alatau, and Ile Alatau ranges in Kazakhstan and Kyrgyzstan, with the broader pretrain corpus covering twelve HMA sub-regions. We have not run the model on the Andes, Alps, or Caucasus and we do not know how it transfers.
It is not reliable for lakes smaller than about half a hectare. The minimum lake area we trust is ~0.5 ha (≈ 50 pixels at 10 m). Smaller water bodies (puddles, thermokarst ponds, irrigation features) are not the target class and may be either missed or false-positively segmented depending on context.
Non-glacial water bodies are intentionally suppressed. During Stage 4a we observed that the Tekeli reservoir in Zhetysu Alatau produces high MNDWI but is correctly absent from the Kumar inventory because it is not glacier-fed. The model learned to suppress it. This is the intended behaviour for an early-warning system that should not alert on routine reservoir operations, but it means the model is not a general water segmenter.
No multi-temporal modelling. The model is a single-snapshot segmenter. Year-on-year change detection is supported by running the model independently on per-year chips and differencing the masks, which is what we do operationally; explicit temporal modelling (e.g. LSTM or TimeSformer over the chip time series) is left for future work.
No uncertainty quantification. We do not currently emit a confidence map. A sensible MC-dropout or deep-ensemble layer is straightforward to add; we have not done it for this release.
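The year-on-year differencing mentioned under "No multi-temporal modelling" amounts to subtracting two binary masks and converting pixel counts to area. The sketch below assumes both masks are co-registered on the same 10 m grid; the function names and the toy masks are hypothetical.

```python
PIXEL_AREA_HA = 0.01  # one 10 m x 10 m pixel is 100 m^2, i.e. 0.01 ha


def lake_area_ha(mask):
    """Water area of a binary mask (rows of 0/1) in hectares."""
    return sum(sum(row) for row in mask) * PIXEL_AREA_HA


def mask_change(mask_old, mask_new):
    """Per-pixel difference: +1 where water appeared, -1 where it vanished, 0 otherwise."""
    return [[new - old for old, new in zip(r_old, r_new)]
            for r_old, r_new in zip(mask_old, mask_new)]


# Toy 3 x 3 masks for two consecutive years.
mask_2022 = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
mask_2023 = [[0, 1, 1], [1, 1, 1], [0, 0, 0]]

diff = mask_change(mask_2022, mask_2023)
growth_ha = lake_area_ha(mask_2023) - lake_area_ha(mask_2022)  # two new water pixels
```

Because each year is segmented independently, any change computed this way also carries the segmentation error of both masks, which is one reason explicit temporal modelling is listed as future work.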

docs/LIMITATIONS.md discusses the failure modes we have looked at in more detail, including the four test chips with IoU < 0.05 outside the seven Ile Alatau cases (in tien_shan_full) that we have not yet manually audited.

Reproducibility

Following the spirit of the NeurIPS / Papers with Code reproducibility checklist:

Permissive license (Apache 2.0).
Pinned dependencies (pyproject.toml and Modal image definitions in scripts/train_terramind_modal.py, scripts/eval_stage4b_diagnostics.py, and scripts/predict_modal.py).
Random seeds documented (seed: 42 in every config).
Training code released.
Evaluation code released, including the diagnostics that produced every per-region number above.
Pre-trained weights released on Hugging Face under the same Apache 2.0 license.
Training dataset released on Hugging Face under ODC-By 1.0.
Single-command reproduction of the headline table (scripts/reproduce_benchmarks.sh).
Hardware specification (Modal H100, batch size 8, bf16-mixed, ~ 12 hours total for pretrain + finetune).
Per-chip diagnostics released as Parquet files.
Limitations and failure cases documented above and in docs/LIMITATIONS.md.
Label-noise audit performed and documented in docs/LABEL_NOISE_AUDIT.md.

The one item deferred is multi-seed variance — every reported number is from a single run with seed = 42. v1.0 ships the production-best result; the variance estimate (3 seeds: 17, 42, 1337) is scheduled for v1.1 in June 2026 alongside additional ablations, after the UN Zayed Sustainability Prize 2026 submission is filed. Compute budget is allocated.

How to contribute

Highest-leverage items:

Out-of-domain validation. Run the released checkpoint on the Andes, European Alps, Caucasus, or Patagonia and share per-chip diagnostics. PR adds a row to docs/EXTERNAL_VALIDATION.md. The most useful contribution there is from someone with local field knowledge of one of those ranges.
Reproducibility checks. Re-run scripts/reproduce_benchmarks.sh on your hardware and confirm the headline numbers match to four decimal places. Open an issue if not.
Test-set audit completion. Manually inspect the four tien_shan_full test chips with per-chip IoU < 0.05 (currently unaudited; could be model failures, additional Kumar mislabels, or a mixture). See docs/LIMITATIONS.md § 8.
Temporal modelling extension. A 12-month chip stack with a TimeSformer head over the same TerraMind encoder — this is the v1.2 roadmap. Early experiments and PRs welcomed.

See CONTRIBUTING.md for the workflow.

Compute support

CryoSentinel was built on cloud credits available to a single early-career researcher in Almaty. The v1.0 production results (training, ablations, and the v1 mistake combined) cost roughly $90 of Modal H100 time. v1.1 (multi-seed variance, MC-dropout uncertainty layer, June 2026) and v1.2 (temporal modelling) are budgeted but not abundantly so.

If you represent a GPU cloud provider, an academic-compute program, or a foundation that supports open climate-risk research, even a small grant of H100 / H200 credits would meaningfully accelerate the v1.1 and v1.2 work. Reach out via GitHub issues or the contact link below.

Citation

If CryoSentinel is useful in your research or operational work, please cite:

@software{abdrash2026cryosentinel,
  author    = {Abdrash, Abzal},
  title     = {{CryoSentinel: A Foundation-Model Glacial Lake Segmenter
               for High Mountain Asia}},
  year      = {2026},
  publisher = {Zenodo},
  url       = {https://github.com/abzalabdrash/cryosentinel},
  version   = {v1.0.0},
  doi       = {10.5281/zenodo.20239229}
}

@dataset{abdrash2026cryosentinelglofv3,
  author    = {Abdrash, Abzal},
  title     = {{CryoSentinel-GLOF v3: Multimodal Glacial Lake Chips
               for High Mountain Asia}},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/abzal-glw/cryosentinel-glof-v3},
  doi       = {10.57967/hf/8823}
}

A preprint describing the method, the spatial block split, and the label-noise audit in detail is in preparation and will be posted on EarthArXiv. We will update this section with the EarthArXiv DOI when it is available.

Acknowledgments

IBM Research and ESA Φ-lab for releasing the TerraMind 1.0 weights under a permissive license. Without an open multimodal foundation model the pretrain phase would not have been feasible on the budget available.
Kumar, R. and Vijay, S. for publishing the High Mountain Asia glacial lake inventory on PANGAEA. The 2016 / 2022 dataset is the supervision signal that made this work possible.
Adhikari, P. and Regmi, S. R. (2025) for the Sentinel-1-only baseline against which the multi-modal protocol here is compared. Their "temporal-first" framing of the GLOF early-warning problem is the right one.
Wortsman, M. et al. (2022) for the model soup formulation we use to average SWA snapshots, and Izmailov, P. et al. (2018) for the SWA schedule itself.
Modal Labs and Cerebrium for low-friction access to H100 and L40S compute.
Hugging Face for hosting the model weights, the dataset, and the per-chip diagnostics under a single namespace.
The «Казселезащита» Almaty-oblast mudflow-protection service and the UNESCO GLOFCA programme — the institutional context this work is positioned to complement, not replace. As stated in the affiliation disclosure above, CryoSentinel is not affiliated with either organisation; this acknowledgment recognises their decades of operational work on the same problem.

License

Apache License 2.0 — see LICENSE. The license permits commercial use, modification, distribution, and patent use. Modifications must be marked. The software and weights are provided "AS IS", without warranty of any kind.

The license does not grant any right to use the CryoSentinel name, logo, or identity assets to imply endorsement; see TRADEMARKS.md and docs/RIGHTS_AND_ATTRIBUTION.md.

Contact

Abzal Abdrash · @abzalabdrash · Almaty, Kazakhstan
