STAC catalog: parquet schema fixes, morton indexing, spatial matching, and full season YAML coverage#73

Merged: espg merged 29 commits into main from fix/parquet-links-schema on Apr 22, 2026

Conversation

@espg
Collaborator

@espg espg commented Feb 18, 2026

Summary

This PR grew from a targeted parquet schema fix into a broader update spanning
six areas. The original schema work is the foundation; morton indexing was added
on top to support spatial matching, which itself motivated the crossovers module
and full-coverage YAML refactor.

1. Parquet / STAC-geoparquet schema correctness

  • links column type — when all STAC items have empty links: [], PyArrow
    inferred list<null> (which DuckDB reads as INTEGER[]), and that crashed
    stac-wasm in stac-map. After parsing, the column is now cast to the STAC
    spec type list<struct<href, rel, type>>. Data is unchanged; only the column
    metadata differs. A corresponding bug was also filed and fixed upstream in
    rustac (stac-utils/rustac#959, "stac-wasm: arrowToStacJson panics on
    malformed links column type"), so stac-wasm now handles malformed schemas
    gracefully rather than panicking.
  • thumbnails → thumbnail — asset key rename conforming to STAC
    best practices (#64, "[stac-geoparquet] - enabling client-side visualization").
  • Schema enforcement — the new struct type is now validated in both tests and
    at catalog-generation time (26468fa).

2. Morton indexing for STAC items and collections (#77 / #78)

  • opr:mbox (4 variable-resolution morton cells) on every STAC item.
  • opr:mpolygon (12 variable-resolution morton cells) on every STAC collection.
  • Implemented in src/xopr/stac/morton.py (+ tests); wired into
    src/xopr/stac/catalog.py. These are the index primitives that make the
    spatial-matching work below possible without polygon intersection.

3. Spatial-matching modules

  • src/xopr/bedmap/morton_match.py — matches Bedmap pick points to
    candidate OPR frames via morton prefix containment, plus an along-track
    disambiguator (#69 "Determine overlap between Bedmap and OPR data",
    #72 "Bedmap2 and OPR fuzzy line matching script").
  • src/xopr/crossovers.py — maps xOPR frames to intersecting ICESat-2
    ATL06 granules (#86 "Example notebook: map xopr frames to ATL06 granules
    (xover integration)"):
    • Vendored minimal CMR client (magg pattern: one query, local index).
    • Two matching backends: shapely.STRtree (exact polygon intersection) and
      morton-prefix (inverted-index lookup).
    • subset_frames_by_points(frames_gdf, points) — pre-filter the catalog to
      frames whose opr:mbox covers any user-supplied point.
    • Flexible temporal windowing: date_range=, cycle=N (ICESat-2
      repeat-cycle, magg-compatible), or mode="exact_year" | "all_years".
  • docs/notebooks/xopr_atl06_crossovers.ipynb — end-to-end worked example
    comparing both backends on a real catalog.

4. Full season YAML coverage

  • All known OPR seasons now have catalog-generation YAMLs under
    seasons/{provider}/ (CReSIS, UTIG, AWI, DTU) — 50 YAML files across both
    hemispheres.
  • Existing flat-directory YAMLs refactored into the seasons/{provider}/ layout
    (122acef).
  • seasons/README.md expanded with the catalog-build workflow and source.coop
    upload instructions.

5. Bug fixes

  • Flight-distance badge (scripts/calc_lines.py) — updated the km line
    badge to work with the AWS subdomain changes introduced by the DuckDB version
    update (11035c3).
  • DuckDB query syntax — adapted bedmap query SQL for the DuckDB version bump
    (ffba973).
  • Bedmap map viewer — fixed the bedmap point layer in polar.html so points
    render again in the docs (f0f3c3c).
  • Geometry simplification — fixed a double-simplification bug in
    stac/catalog.py / stac/metadata.py that was causing mbox to be computed
    from degraded geometry (60d71e2).

6. Housekeeping

  • aggregate_parquet_catalog.py removed (superseded by source.coop workflow).
  • upload_stac_catalogs.py rewritten for source.coop; upload_to_gcloud.sh
    removed.
  • upload_bedmap_to_gcloud.sh refactored.
  • CodSpeed CI workflow (.github/workflows/codspeed.yml) + benchmark harness
    (test_bench_cpu.py).
  • Bedmap converter refactor and obsolete integration test removal.
  • bedmachine_comparison.ipynb removed; crossovers.ipynb updated.

Test plan

  • pytest src/xopr/stac/tests/ — morton, catalog, validation, schema enforcement
  • pytest src/xopr/bedmap/tests/ — morton-match, converter, bedmap catalog
  • pytest src/xopr/tests/test_crossovers.py — 28 tests: STRtree, prefix,
    subset, temporal helpers, cross-backend consistency
  • Rebuild a season parquet and verify:
    • DESCRIBE shows links as STRUCT(href VARCHAR, rel VARCHAR, type VARCHAR)[]
    • Asset key is thumbnail (singular)
    • Items have opr:mbox (4 ints), collection has opr:mpolygon (12 ints)
  • Load rebuilt parquet in stac-map — item-detail view no longer crashes
  • Run xopr_atl06_crossovers.ipynb end-to-end against a #78 catalog ("Reprocess STAC catalogs with morton indices (mbox/mpolygon)")
  • Verify calc_lines.py badge generation against source.coop catalogs
  • Verify bedmap points render in polar.html

Contributes to the #77 → #78 → #86 arc.

🤖 Generated with Claude Code

@espg
Collaborator Author

espg commented Feb 18, 2026

This might also be a good time to think about updating our STAC catalog to include preview thumbnails (briefly discussed in #60).

If we did add previews/thumbnails, the easiest option would still be to add the ones that already exist on the OPR / CReSIS data portal. If we do decide to roll our own as part of QAQC for the data, we'll have to host those thumbnails somewhere.

@espg
Collaborator Author

espg commented Feb 18, 2026

nvm-- we actually already encode the thumbnails from OPR in the catalog. Our custom docs line viewer doesn't display them... but that metadata is already there.

@espg espg changed the title Fix parquet links column schema for empty arrays Parquet catalog schema updates Feb 18, 2026
@weiji14

weiji14 commented Feb 20, 2026

Do you know how to set the CRS in the parquet file? The default fallback should be OGC:CRS84 according to the geoparquet specification, but I keep having to manually set the CRS when adding a layer like https://data.source.coop/englacial/xopr/catalog/hemisphere=south/provider=cresis/collection=2013_Antarctica_Basler/stac.parquet into QGIS. Not sure if it's a problem on QGIS not defaulting to OGC:CRS84 for geoparquet files, or if it's something that we need to set here.

@espg
Collaborator Author

espg commented Mar 9, 2026

Plan: Morton-based bedmap-to-frame matching

Building on the morton indices (opr:mbox) that this PR adds to STAC items, here's the plan for using them to match bedmap picks to OPR frames (#69).

Core mechanism: morton prefix containment

Each STAC item's opr:mbox stores 4 morton cells as integers. These integers are prefix strings in disguise — a bedmap point's full-resolution morton index (order 18) "belongs to" an mbox cell if the point's morton string starts with the cell's string:

# mbox cell (from STAC item):  "-6113431"      (coarse prefix)
# bedmap point morton:          "-6113431241314212142"  (full order-18)
# str.startswith("-6113431") → True → candidate match!

This was verified experimentally with mortie — a flight line spanning 2° lat × 5° lon produces mbox prefixes of length 8–10, and point-in-cell containment checks work correctly via string prefix matching.

Algorithm

For a given season (e.g., 2009_Antarctica_DC8NASA_2009_ICEBRIDGE_AIR_BM2):

  1. Load the OPR STAC catalog (parquet) → extract opr:mbox, opr:segment, opr:frame, opr:date per item
  2. Load the bedmap data (parquet) → extract lat/lon point geometry
  3. Convert bedmap points to morton indices: geo2mort(lats, lons, order=18) → array of ints
  4. Convert to strings for prefix matching
  5. Build mbox lookup: {prefix_string → [(segment, date, frame), ...]} from all items × 4 mbox cells
  6. Match: for each bedmap morton string, find all mbox prefixes it starts with → collect candidate (segment, date, frame) tuples
  7. Output: add candidate frame column(s) to the bedmap DataFrame
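
Steps 4 to 6 can be sketched with the standard library alone. This is a simplified illustration, not the module's implementation: the item dictionaries and their values are made up, the morton integers are hard-coded instead of coming from geo2mort, and the function names are hypothetical.

```python
from collections import defaultdict


def build_mbox_lookup(items):
    """Map each mbox prefix string to the (segment, date, frame) tuples carrying it."""
    lookup = defaultdict(list)
    for item in items:
        key = (item["opr:segment"], item["opr:date"], item["opr:frame"])
        for cell in item["opr:mbox"]:
            lookup[str(cell)].append(key)
    return dict(lookup)


def match_points(point_mortons, lookup):
    """For each point, collect every frame whose mbox prefix the point's string starts with."""
    matches = []
    for morton in point_mortons:
        text = str(morton)
        hits = []
        # Brute force: compare the point against every known prefix.
        for prefix, frames in lookup.items():
            if text.startswith(prefix):
                hits.extend(frames)
        matches.append(hits)
    return matches


# Made-up items: two frames that share one coarse cell.
items = [
    {"opr:segment": 1, "opr:date": "20091016", "opr:frame": 1,
     "opr:mbox": [-6113431, -6113432, -6113433, -6113434]},
    {"opr:segment": 1, "opr:date": "20091016", "opr:frame": 2,
     "opr:mbox": [-6113431, -6113441, -6113442, -6113443]},
]
points = [-6113431241314212142, -5224111111111111111]

candidates = match_points(points, build_mbox_lookup(items))
```

The first point falls in a cell shared by both frames, so it gets two candidates; the second matches no prefix and gets an empty list.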

Why this approach

  • No geometric distance calculations — pure string prefix matching, leveraging the spatial index already built into our STAC extension
  • Vectorizable — numpy/pandas string operations
  • Handles ambiguity naturally — a point near a frame boundary matches multiple items → multiple candidates (correct behavior, post-processing can disambiguate later)
  • Fast — O(n_bedmap × n_items × 4) string comparisons with short strings. Most seasons have ~1000 items × 4 cells = ~4000 prefixes, and bedmap files have ~100k points

Proposed implementation

We'll start with a notebook targeting a specific Tier 1 season (e.g., 2009_Antarctica_DC8NASA_2009_ICEBRIDGE_AIR_BM2) to validate the approach end-to-end. Once verified, we'll either:

  • Add a function to src/xopr/bedmap/ if the computation is fast enough to run on-the-fly, or
  • Precompute and add columns to the hosted bedmap parquet files

Code structure (eventual)

# src/xopr/bedmap/morton_match.py

def match_bedmap_to_frames(stac_catalog_path, bedmap_data_path, order=18):
    """Match bedmap picks to candidate OPR frames via morton prefix containment.
    
    Returns the bedmap DataFrame with added column(s) for candidate
    (segment, date, frame) matches.
    """

def _build_mbox_lookup(stac_gdf):
    """Build dict mapping mbox prefix strings to (segment, date, frame) tuples."""

def _morton_prefix_match(bedmap_mortons, mbox_lookup):
    """For each bedmap morton, find all mbox prefixes it falls within."""

Potential optimization

If brute-force prefix matching is slow, we can sort prefixes and use binary search, or build a trie. But the naive approach should be sufficient for current data volumes.
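
One way the trie option could look, as a sketch under stated assumptions: a plain nested-dict trie keyed by character, with a "$" entry marking where a stored prefix ends. The names `build_trie` and `trie_matches` and the frame labels are hypothetical.

```python
def build_trie(prefix_lookup):
    """Build a nested-dict trie from {prefix_string: [payload, ...]}."""
    root = {}
    for prefix, payload in prefix_lookup.items():
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        # "$" cannot collide with morton characters (digits and "-").
        node.setdefault("$", []).extend(payload)
    return root


def trie_matches(trie, text):
    """Return payloads of every stored prefix that `text` starts with."""
    hits = []
    node = trie
    for ch in text:
        node = node.get(ch)
        if node is None:
            break  # no stored prefix continues along this path
        hits.extend(node.get("$", []))
    return hits


trie = build_trie({
    "-6113431": ["frame_a"],
    "-61134312": ["frame_b"],
    "-6113432": ["frame_c"],
})
found = trie_matches(trie, "-6113431241314212142")
```

Each lookup walks at most one character per digit of the point's morton string, so the cost no longer grows with the number of stored prefixes.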

Related

🤖 Generated with Claude Code

@espg
Collaborator Author

espg commented Mar 10, 2026

> Do you know how to set the CRS in the parquet file? The default fallback should be OGC:CRS84 according to the geoparquet specification, but I keep having to manually set the CRS when adding a layer like https://data.source.coop/englacial/xopr/catalog/hemisphere=south/provider=cresis/collection=2013_Antarctica_Basler/stac.parquet into QGIS. Not sure if it's a problem on QGIS not defaulting to OGC:CRS84 for geoparquet files, or if it's something that we need to set here.

@weiji14 I'm not sure on this-- what is it defaulting to if it's not WGS84 / OGC:CRS84? I'm happy to reprocess these if they aren't encoded correctly... if we can get #69 figured out and implemented, it would be a great time to reprocess those files anyway so that we can add frame / segment id columns.

@weiji14

weiji14 commented Mar 10, 2026

> what is it defaulting to if it's not WGS84 / OGC:CRS84? I'm happy to reprocess these if they aren't encoded correctly... if we can get #69 figured out and implemented, it would be a great time to reprocess those files anyway so that we can add frame / segment id columns.

QGIS isn't defaulting to anything; it just shows a question mark next to the layer, which indicates a missing CRS. Again, not sure if it's an issue on our side or in QGIS.

Maybe good if you have some sample reprocessed files uploaded somewhere we can check?

@espg
Collaborator Author

espg commented Mar 10, 2026

@thomasteisberg fyi, commit ffba973 fixes the test failures that we're having on main right now. These are from an API change in the newest version of duckdb and likely are causing failures for you as well on #79 (feel free to cherry pick if needed)

Comment thread scripts/build_catalog.py
Contributor

⚠️ [ruff] <I001> reported by reviewdog 🐶
Import block is un-sorted or un-formatted

from xopr.stac.config import load_config, save_config, validate_config
from xopr.stac.metadata import discover_campaigns, discover_flight_lines, collect_uniform_metadata
from xopr.stac.catalog import create_items_from_flight_data, create_collection, export_collection_to_parquet
from xopr.stac.geometry import build_collection_extent_and_geometry

@espg
Collaborator Author

espg commented Apr 20, 2026

Notes / flags from the seasons/*.yml generation pass

Context: generated 41 new season yml from the deployed stac.parquet files on source.coop, fixed the two existing buggy ones (2019_Antarctica_GV, 2016_Antarctica_DC8), and restructured seasons/ into cresis/ utig/ awi/ dtu/ subfolders. Flags worth a second look before merge:

  1. 2019_Antarctica_GV bandwidth — parquet reports opr:bandwidth=18e6, opr:frequency=245e6 → f0=236 MHz, f1=254 MHz. rds_readme.pdf just says "MCoRDS 3 on GV" with no explicit bandwidth; the 245 MHz center is unusual vs. the 195 MHz used on the P3/DC8 variants of MCoRDS 3. Worth sanity-checking against a raw .mat.

  2. Multi-radar seasons use radar.override: false (let pipeline auto-discover from .mat):

    • cresis/2015_Greenland_C130.yml — two configs: 50 MHz @ 205 MHz and 270 MHz @ 315 MHz
    • cresis/2016_Greenland_P3.yml — 5 distinct configs
    • cresis/2017_Antarctica_Basler.yml — MCoRDS (30 MHz @ 195 MHz) + snow/accum (300 MHz @ 300 MHz)
    • dtu/2016_Greenland_TOdtu.yml — 4 configs (~8 MHz @ ~31 MHz plus variants)

    These can't reduce to a single f0/f1 for provenance; leaving as override: false matches intent.

  3. UTIG BaslerMKB (utig/2022_Antarctica_BaslerMKB.yml, utig/2023_Antarctica_BaslerMKB.yml) — used the parquet sci:citation verbatim (the COLDEX grant line: "This work was supported by the Center for Oldest Ice Exploration, an NSF Science and Technology Center (NSF 2019719) and the G. Unger Vetlesen Foundation."). No DOI — one should probably be assigned/added.

  4. cresis/2016_Greenland_G1XB.yml — parquet sci:citation is literally just "NSF OPP-0424589" (grant number, no full citation). Replaced with the standard CReSIS/MCoRDS citation template. Might want a proper G1XB-specific citation.

  5. UTIG 2010–2012 BaslerJKB — DOI 10.5067/0I7PFBVQOGO5 was provided but no accompanying citation text. These catalogs aren't deployed yet so the yml live under utig/pending/ (not staged). sci.citation / sci.doi need to be filled in before processing.

  6. Pending drafts (seasons/utig/pending/, not git-tracked) for the six UTIG seasons listed in #78 ("Reprocess STAC catalogs with morton indices (mbox/mpolygon)") that don't have catalogs yet: 2009, 2010, 2011, 2012, 2015 BaslerJKB; 2014 BaslerMKB. These use radar.override: false as placeholders — fill in f0/f1 when each is processed and move the yml into utig/.
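
The f0/f1 figures in flag 1 are consistent with treating f0 and f1 as the band edges, center frequency minus and plus half the bandwidth. A small sketch of that arithmetic, where the band-edge interpretation is an assumption inferred from the numbers above and `band_edges_mhz` is a hypothetical helper:

```python
def band_edges_mhz(center_hz, bandwidth_hz):
    """Return (f0, f1) in MHz, assuming f0/f1 are the band edges around the center."""
    half = bandwidth_hz / 2
    return (center_hz - half) / 1e6, (center_hz + half) / 1e6


# 2019_Antarctica_GV: opr:frequency=245e6, opr:bandwidth=18e6
gv_edges = band_edges_mhz(245e6, 18e6)

# First 2015_Greenland_C130 config: 50 MHz @ 205 MHz
c130_edges = band_edges_mhz(205e6, 50e6)
```

The same calculation shows why multi-radar seasons cannot collapse to one pair: each configuration yields its own edges.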

Provider / hemisphere totals match source.coop exactly:

  • cresis: 43 (9 moved + 34 new)
  • utig: 6 (1 moved + 5 new)
  • awi: 1, dtu: 1

@espg espg changed the title Parquet catalog schema updates STAC catalog: parquet schema fixes, morton indexing, spatial matching, and full season YAML coverage Apr 20, 2026
@espg
Collaborator Author

espg commented Apr 20, 2026

CSARP product coverage audit

Cross-checked each seasons/**/*.yml's primary_product + extra_products against the actual CSARP_* folder listing at https://data.cresis.ku.edu/data/rds/{season}/. The goal is to make sure our STAC catalogs declare every data product that's actually available on the CReSIS server (and don't declare anything that isn't).

What was fixed in this pass

  • 11 adds: CSARP_qlook / CSARP_mvdr added where they're on the server but were missing from the yml (2009_Antarctica_TO, 2011 Antarctica_TO + Greenland_P3, 2012–2014 Greenland_P3, 2017 Antarctica_P3 + Greenland_P3, 2018 Antarctica_DC8 + Greenland_P3, 2019_Antarctica_GV)
  • 1 removal: CSARP_qlook dropped from 2013_Antarctica_Basler (not on server — server only has standard, mvdr, music, layer/layerData)
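
The cross-check amounts to a two-way set difference per season. A minimal sketch, assuming the declared products and the server folder names are already available as lists; `audit_products` is a hypothetical helper and the sample lists are illustrative, not the actual yml or server contents:

```python
def audit_products(declared, on_server):
    """Compare products declared in a season yml with CSARP_* folders on the server."""
    declared, on_server = set(declared), set(on_server)
    return {
        "add": sorted(on_server - declared),     # on the server, missing from the yml
        "remove": sorted(declared - on_server),  # in the yml, absent from the server
    }


# Illustrative inputs only.
declared = ["CSARP_standard", "CSARP_qlook"]
on_server = ["CSARP_standard", "CSARP_mvdr", "CSARP_music", "CSARP_layer"]

report = audit_products(declared, on_server)
```

Folders that are not per-frame assets would still show up under "add", so the output is a list to review rather than a change to apply blindly.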

What remains (not yet addressed)

These are real folders on the CReSIS server that our yml doesn't declare. I left them untouched because some may not be per-frame STAC assets (post-processing outputs, season-level derivatives, thumbnail variants):

| yml | Missing from yml |
| --- | --- |
| cresis/2011_Greenland_P3.yml | CSARP_csarp-combined |
| cresis/2012_Greenland_P3.yml | CSARP_csarp-combined |
| cresis/2014_Greenland_P3.yml | CSARP_DEM, CSARP_movies, CSARP_movies_HQ, CSARP_music3D, CSARP_surfData |
| cresis/2017_Greenland_P3.yml | CSARP_post |
| cresis/2024_Antarctica_GroundGHOST2.yml | CSARP_DEM, CSARP_small_jpg, CSARP_small_mat |
| utig/2022_Antarctica_BaslerMKB.yml | CSARP_small_jpg, CSARP_small_mat |
| utig/2023_Antarctica_BaslerMKB.yml | CSARP_small_jpg, CSARP_small_mat |

Notes on each:

  • CSARP_csarp-combined — combined-aperture product, per-frame. Probably belongs in the yml.
  • CSARP_DEM — digital elevation model. May or may not be per-frame — worth checking.
  • CSARP_movies / CSARP_movies_HQ — animation/derivative renders, possibly season-level rather than per-frame.
  • CSARP_music3D — MUSIC 3D beamforming output, per-frame.
  • CSARP_surfData — ice surface detection, per-frame.
  • CSARP_post — post-processing folder that contains pdf/ + kml_good/ subfolders, not per-frame data. Probably skip.
  • CSARP_small_jpg / CSARP_small_mat — look like compressed/thumbnail variants of the standard output, per-frame.

Layer-variant special case

Three seasons (awi/2016_Greenland_Polar6, cresis/2017_Antarctica_Basler, dtu/2016_Greenland_TOdtu) have a CSARP_la/ folder on the server that contains both layer_*.mat and Data_*.mat files — a layer-variant in the same class as CSARP_layer / CSARP_layerData. Flagged under the existing layer-variant rule, not included in the mismatch list.

Summary

  • 12 seasons updated ✅
  • 7 seasons still flagged for your judgment (list above)
  • 38 seasons match the server exactly

@espg espg requested a review from thomasteisberg April 20, 2026 19:20
@espg
Collaborator Author

espg commented Apr 20, 2026

@thomasteisberg see the above comment and let me know which (if any) should get added as 'extra products' for those seasons. Also, have a look at the yml's and let me know if you need anything changed (I assume that we'll want to pay special attention to the utig ones).

Once you review this, I'll bulk rerun the catalog generation and upload-- this should be good to merge after that!

@espg
Collaborator Author

espg commented Apr 20, 2026

As a side note, our single DTU and AWI yml's almost certainly need to be updated for the citations and doi information at a minimum...

Member

@thomasteisberg thomasteisberg left a comment

Looks good to me from a first pass! Thanks for doing this -- looks like a mountain of work.

Member

This is a UAV test season. I don't think there's good data here, but I'm curious to see what it looks like, so let's include it!

@espg espg added preview-docs run-benchmarks Trigger CodSpeed benchmarks labels Apr 22, 2026
@espg
Collaborator Author

espg commented Apr 22, 2026

/preview

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

📚 Documentation Preview Ready!

🌐 Live Preview

URL: https://xopr-pr-73.surge.sh/xopr/

Preview updates automatically with new commits while the preview-docs label is present.

📦 Download Artifact

You can also download the docs for local viewing.

Commit: ac27285


To trigger a preview, add the preview-docs label or use the /preview command.

@espg espg merged commit 57d2d1c into main Apr 22, 2026
5 checks passed
@espg espg deleted the fix/parquet-links-schema branch April 22, 2026 02:12