STAC catalog: parquet schema fixes, morton indexing, spatial matching, and full season YAML coverage #73
|
This might also be a good time to think about updating our STAC catalog to include preview thumbnails (briefly discussed in #60). If we did add previews/thumbnails, the easiest option would still be to add the ones that already exist on the OPR / CReSIS data portal. If we do decide to roll our own as part of QAQC for the data, we'll have to host those thumbnails somewhere. |
|
nvm-- we actually already encode the thumbnails from OPR in the catalog. Our custom docs line viewer doesn't display them... but that metadata is already there. |
|
Do you know how to set the CRS in the parquet file? The default fallback should be OGC:CRS84 according to the geoparquet specification, but I keep having to manually set the CRS when adding a layer like https://data.source.coop/englacial/xopr/catalog/hemisphere=south/provider=cresis/collection=2013_Antarctica_Basler/stac.parquet into QGIS. Not sure if it's a problem on QGIS not defaulting to OGC:CRS84 for geoparquet files, or if it's something that we need to set here. |
Plan: Morton-based bedmap-to-frame matching
Building on the morton indices (
Core mechanism: morton prefix containment
Each STAC item's
# mbox cell (from STAC item): "-6113431" (coarse prefix)
# bedmap point morton: "-6113431241314212142" (full order-18)
# str.startswith("-6113431") → True → candidate match!
This was verified experimentally with
Algorithm
For a given season (e.g.,
Why this approach
Proposed implementation
We'll start with a notebook targeting a specific Tier 1 season (e.g.,
Code structure (eventual)
# src/xopr/bedmap/morton_match.py
def match_bedmap_to_frames(stac_catalog_path, bedmap_data_path, order=18):
    """Match bedmap picks to candidate OPR frames via morton prefix containment.

    Returns the bedmap DataFrame with added column(s) for candidate
    (segment, date, frame) matches.
    """

def _build_mbox_lookup(stac_gdf):
    """Build dict mapping mbox prefix strings to (segment, date, frame) tuples."""

def _morton_prefix_match(bedmap_mortons, mbox_lookup):
    """For each bedmap morton, find all mbox prefixes it falls within."""

Potential optimization
If brute-force prefix matching is slow, we can sort prefixes and use binary search, or build a trie. But the naive approach should be sufficient for current data volumes.
Related
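The sorted-prefix idea under "Potential optimization" could look roughly like the sketch below. It is an illustration only: `find_containing_prefixes` and the extra prefix values are hypothetical, and it assumes mbox cells and bedmap mortons are compared as strings, as in the `startswith` example earlier in this comment. Instead of testing every mbox prefix against a point, it tests each leading substring of the point's morton against a sorted list with binary search.

```python
from bisect import bisect_left


def find_containing_prefixes(morton, sorted_prefixes):
    """Return every entry of sorted_prefixes that is a prefix of morton.

    sorted_prefixes must be sorted lexicographically (plain sorted()).
    Cost is O(len(morton) * log(len(sorted_prefixes))) per point.
    """
    hits = []
    for k in range(1, len(morton) + 1):
        candidate = morton[:k]
        i = bisect_left(sorted_prefixes, candidate)
        if i < len(sorted_prefixes) and sorted_prefixes[i] == candidate:
            hits.append(candidate)
    return hits


# "-6113431" and the point morton come from the example above;
# the other prefixes are made up for illustration.
prefixes = sorted(["-6113431", "-61134", "-6113432", "-5221"])
point = "-6113431241314212142"
matches = find_containing_prefixes(point, prefixes)

# Same answer as the brute-force check, which scans every prefix.
naive = [p for p in prefixes if point.startswith(p)]
```

A trie would give the same result with a single walk down the string; the bisect version is shown because it needs nothing beyond a sorted list.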
🤖 Generated with Claude Code |
@weiji14 I'm not sure on this-- what is it defaulting to if it's not WGS84 / OGC:CRS84? I'm happy to reprocess these if they aren't encoded correctly... if we can get #69 figured out and implemented, it would be a great time to reprocess those files anyway so that we can add frame / segment id columns. |
QGIS isn't defaulting to anything, there's just a question mark next to the layer that indicates a missing CRS. Again, not sure if it's an issue on our side or QGIS. Maybe good if you have some sample reprocessed files uploaded somewhere we can check? |
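One way to narrow down whether the file or QGIS is at fault is to look at the `geo` key in the parquet file metadata. Under the GeoParquet spec, a column with no `crs` key means readers should assume OGC:CRS84, while an explicit `crs: null` means the CRS is unknown. The sketch below only classifies the metadata JSON; `describe_geoparquet_crs` is a hypothetical helper, and the sample strings are made up rather than read from our catalog.

```python
import json


def describe_geoparquet_crs(geo_metadata_json, column=None):
    """Classify how the CRS is declared in a GeoParquet 'geo' metadata blob."""
    geo = json.loads(geo_metadata_json)
    column = column or geo["primary_column"]
    col_meta = geo["columns"][column]
    if "crs" not in col_meta:
        return "absent"    # spec default applies: OGC:CRS84
    if col_meta["crs"] is None:
        return "null"      # CRS explicitly declared unknown
    return "explicit"      # a PROJJSON object is present


# Made-up examples of the three cases. With pyarrow installed, the real blob
# can be read with pyarrow.parquet.read_schema(path).metadata[b"geo"].
no_crs = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}}}'
null_crs = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": null}}}'
with_crs = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": {"type": "GeographicCRS"}}}}'

results = [describe_geoparquet_crs(s) for s in (no_crs, null_crs, with_crs)]
```

If our files come back as "absent", the file is valid per the spec and the question mark is more likely a reader-side default; "null" would point at something to fix on our side.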
|
@thomasteisberg fyi, commit ffba973 fixes the test failures that we're having on main right now. These are from an API change in the newest version of duckdb and likely are causing failures for you as well on #79 (feel free to cherry pick if needed) |
Notes / flags from the
|
CSARP product coverage audit
Cross-checked each
What was fixed in this pass
What remains (not yet addressed)
These are real folders on the CReSIS server that our yml doesn't declare. I left them untouched because some may not be per-frame STAC assets (post-processing outputs, season-level derivatives, thumbnail variants):
Notes on each:
Layer-variant special case
Three seasons (
Summary
|
|
@thomasteisberg see the above comment and let me know which (if any) should get added as 'extra products' for those seasons. Also, have a look at the yml's and let me know if you need anything changed (I assume that we'll want to pay special attention to the utig ones). Once you review this, I'll bulk rerun the catalog generation and upload-- this should be good to merge after that! |
|
As a side note, our single DTU and AWI yml's almost certainly need to be updated for the citations and doi information at a minimum... |
thomasteisberg left a comment
Looks good to me from a first pass! Thanks for doing this -- looks like a mountain of work.
This is a UAV test season. I don't think there's good data here, but I'm curious to see what it looks like, so let's include it!
|
/preview |
📚 Documentation Preview Ready!
🌐 Live Preview
URL: https://xopr-pr-73.surge.sh/xopr/
Preview updates automatically with new commits while the
📦 Download Artifact
You can also download the docs for local viewing.
Commit:
To trigger a preview, add the |
Summary
This PR grew from a targeted parquet schema fix into a broader update spanning
six areas. The original schema work is the foundation; morton indexing was added
on top to support spatial matching, which itself motivated the crossovers module
and full-coverage YAML refactor.
1. Parquet / STAC-geoparquet schema correctness
- `links` column type: when all STAC items have empty `links: []`, PyArrow inferred `list<null>` (DuckDB reads it as `INTEGER[]`), which crashed `stac-wasm` in stac-map. After parsing, the column is cast to the STAC spec type `list<struct<href, rel, type>>`. Data is unchanged; only the column metadata differs. A corresponding bug was also filed and fixed upstream in rustac (stac-wasm: arrowToStacJson panics on malformed `links` column type, stac-utils/rustac#959), so `stac-wasm` now handles malformed schemas more gracefully rather than panicking.
- `thumbnails` → `thumbnail`: asset key rename conforming to STAC best practices ([stac-geoparquet] - enabling client-side visualization #64).
- at catalog-generation time (26468fa).
2. Morton indexing for STAC items and collections (#77 / #78)
- `opr:mbox` (4 variable-resolution morton cells) on every STAC item.
- `opr:mpolygon` (12 variable-resolution morton cells) on every STAC collection.
- `src/xopr/stac/morton.py` (+ tests); wired into `src/xopr/stac/catalog.py`. These are the index primitives that make the spatial-matching work below possible without polygon intersection.
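As background on what "variable-resolution" means here: in a hierarchical morton encoding, a shorter string names a coarser cell that contains every longer string sharing that prefix. The sketch below shows that property only. It is not the algorithm in `morton.py` (how the 4 mbox cells are chosen is not described in this PR), `covering_cell` is a hypothetical helper, and the morton values are made up.

```python
import os


def covering_cell(mortons):
    """Return the finest single cell (longest shared prefix) containing all points.

    An empty or sign-only result means the points share no useful cell.
    """
    # os.path.commonprefix compares character by character, which is
    # exactly the prefix relation needed here despite the module name.
    return os.path.commonprefix(list(mortons))


# Made-up full-resolution mortons for three points along one frame.
points = ["-6113431241", "-6113431332", "-6113434111"]
cell = covering_cell(points)

# Every point falls inside the covering cell by construction.
all_inside = all(p.startswith(cell) for p in points)
```

A single covering cell gets coarse quickly for long flight lines, which is presumably why items carry several cells rather than one.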
3. Spatial-matching modules
- `src/xopr/bedmap/morton_match.py`: matches Bedmap pick points to candidate OPR frames via morton prefix containment, plus an along-track disambiguator (Determine overlap between Bedmap and OPR data #69, Bedmap2 and OPR fuzzy line matching script #72).
- `src/xopr/crossovers.py`: maps xOPR frames to intersecting ICESat-2 ATL06 granules (Example notebook: map xopr frames to ATL06 granules (xover integration) #86):
  - `shapely.STRtree` (exact polygon intersection) and morton-prefix (inverted-index lookup).
  - `subset_frames_by_points(frames_gdf, points)`: pre-filter the catalog to frames whose `opr:mbox` covers any user-supplied point.
  - `date_range=`, `cycle=N` (ICESat-2 repeat-cycle, magg-compatible), or `mode="exact_year" | "all_years"`.
- `docs/notebooks/xopr_atl06_crossovers.ipynb`: end-to-end worked example comparing both backends on a real catalog.
4. Full season YAML coverage
- `seasons/{provider}/` (CReSIS, UTIG, AWI, DTU): 50 YAML files across both hemispheres.
- `seasons/{provider}/` layout (122acef).
- `seasons/README.md` expanded with the catalog-build workflow and source.coop upload instructions.
5. Bug fixes
- `scripts/calc_lines.py`: updated the km line badge to work with the AWS subdomain changes introduced by the DuckDB version update (11035c3).
- (ffba973).
- `polar.html` so points render again in the docs (f0f3c3c).
- `stac/catalog.py` / `stac/metadata.py` that was causing mbox to be computed from degraded geometry (60d71e2).
6. Housekeeping
- `aggregate_parquet_catalog.py` removed (superseded by source.coop workflow).
- `upload_stac_catalogs.py` rewritten for source.coop; `upload_to_gcloud.sh` removed.
- `upload_bedmap_to_gcloud.sh` refactored.
- `.github/workflows/codspeed.yml` + benchmark harness (`test_bench_cpu.py`).
- `bedmachine_comparison.ipynb` removed; `crossovers.ipynb` updated.
Test plan
- `pytest src/xopr/stac/tests/`: morton, catalog, validation, schema enforcement
- `pytest src/xopr/bedmap/tests/`: morton-match, converter, bedmap catalog
- `pytest src/xopr/tests/test_crossovers.py`: 28 tests covering STRtree, prefix, subset, temporal helpers, cross-backend consistency
- `DESCRIBE` shows `links` as `STRUCT(href VARCHAR, rel VARCHAR, type VARCHAR)[]`
- `thumbnail` (singular)
- `opr:mbox` (4 ints), collection has `opr:mpolygon` (12 ints)
- `xopr_atl06_crossovers.ipynb` end-to-end against a Reprocess STAC catalogs with morton indices (mbox/mpolygon) #78 catalog
- `calc_lines.py` badge generation against source.coop catalogs
- `polar.html`
Contributes to the #77 → #78 → #86 arc.
🤖 Generated with Claude Code