Skip to content

Bedmap2 and OPR fuzzy line matching script#72

Draft
weiji14 wants to merge 22 commits into
englacial:mainfrom
weiji14:bedmap_overlap_opr
Draft

Bedmap2 and OPR fuzzy line matching script#72
weiji14 wants to merge 22 commits into
englacial:mainfrom
weiji14:bedmap_overlap_opr

Conversation

@weiji14
Copy link
Copy Markdown

@weiji14 weiji14 commented Feb 10, 2026

What I am changing

  • Find closely matching Bedmap2 and OPR radar lines on a segment by segment basis

How I did it

TODO:

  • Initial implementation
  • Use non-simplified Bedmap linestrings for more exact matching
  • Set threshold to determine whether a Bedmap linestring matches with an OPR linestring
  • Report Bedmap IDs which overlap or are duplicates of the OPR lines
  • Create a utility function to skip reading Bedmap IDs that are known to be duplicates?

How you can test it

  • See scripts/bedmap_overlap_opr.py

Related Issues

#69

Fun fact: I used this hausdorff distance metric in a previous GIS job (10 years ago) for matching road segments when new data would come in from city councils, and we need to find where the new roads were compared to the old ones 😃 superseded

Find closely matching Bedmap2 and OPR radar lines on a segment by segment basis. Uses hausdorff distance as the fuzzy matching metric. Intended outputs are the Bedmap IDs that have a low hausdorff distance (based on some threshold) to the OPR line segments, which indicate a close enough fuzzy match.
Comment thread scripts/bedmap_overlap_opr.py Outdated
Comment on lines +103 to +104
# Report Bedmap IDs and their hausdorff distance to nearest OPR line segment
print(gdf_bedmap_[["id", "hausdorff_dist"]].sort_values(by="hausdorff_dist"))
Copy link
Copy Markdown
Author

@weiji14 weiji14 Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example output from year 2002, sorted by ascending hausdorff distance:

                      id  hausdorff_dist
52  Data_20021126_01_002    3.026301e+02
50  Data_20021126_01_001    1.088522e+03
37  Data_20021210_01_003    1.501179e+04
36  Data_20021210_01_002    1.588442e+04
48  Data_20021210_01_014    2.650424e+04
..                   ...             ...
4   Data_20021212_01_005    1.499097e+06
54  Data_20021126_01_004    1.619117e+06
15  Data_20021206_01_001    1.772265e+06
44  Data_20021210_01_010    2.720349e+06
46  Data_20021210_01_012    2.828567e+06

[64 rows x 2 columns]

We'll need to set a threshold somewhere to determine whether there is a overlap/match or not. Best if the bedmap2 linestrings are not simplified to enable setting a low threshold (<100km or so?).

Accidentally got confused when naming the bedmap catalog opr and vice versa, so renaming them to be correct.
Convenience function to read parquet files from remote object storage using obstore and geopandas.
Replace the previous hausdorff distance based fuzzy matching algorithm with a more exact method that relies on access to the dense BedMAP XY points. The algorithm works by looping through each OPR line segment, and finds the corresponding series of points in the BedMAP database using some distance tolerance. There are quite a few hardcoded heuristics/assumptions, so need to check more thoroughly on all years.
@weiji14 weiji14 changed the title Bedmap2 and OPR fuzzy line matching script using hausdorff distance Bedmap2 and OPR fuzzy line matching script Feb 18, 2026
Refine algorithm to work on years besides 2002 that are less clean. Need to sort the BedMap points by timestamp besides the OPR lines now, and we can't assume trajectory_id is unique so just use the row index. Added an extra check to skip trying a different tolerance when most points are already too far (>200m) away from the line segment. Also skipping cases where there is only one point matching, should be at least 2 points to form a line.
Do a fast check of the dense BedMap points against sparse OPR points first, and if they match great! If not, download the dense OPR points (in .mat format) from CReSIS and redo the distance-based matching.
Helps to label some cases where flight lines are parallel and quite close to each other spatially, and happens in close enough sequence ids.
Running backwards in time from 2019 to 2002 now, starting with the BedMap3 archive.
Replacing the slow exact distance match (which requires pulling in the CReSIS .mat files) with a temporal match instead that is a bit faster. Still missing lots of cases where initial distance match didn't work (because sparse OPR lines are too far from the dense BedMap points). Partially reverts 6c4e57e
weiji14 added 4 commits March 19, 2026 15:42
Use row-based delta in terms of distance or time to refine head and tail indexes for matching OPR segments to BedMAP points. Specifically, using pd.Series.pct_change for spatial, and pd.Series.diff for temporal large diff finding between subsequent rows. Still need to fix some off-by-one errors though.
Somewhat reverting 7dd38a9, because temporal method is not so good for the super tricky cases where flightlines may be a day or so apart. Now using h5py instead of scipy.io to read the .mat files. Changed some of the tolerance values and variable names to fit recent code changes.
Catch edge case when df_dist_delta > 200 returns no results, causing an IndexError in the head/tail shift logic. Let things jump straight to the slow dense point checking fallback.
@weiji14 weiji14 force-pushed the bedmap_overlap_opr branch from 0c24633 to de5a417 Compare March 27, 2026 04:32
Mostly needed for the 2016 campaign where some segments are trying to match with 2 million + points, making the distance calculations super slow. Using a +/-2 day temporal filter as in 7dd38a9 but not applying the inflexion point method from 24aa9c3.
Manual match-case to match OPR collections to the relevant BedMAP campaign.
Comment thread scripts/bedmap_overlap_opr.py Outdated
Comment on lines +54 to +58
with tempfile.TemporaryDirectory() as tmpdir:
mat_fpath = os.path.basename(p := url) # e.g. Data_20101026_01_001.mat
file_name = os.path.join(tmpdir, mat_fpath)
urllib.request.urlretrieve(url=p, filename=file_name) # download to tempfile
dat = h5py.File(name=file_name)
Copy link
Copy Markdown
Author

@weiji14 weiji14 Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hitting into some OSError: Unable to synchronously open file (file signature not found) errors when trying to open some of the *.mat files using h5py (e.g. https://data.cresis.ku.edu/data/rds/2013_Antarctica_Basler/CSARP_standard/20131219_02/Data_20131219_02_005.mat). Might be some corrupted files on the CReSIS servers?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so that file is a MATLAB 5.0 MAT-file, which h5py can't read because it is not a HDF5 file (Only MATLAB 7.3 and above are HDF5 according to https://www.mathworks.com/help/matlab/import_export/mat-file-versions.html). Need to use scipy.io.loadmat instead as a fallback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c12268

weiji14 added 2 commits April 21, 2026 12:20
Older CReSIS files (<=2014?) were written as MATLAB 5, while newer ones are MATLAB 7, so need to use either scipy.io.loadmat or h5py.File to read the arrays.
Combine 6c4e57e and f6bf08f.
The 2009_Antarctica_TO_Gambit campaign over the Gamburtsev Mountains doesn't overlap with the BedMap3 lines, so removing. 2009_Antartica_TO over Thwaites and Pine Island Glacier overlaps with both CRESIS_2009_AntarcticaTO_AIR_BM3 & CRESIS_2009_Thwaites_AIR_BM3 though.
@espg
Copy link
Copy Markdown
Collaborator

espg commented Apr 22, 2026

FYI, all the xopr stac catalogs have the mbox polygon coverage now (if that's helpful)-- also, we've merged #73 and updated to 0.5.0 >.>

Fix `SyntaxError: 'await' outside function` by properly wrapping things in asyncio.run. Also added some extra print statements to indicate main sections of the script.
@weiji14 weiji14 force-pushed the bedmap_overlap_opr branch from 005df6b to 7511e88 Compare April 27, 2026 08:55
weiji14 added 3 commits April 27, 2026 21:16
Timestamps are only valid for BedMap3 and two BedMap2 seasons (2011/2012). The other BedMap2 seasons (2002, 2004, 2009, 2010) use interpolated timestamps so can't use the temporal filter reliably.
Tsutaki et al. 2022, https://doi.org/10.5194/tc-16-2967-2022. Unsure why this Japanese expedition data is hosted under CReSIS, but gotta handle it I suppose.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Calculate percentage and count of classified (Bedmap matches OPR) vs unclassified (no matches) points in the Geopackage files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants