Bedmap2 and OPR fuzzy line matching script#72
Conversation
Find closely matching Bedmap2 and OPR radar lines on a segment by segment basis. Uses hausdorff distance as the fuzzy matching metric. Intended outputs are the Bedmap IDs that have a low hausdorff distance (based on some threshold) to the OPR line segments, which indicate a close enough fuzzy match.
| # Report Bedmap IDs and their hausdorff distance to nearest OPR line segment | ||
| print(gdf_bedmap_[["id", "hausdorff_dist"]].sort_values(by="hausdorff_dist")) |
There was a problem hiding this comment.
Example output from year 2002, sorted by ascending hausdorff distance:
id hausdorff_dist
52 Data_20021126_01_002 3.026301e+02
50 Data_20021126_01_001 1.088522e+03
37 Data_20021210_01_003 1.501179e+04
36 Data_20021210_01_002 1.588442e+04
48 Data_20021210_01_014 2.650424e+04
.. ... ...
4 Data_20021212_01_005 1.499097e+06
54 Data_20021126_01_004 1.619117e+06
15 Data_20021206_01_001 1.772265e+06
44 Data_20021210_01_010 2.720349e+06
46 Data_20021210_01_012 2.828567e+06
[64 rows x 2 columns]
We'll need to set a threshold somewhere to determine whether there is a overlap/match or not. Best if the bedmap2 linestrings are not simplified to enable setting a low threshold (<100km or so?).
Accidentally got confused when naming the bedmap catalog opr and vice versa, so renaming them to be correct.
Convenience function to read parquet files from remote object storage using obstore and geopandas.
Replace the previous hausdorff distance based fuzzy matching algorithm with a more exact method that relies on access to the dense BedMAP XY points. The algorithm works by looping through each OPR line segment, and finds the corresponding series of points in the BedMAP database using some distance tolerance. There are quite a few hardcoded heuristics/assumptions, so need to check more thoroughly on all years.
Refine algorithm to work on years besides 2002 that are less clean. Need to sort the BedMap points by timestamp besides the OPR lines now, and we can't assume trajectory_id is unique so just use the row index. Added an extra check to skip trying a different tolerance when most points are already too far (>200m) away from the line segment. Also skipping cases where there is only one point matching, should be at least 2 points to form a line.
Do a fast check of the dense BedMap points against sparse OPR points first, and if they match great! If not, download the dense OPR points (in .mat format) from CReSIS and redo the distance-based matching.
Helps to label some cases where flight lines are parallel and quite close to each other spatially, and happens in close enough sequence ids.
Running backwards in time from 2019 to 2002 now, starting with the BedMap3 archive.
Replacing the slow exact distance match (which requires pulling in the CReSIS .mat files) with a temporal match instead that is a bit faster. Still missing lots of cases where initial distance match didn't work (because sparse OPR lines are too far from the dense BedMap points). Partially reverts 6c4e57e
Use row-based delta in terms of distance or time to refine head and tail indexes for matching OPR segments to BedMAP points. Specifically, using pd.Series.pct_change for spatial, and pd.Series.diff for temporal large diff finding between subsequent rows. Still need to fix some off-by-one errors though.
Somewhat reverting 7dd38a9, because temporal method is not so good for the super tricky cases where flightlines may be a day or so apart. Now using h5py instead of scipy.io to read the .mat files. Changed some of the tolerance values and variable names to fit recent code changes.
Catch edge case when df_dist_delta > 200 returns no results, causing an IndexError in the head/tail shift logic. Let things jump straight to the slow dense point checking fallback.
0c24633 to
de5a417
Compare
Manual match-case to match OPR collections to the relevant BedMAP campaign.
| with tempfile.TemporaryDirectory() as tmpdir: | ||
| mat_fpath = os.path.basename(p := url) # e.g. Data_20101026_01_001.mat | ||
| file_name = os.path.join(tmpdir, mat_fpath) | ||
| urllib.request.urlretrieve(url=p, filename=file_name) # download to tempfile | ||
| dat = h5py.File(name=file_name) |
There was a problem hiding this comment.
Hitting into some OSError: Unable to synchronously open file (file signature not found) errors when trying to open some of the *.mat files using h5py (e.g. https://data.cresis.ku.edu/data/rds/2013_Antarctica_Basler/CSARP_standard/20131219_02/Data_20131219_02_005.mat). Might be some corrupted files on the CReSIS servers?
There was a problem hiding this comment.
Ah ok, so that file is a MATLAB 5.0 MAT-file, which h5py can't read because it is not a HDF5 file (Only MATLAB 7.3 and above are HDF5 according to https://www.mathworks.com/help/matlab/import_export/mat-file-versions.html). Need to use scipy.io.loadmat instead as a fallback.
The 2009_Antarctica_TO_Gambit campaign over the Gamburtsev Mountains doesn't overlap with the BedMap3 lines, so removing. 2009_Antartica_TO over Thwaites and Pine Island Glacier overlaps with both CRESIS_2009_AntarcticaTO_AIR_BM3 & CRESIS_2009_Thwaites_AIR_BM3 though.
|
FYI, all the xopr stac catalogs have the mbox polygon coverage now (if that's helpful)-- also, we've merged #73 and updated to 0.5.0 >.> |
Fix `SyntaxError: 'await' outside function` by properly wrapping things in asyncio.run. Also added some extra print statements to indicate main sections of the script.
005df6b to
7511e88
Compare
Timestamps are only valid for BedMap3 and two BedMap2 seasons (2011/2012). The other BedMap2 seasons (2002, 2004, 2009, 2010) use interpolated timestamps so can't use the temporal filter reliably.
Tsutaki et al. 2022, https://doi.org/10.5194/tc-16-2967-2022. Unsure why this Japanese expedition data is hosted under CReSIS, but gotta handle it I suppose.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Calculate percentage and count of classified (Bedmap matches OPR) vs unclassified (no matches) points in the Geopackage files.
What I am changing
How I did it
Using hausdorff distance function (geopandas.Series.hausdorff_distance) as basis for fuzzy matching close enough linestringsTODO:
How you can test it
scripts/bedmap_overlap_opr.pyRelated Issues
#69
Fun fact: I used this hausdorff distance metric in a previous GIS job (10 years ago) for matching road segments when new data would come in from city councils, and we need to find where the new roads were compared to the old ones 😃superseded