This is an enhancement of #37. Right now in xopr, we can query layer data from OPR, and bedmap data from Bedmap (versions 1, 2, and 3). These two datasets overlap: some of the data in Bedmap comes from OPR partners and is also present in the layer data. However, this isn't a sub/superset case; there are cases where the layer data from OPR hasn't been incorporated into Bedmap, or where that data may have been subsampled (so that some but not all of the bed picks exist in Bedmap). Similarly, there's lots of data in Bedmap that isn't present in OPR, especially for campaigns that don't release echograms (e.g., PRC flight lines).
We want the ability to compare and combine these data intelligently. A driving use case is building a statistical model of the bed topography: we want as much data as possible for the interpolation / GP realizations, but we don't want to double-count input measurements where the two datasets overlap.
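As a rough illustration of the "combine without double-counting" step, here is a minimal sketch. It assumes we already have a per-row match result (`bedmap_to_opr`, a hypothetical list giving the index of the matching OPR pick for each Bedmap row, or `None`); the function name and the `source` field are also made up for the example and are not part of xopr.

```python
def combine_for_interpolation(opr_rows, bedmap_rows, bedmap_to_opr):
    """Merge OPR and Bedmap points, skipping Bedmap points already in OPR.

    opr_rows / bedmap_rows: lists of dicts (one per measurement).
    bedmap_to_opr[i]: index of the OPR row matched to Bedmap row i,
    or None when no match was found (hypothetical match result).
    """
    # Keep every OPR measurement and tag where it came from.
    combined = [dict(row, source="opr") for row in opr_rows]
    # Only add Bedmap measurements that have no OPR counterpart.
    combined += [
        dict(row, source="bedmap")
        for row, match in zip(bedmap_rows, bedmap_to_opr)
        if match is None
    ]
    return combined
```

Preferring the OPR copy of a shared measurement is just one choice here; it keeps the echogram link available for those points.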
We'd also like to be able to link explicitly between measurements that appear in both datasets. Use cases would be (a) identifying whether any values differ between datasets that are derived from the same field campaign (so we can investigate why), and (b) grabbing the echogram of the radar data that a bed pick was generated from. Right now we can do this for OPR data; that is, link from layer to echogram or echogram to layer. The new functionality would be the ability to link from a Bedmap pick to an echogram (when we have it).
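For use case (a), once Bedmap rows are linked to OPR picks, the comparison itself is simple. The sketch below is only illustrative: `bedmap_to_opr`, the `bed_elev` field name, and the 10 m tolerance are all assumptions, not existing xopr names or agreed thresholds.

```python
def flag_disagreements(opr_rows, bedmap_rows, bedmap_to_opr,
                       field="bed_elev", tol=10.0):
    """List matched pairs whose values differ by more than `tol`.

    Returns tuples of (bedmap_index, opr_index, bedmap_value - opr_value).
    Pairs where either side is missing the field are skipped.
    """
    flagged = []
    for i, match in enumerate(bedmap_to_opr):
        if match is None:
            continue  # no OPR counterpart, nothing to compare
        a = bedmap_rows[i].get(field)
        b = opr_rows[match].get(field)
        if a is None or b is None:
            continue  # variable not filled in on one side
        if abs(a - b) > tol:
            flagged.append((i, match, a - b))
    return flagged
```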
Computationally, the best pipeline is probably a combination of campaign matching and space-time snapping. Some of the data from Bedmap 2 and 3 is relatively well behaved (i.e., the input CSV data is for a season that can be matched to the same season in OPR), so we would first do a campaign-to-campaign match, and then match from Bedmap to that OPR data using the time and location information. In the best case, we might even be able to use time alone, when the Bedmap data includes something like a GPS time column. The caveat is that the Bedmap data is pretty variable in quality; there are cases where the time column just has a year or day, and then we're back to using spatial distance calculations.
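To make the snapping step concrete, here is a minimal pure-Python sketch, assuming the campaign-to-campaign match has already narrowed things down to one season of OPR picks. The row layout (`lon`, `lat`, `t` in seconds, with `t` set to `None` when the Bedmap time is too coarse), the function names, and the tolerances are all placeholders for discussion, not existing xopr API. A real implementation would likely use a spatial index rather than the brute-force search shown here.

```python
import bisect
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def snap_to_opr(bedmap_rows, opr_rows, max_dt_s=1.0, max_dist_m=50.0):
    """Return, for each Bedmap row, the index of the matching OPR row or None.

    Rows with a usable time are matched on time only; rows without one
    fall back to the nearest OPR pick within max_dist_m.
    """
    # Sort OPR picks by time once so each lookup is a binary search.
    order = sorted(range(len(opr_rows)), key=lambda i: opr_rows[i]["t"])
    times = [opr_rows[i]["t"] for i in order]
    matches = []
    for row in bedmap_rows:
        best = None
        if row.get("t") is not None and times:
            # Time snap: check the neighbours on either side of the insertion point.
            k = bisect.bisect_left(times, row["t"])
            cands = [j for j in (k - 1, k) if 0 <= j < len(times)]
            j = min(cands, key=lambda c: abs(times[c] - row["t"]))
            if abs(times[j] - row["t"]) <= max_dt_s:
                best = order[j]
        elif opr_rows:
            # Spatial fallback: brute-force nearest neighbour.
            dists = [haversine_m(row["lon"], row["lat"], o["lon"], o["lat"])
                     for o in opr_rows]
            i = min(range(len(dists)), key=dists.__getitem__)
            if dists[i] <= max_dist_m:
                best = i
        matches.append(best)
    return matches
```

One open design question this leaves out: whether a time match should also be sanity-checked against the spatial distance before we accept it.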
Bedmap 1 is awful; it's a single CSV file that doesn't identify which rows come from which campaigns. Fortunately, Bedmap 1 is confined to data from 2000 and earlier, which means there are only a few seasons in OPR to check against. There are also plenty of cases where we don't have all the variables filled in, e.g., only a surface or bed elevation (so we don't have thickness), or just thickness. These gaps tend to be more common in earlier seasons (i.e., most of Bedmap 1). Ideally, we'd like to solve this problem for all of the data, but we could iterate to a fast 90% solution by ignoring Bedmap 1 (to start).
We'll need functions for this, although it's unlikely that users will call them directly; it will probably make sense for us to reprocess the Bedmap geoparquet files and add a few columns that users can query on instead. However, since we're still expanding both the OPR and xopr data holdings, we'll also need a pipeline we can rerun from time to time to update things. While we could look at adding data to the layer files, it makes the most sense to modify the geoparquet data for Bedmap instead, since we have unrestricted access to update those (and don't have write access to the OPR layer data files).