Restrict possible multi-file reading errors.#1507
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds defensive checks to reduce silent data corruption when (1) resampling time series with irregular gaps, and (2) reading multi-file xarray datasets where resolved {year}/{month} placeholders don’t match the opened dataset’s time coverage. It also blocks an unsafe xarray.open_mfdataset(parallel=True, lock=False) configuration under threaded Dask schedulers.
Changes:
- Add
require_monotonic=True(default) toresample_timeto reject irregular time steps before resampling. - Track resolved
{year}/{month}placeholders inConventionResolverand validate multi-file completeness against the opened dataset’s time coordinate. - Validate xarray driver kwargs to reject
parallel=True+lock=Falseunder threaded Dask schedulers; add tests + docs/changelog updates.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
hydromt/model/processes/meteo.py |
Adds monotonic/regular-time requirement before resampling and time-delta helpers. |
hydromt/data_catalog/uri_resolvers/convention_resolver.py |
Normalizes Windows paths, logs resolved URIs in debug, and stores resolved placeholders. |
hydromt/data_catalog/sources/data_source.py |
Adds multi-file completeness check comparing resolved placeholder periods to dataset time coordinate. |
hydromt/data_catalog/sources/dataset.py |
Calls multi-file completeness check for convention-resolved datasets. |
hydromt/data_catalog/sources/geodataset.py |
Calls multi-file completeness check for convention-resolved geodatasets; fixes docstring typos. |
hydromt/data_catalog/sources/rasterdataset.py |
Calls multi-file completeness check for convention-resolved raster datasets. |
hydromt/data_catalog/drivers/xarray_options.py |
Rejects unsafe parallel=True + lock=False under threaded Dask schedulers. |
tests/model/processes/test_meteo.py |
Adds regression test for resampling rejection on irregular time-step gaps. |
tests/data_catalog/uri_resolvers/test_convention_resolver.py |
Adds tests for debug logging of expanded URIs and Windows path normalization. |
tests/data_catalog/sources/test_common_functionality.py |
Adds tests for multifile placeholder-vs-time completeness checks. |
tests/data_catalog/drivers/raster/test_raster_xarray_driver.py |
Adds tests for Dask scheduler validation around parallel/lock handling. |
docs/user_guide/data_catalog/data_types.rst |
Documents xarray driver option validation behavior. |
docs/user_guide/data_catalog/data_prepare_cat.rst |
Documents multi-file placeholder completeness checks. |
docs/changelog.rst |
Records behavior changes in resampling, driver option validation, and multifile completeness checks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
deltamarnix
left a comment
There was a problem hiding this comment.
I would like to see some changes in the way that these functions are called, but I agree with the overall construct of the erroring.
Co-authored-by: Marnix <150045289+deltamarnix@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
LuukBlom
left a comment
There was a problem hiding this comment.
LGTM! nice job
I have 1 nit that you can decide to implement or not.
Issue addressed
Fixes Deltares/hydromt_wflow#741
Explanation
While we have not been able to reproduce the issue, we know it is caused by dropped chunks of data (whole years). We prevent this from happening again by introducing the
require_monotonic=Truekeyword to resample time. This will raise an error instead of trying to resample a daily timeseries with whole years missing.Since we don't understand how this corruption is introduced, we code defensively to list all multifile urls in debug mode, hopefully catching missing files (an easy explanation, could be filesystem issues under heavy load). We also introduce a check based on the multifile resolver that a multifile dataset needs to have the years (or months, or a combination) found from the placeholders urls. So dataset based on
{year}.ncwith[2021.nc, 2022.nc]requires a2021and2022timestamp in the time dimension (if any).Finally, we introduce an error when users pass
parallel=Trueandlock=Falsefor a dataset with the default or threaded dask options, as it is guaranteed to cause corruption, as the netcdf library is not thread safe.General Checklist
mainAdditional Notes
Note that this issue could still occur (just not in
resample_time), when the bug is in the dask computation. Here we exclude other (easier) possible causes. We could make the time dimension check delayed as well.Since this is potentially breaking if your datasets contain non-monotonic time periods (unexpectedly), I'm pinging @JoostBuitink @hboisgon @atsiokanos @michaelohanrahan @veenstrajelmer @DirkEilander. This is an FYI, but let me know if this breaks something for you.
PS. The
p:/wflow_global/forcing/MSWEP_V316_test/Past/Yearly/mswep_v316_{year}.ncdataset contains two missing days in August 1993.PPS @deltamarnix you made Deltares/hydromt_wflow@0d462a7, do you have a link to the original issue?