Skip to content

Restrict possible multi-file reading errors.#1507

Merged
evetion merged 5 commits into
mainfrom
feat/temporal-defense
Jun 23, 2026
Merged

Restrict possible multi-file reading errors.#1507
evetion merged 5 commits into
mainfrom
feat/temporal-defense

Conversation

@evetion

@evetion evetion commented Jun 20, 2026

Copy link
Copy Markdown
Member

Issue addressed

Fixes Deltares/hydromt_wflow#741

Explanation

While we have not been able to reproduce the issue, we know it is caused by dropped chunks of data (whole years). We prevent this from happening again by introducing the require_monotonic=True keyword to resample time. This will raise an error instead of trying to resample a daily timeseries with whole years missing.

Since we don't understand how this corruption is introduced, we code defensively to list all multifile urls in debug mode, hopefully catching missing files (an easy explanation, could be filesystem issues under heavy load). We also introduce a check based on the multifile resolver that a multifile dataset needs to have the years (or months, or a combination) found from the placeholders urls. So dataset based on {year}.nc with [2021.nc, 2022.nc] requires a 2021 and 2022 timestamp in the time dimension (if any).

Finally, we introduce an error when users pass parallel=True and lock=False for a dataset with the default or threaded dask options, as it is guaranteed to cause corruption, as the netcdf library is not thread safe.

General Checklist

  • Updated tests or added new tests
  • Branch is up to date with main
  • Tests & pre-commit hooks pass
  • Updated documentation
  • Updated changelog.rst

Additional Notes

Note that this issue could still occur (just not in resample_time), when the bug is in the dask computation. Here we exclude other (easier) possible causes. We could make the time dimension check delayed as well.

Since this is potentially breaking if your datasets contain non-monotonic time periods (unexpectedly), I'm pinging @JoostBuitink @hboisgon @atsiokanos @michaelohanrahan @veenstrajelmer @DirkEilander. This is an FYI, but let me know if this breaks something for you.

PS. The p:/wflow_global/forcing/MSWEP_V316_test/Past/Yearly/mswep_v316_{year}.nc dataset contains two missing days in August 1993.
PPS @deltamarnix you made Deltares/hydromt_wflow@0d462a7, do you have a link to the original issue?

@evetion evetion requested a review from LuukBlom June 20, 2026 20:03

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds defensive checks to reduce silent data corruption when (1) resampling time series with irregular gaps, and (2) reading multi-file xarray datasets where resolved {year}/{month} placeholders don’t match the opened dataset’s time coverage. It also blocks an unsafe xarray.open_mfdataset(parallel=True, lock=False) configuration under threaded Dask schedulers.

Changes:

  • Add require_monotonic=True (default) to resample_time to reject irregular time steps before resampling.
  • Track resolved {year}/{month} placeholders in ConventionResolver and validate multi-file completeness against the opened dataset’s time coordinate.
  • Validate xarray driver kwargs to reject parallel=True + lock=False under threaded Dask schedulers; add tests + docs/changelog updates.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
hydromt/model/processes/meteo.py Adds monotonic/regular-time requirement before resampling and time-delta helpers.
hydromt/data_catalog/uri_resolvers/convention_resolver.py Normalizes Windows paths, logs resolved URIs in debug, and stores resolved placeholders.
hydromt/data_catalog/sources/data_source.py Adds multi-file completeness check comparing resolved placeholder periods to dataset time coordinate.
hydromt/data_catalog/sources/dataset.py Calls multi-file completeness check for convention-resolved datasets.
hydromt/data_catalog/sources/geodataset.py Calls multi-file completeness check for convention-resolved geodatasets; fixes docstring typos.
hydromt/data_catalog/sources/rasterdataset.py Calls multi-file completeness check for convention-resolved raster datasets.
hydromt/data_catalog/drivers/xarray_options.py Rejects unsafe parallel=True + lock=False under threaded Dask schedulers.
tests/model/processes/test_meteo.py Adds regression test for resampling rejection on irregular time-step gaps.
tests/data_catalog/uri_resolvers/test_convention_resolver.py Adds tests for debug logging of expanded URIs and Windows path normalization.
tests/data_catalog/sources/test_common_functionality.py Adds tests for multifile placeholder-vs-time completeness checks.
tests/data_catalog/drivers/raster/test_raster_xarray_driver.py Adds tests for Dask scheduler validation around parallel/lock handling.
docs/user_guide/data_catalog/data_types.rst Documents xarray driver option validation behavior.
docs/user_guide/data_catalog/data_prepare_cat.rst Documents multi-file placeholder completeness checks.
docs/changelog.rst Records behavior changes in resampling, driver option validation, and multifile completeness checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hydromt/model/processes/meteo.py Outdated
Comment thread hydromt/data_catalog/uri_resolvers/convention_resolver.py
Comment thread hydromt/data_catalog/sources/data_source.py Outdated
Comment thread hydromt/data_catalog/sources/data_source.py Outdated
Comment thread hydromt/data_catalog/sources/data_source.py Outdated
Comment thread tests/data_catalog/drivers/raster/test_raster_xarray_driver.py

@deltamarnix deltamarnix left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see some changes in the way that these functions are called, but I agree with the overall construct of the erroring.

Comment thread hydromt/data_catalog/drivers/xarray_options.py Outdated
Comment thread hydromt/data_catalog/sources/data_source.py Outdated
Comment thread hydromt/data_catalog/sources/dataset.py
Comment thread hydromt/data_catalog/uri_resolvers/convention_resolver.py
evetion and others added 3 commits June 23, 2026 11:05
Co-authored-by: Marnix <150045289+deltamarnix@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@LuukBlom LuukBlom left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! nice job

I have 1 nit that you can decide to implement or not.

@evetion evetion merged commit 10efe7a into main Jun 23, 2026
32 checks passed
@evetion evetion deleted the feat/temporal-defense branch June 23, 2026 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Different hydromt-generated precipitation forcings on Linux vs Windows

4 participants