Add dry-run data preflight mode in train_model #511

Draft
AR10129 wants to merge 2 commits into mllam:main from AR10129:feat/dry-run-data-preflight

Conversation

@AR10129 commented Mar 24, 2026

Describe your changes

This PR adds an optional dry-run data preflight path in the training CLI to fail fast on dataset/configuration errors before model and trainer initialization.

The new flag validates one batch from the relevant dataloader(s) and checks batch structure, expected tensor dimensions, finite values, forcing-window consistency, and strictly increasing target times.

This reduces late pipeline failures and debugging time for invalid data/window settings.

No new runtime dependencies are introduced.
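
For illustration, here is a minimal sketch of the kind of checks the flag performs. All names (`validate_preflight_batch`, `forcing_window_size`) and the batch layout are assumptions for this sketch, not the actual implementation, which lives in the PR diff:

```python
import torch

def validate_preflight_batch(batch, forcing_window_size):
    """Illustrative preflight checks on a single batch (names and shapes assumed)."""
    # Structural check: the dataloader is expected to yield a 4-tuple
    if not (isinstance(batch, tuple) and len(batch) == 4):
        raise ValueError(f"expected a 4-tuple batch, got {type(batch)}")
    init_states, target_states, forcing, target_times = batch

    # Finite-value check: no NaN or Inf anywhere in the tensors
    for name, tensor in [("init_states", init_states),
                         ("target_states", target_states),
                         ("forcing", forcing)]:
        if not torch.isfinite(tensor).all():
            raise ValueError(f"{name} contains NaN or Inf values")

    # Forcing-window consistency: feature dim must be a multiple of the window
    if forcing.shape[-1] % forcing_window_size != 0:
        raise ValueError("forcing feature dim inconsistent with forcing window size")

    # Target times must be strictly increasing along the time axis,
    # assuming a (batch, time) layout
    if not (target_times[:, 1:] > target_times[:, :-1]).all():
        raise ValueError("target_times are not strictly increasing")
```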

Issue Link

closes #510

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@Sir-Sloth-The-Lazy (Contributor)

Thanks for this! A few observations from reading through:

  • Lightning already does this: num_sanity_val_steps (default 2) runs validation batches before training begins, and shape errors, NaNs, and dtype issues surface there with full stack traces pointing to the exact failing operation (see the sketch after this list). What specific failure does this catch that sanity validation misses?

  • The checks hardcode assumptions: that the batch is a 4-tuple, that init_states.shape[1] is 2, and that target_times has integer dtype. If the data pipeline evolves (e.g. more history steps, ensemble dims), this validation code will need updating alongside the actual code. That's extra maintenance burden.

  • Some checks can't fail in practice because WeatherDataset produces target times by slicing contiguous time steps, so they're monotonic by construction. The forcing_window_size > 0 check also can't fail since it's computed as past + future + 1.

  • The unsqueeze logic (lines 39-48) is concerning: validation code probably shouldn't be reshaping inputs to handle multiple input formats.
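
A minimal illustration of the Lightning behaviour referenced above (standard pytorch_lightning API; the model and datamodule names are hypothetical):

```python
import pytorch_lightning as pl

# The Trainer runs a few validation batches before training starts,
# so shape, NaN, and dtype errors surface early with full stack traces.
trainer = pl.Trainer(num_sanity_val_steps=2)  # 2 is the default; -1 runs all val batches, 0 disables
# trainer.fit(model, datamodule=data_module)  # sanity check happens at the start of fit()
```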

This is just to take some load off the reviewers! Hope you find this constructive, @AR10129. Pardon me if I missed something; I'd be grateful to learn!

@AR10129 (Author) commented Mar 24, 2026

Thanks for the thorough review, these are fair points. Let me address each one:

  1. On num_sanity_val_steps: agreed there is overlap. The intended difference is that --dry_run_data validates data contracts before model and trainer setup and supports explicit preflight workflows. I will keep only checks that add value beyond Lightning sanity validation.

  2. On hardcoded assumptions: fair point. I’ll remove assumptions that are not strict data contracts and derive expectations from runtime config/datastore where possible (sketched below).

  3. On checks that can’t fail in current construction: agreed. I’ll remove the monotonic-time and forcing-window-positive checks.

  4. On unsqueeze logic: agreed. I’ll drop compatibility reshaping and validate canonical batched format only.

I’ll push a follow-up cleanup commit with these changes.
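
As a sketch of what "config-driven" could mean here (the attribute names are assumptions, not the actual config schema):

```python
def expected_forcing_window(config):
    """Derive the forcing window from configuration rather than hardcoding it.

    Mirrors how the window is constructed (past + future + 1), per the
    review discussion above; attribute names are illustrative.
    """
    return config.num_past_forcing_steps + config.num_future_forcing_steps + 1
```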

@sadamov added the enhancement (New feature or request) label Apr 13, 2026
@archit7-beep (Contributor)

@sadamov @AR10129

Hi! I took some time to go through this issue and the associated PR, along with the review discussion.

It looks like the original version had quite strict validation (shape assumptions, monotonic time checks, etc.), and the follow-up changes simplified things a lot to avoid brittle assumptions. That makes sense, but it also seems like we may have lost some invariant-based validation and failure-case coverage in the process.

In particular, there might still be value in validating a small set of configuration-driven invariants (e.g. AR-step alignment and forcing-window consistency) and keeping at least one realistic failure-mode test to ensure the preflight path is actually exercised.

Since there hasn’t been activity on this for a while, I’d be happy to pick this up and propose a small follow-up PR that:

  • keeps the validation minimal and config-driven (no hardcoded assumptions)
  • reintroduces a few robust invariants that can actually fail
  • adds one or two corresponding failure tests (a sketch follows below)

Let me know if that direction sounds reasonable, and I can open a PR !
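
A sketch of what one such failure-mode test could look like, reusing the illustrative validate_preflight_batch from the PR description above (all names and shapes are assumed):

```python
import pytest
import torch

def test_preflight_rejects_nonfinite_forcing():
    """A batch with NaN forcing values must fail preflight validation."""
    init_states = torch.zeros(1, 2, 10, 3)    # (batch, 2 init steps, grid, features)
    target_states = torch.zeros(1, 4, 10, 3)  # (batch, ar_steps, grid, features)
    forcing = torch.full((1, 4, 10, 5), float("nan"))  # deliberately invalid
    target_times = torch.arange(4).unsqueeze(0)        # strictly increasing

    batch = (init_states, target_states, forcing, target_times)
    with pytest.raises(ValueError, match="NaN"):
        validate_preflight_batch(batch, forcing_window_size=5)
```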

@sadamov (Collaborator) commented Apr 27, 2026

@AR10129 are you okay with @archit7-beep taking over here? I did not have time yet to review anything unfortunately, just triaging a bit.

@AR10129 (Author) commented Apr 27, 2026

Hey @archit7-beep, given your analysis and the direction you've proposed, I think it makes sense for you to take this forward from here. The points around config-driven invariants and proper failure-mode tests are well reasoned and I'd love to see them land.

I've been a bit occupied lately and don't want progress on this to stall because of my bandwidth. You clearly have a good grasp of what needs to be done, so please go ahead!

That said, I'd like to stay involved where I can, happy to review, share context on the existing implementation, or test things out as you go.

@sadamov flagging this so you're in the loop!

@sadamov self-requested a review May 4, 2026 07:30

@sadamov (Collaborator) left a comment

@archit7-beep @kshirajahere -- linking these two pieces of work because they overlap significantly.

#613 is proposing to replace the positional 4-tuple batch contract with a ForecastBatch(NamedTuple), end-to-end across WeatherDataset, ar_model.py, and tests. kshirajahere has an implementation locally and is close to opening a PR.

Once that lands, the validation logic in this PR would need to be rewritten -- _validate_preflight_batch currently unpacks a positional tuple and checks shape[1] directly, which would either break or become meaningless against a named batch type. The checks themselves (AR-step alignment, forcing window divisibility) are still worth keeping, but expressed against ForecastBatch fields.

Rather than merging this now and fixing it up later, would it make sense to stack this PR on top of #613's branch? That way:

  • #613 merges first and establishes the batch contract
  • this PR rebases onto it and rewrites _validate_preflight_batch against named fields
  • both land cleanly without a follow-up fixup commit
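
For concreteness, the named batch contract could look something like this (field names and shapes are assumed; the authoritative definition is in #613):

```python
from typing import NamedTuple
import torch

class ForecastBatch(NamedTuple):
    """Illustrative named batch type along the lines proposed in #613."""
    init_states: torch.Tensor    # e.g. (batch, 2, num_grid_nodes, d_state)
    target_states: torch.Tensor  # e.g. (batch, ar_steps, num_grid_nodes, d_state)
    forcing: torch.Tensor        # e.g. (batch, ar_steps, num_grid_nodes, d_forcing)
    target_times: torch.Tensor   # e.g. (batch, ar_steps)

# Validation can then address fields by name instead of position:
# if not torch.isfinite(batch.forcing).all(): ...
```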

@sadamov marked this pull request as draft May 5, 2026 07:47
@archit7-beep (Contributor)

@sadamov
One thought here — since the batch contract is in transition (tuple → ForecastBatch), would it make sense for the preflight validation to support both formats temporarily?

For example, using a small helper to access fields (by name rather than position) so the validation logic remains stable across both representations.

This might allow us to move forward without being fully blocked on #613, while still aligning with the intended direction.
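
A sketch of such a helper, assuming the legacy positional order (init_states, target_states, forcing, target_times):

```python
from typing import Any

# Assumed field order of the legacy positional 4-tuple batch
_TUPLE_FIELDS = ("init_states", "target_states", "forcing", "target_times")

def get_batch_field(batch: Any, name: str):
    """Access a batch field by name for both the legacy tuple and a NamedTuple."""
    if hasattr(batch, "_fields"):  # NamedTuple, e.g. the proposed ForecastBatch
        return getattr(batch, name)
    return batch[_TUPLE_FIELDS.index(name)]  # plain positional tuple
```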

@kshirajahere (Contributor)

@sadamov waiting on the merge of #208; Joel said no other PRs should be merged before that, so waiting 🙃

@sadamov (Collaborator) commented May 6, 2026

@archit7-beep @kshirajahere yeah, exactly. There is no rush or need for intermediate workarounds since we are all waiting for #208.


Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues:

Add --dry_run_data preflight mode in train_model for early dataset validation failures