Add dry-run data preflight mode in train_model #511

Draft
AR10129 wants to merge 2 commits into mllam:main from AR10129:feat/dry-run-data-preflight

Conversation

@AR10129 commented Mar 24, 2026

Describe your changes

This PR adds an optional dry-run data preflight path in the training CLI to fail fast on dataset/configuration errors before model and trainer initialization.

The new flag validates one batch from the relevant dataloader(s) and checks batch structure, expected tensor dimensions, finite values, forcing-window consistency, and strictly increasing target times.

This reduces late pipeline failures and debugging time for invalid data/window settings.

No new runtime dependencies are introduced.
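
For illustration, here is a minimal sketch of the kind of checks the flag performs. All names (`validate_preflight_batch`, `forcing_window_size`) and the batch layout are assumptions for this sketch, not the actual implementation, which lives in the PR diff:

```python
import torch

def validate_preflight_batch(batch, forcing_window_size):
    """Illustrative preflight checks on a single batch (names and shapes assumed)."""
    # Structural check: the dataloader is expected to yield a 4-tuple
    if not (isinstance(batch, tuple) and len(batch) == 4):
        raise ValueError(f"expected a 4-tuple batch, got {type(batch)}")
    init_states, target_states, forcing, target_times = batch

    # Finite-value check: no NaN or Inf anywhere in the tensors
    for name, tensor in [("init_states", init_states),
                         ("target_states", target_states),
                         ("forcing", forcing)]:
        if not torch.isfinite(tensor).all():
            raise ValueError(f"{name} contains NaN or Inf values")

    # Forcing-window consistency: feature dim must be a multiple of the window
    if forcing.shape[-1] % forcing_window_size != 0:
        raise ValueError("forcing feature dim inconsistent with forcing window size")

    # Target times must be strictly increasing along the time axis,
    # assuming a (batch, time) layout
    if not (target_times[:, 1:] > target_times[:, :-1]).all():
        raise ValueError("target_times are not strictly increasing")
```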

Issue Link

closes #510

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@Sir-Sloth-The-Lazy (Contributor)

Thanks for this! A few observations from reading through:

  • Lightning already does this: num_sanity_val_steps (default 2) runs validation batches before training begins, and shape errors, NaNs, and dtype issues surface there with full stack traces pointing to the exact failing operation (see the sketch after this list). What specific failure does this catch that sanity validation misses?

  • The checks hardcode assumptions: that the batch is a 4-tuple, that init_states.shape[1] is 2, and that target_times has integer dtype. If the data pipeline evolves (e.g. more history steps, ensemble dims), this validation code will need updating alongside the actual code. That's extra maintenance burden.

  • Some checks can't fail in practice because WeatherDataset produces target times by slicing contiguous time steps, so they're monotonic by construction. The forcing_window_size > 0 check also can't fail since it's computed as past + future + 1.

  • The unsqueeze logic (lines 39-48) is concerning: validation code probably shouldn't be reshaping inputs to handle multiple input formats.
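
A minimal illustration of the Lightning behaviour referenced above (standard pytorch_lightning API; the model and datamodule names are hypothetical):

```python
import pytorch_lightning as pl

# The Trainer runs a few validation batches before training starts,
# so shape, NaN, and dtype errors surface early with full stack traces.
trainer = pl.Trainer(num_sanity_val_steps=2)  # 2 is the default; -1 runs all val batches, 0 disables
# trainer.fit(model, datamodule=data_module)  # sanity check happens at the start of fit()
```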

This is just to take some load off the reviewers! Hope you find this constructive, @AR10129. Pardon me if I missed something; I'd be grateful to learn!

@AR10129 (Author) commented Mar 24, 2026

Thanks for the thorough review, these are fair points. Let me address each one:

  1. On num_sanity_val_steps: agreed there is overlap. The intended difference is that --dry_run_data validates data contracts before model and trainer setup and supports explicit preflight workflows. I will keep only checks that add value beyond Lightning sanity validation.

  2. On hardcoded assumptions: fair point. I’ll remove assumptions that are not strict data contracts and derive expectations from runtime config/datastore where possible (sketched below).

  3. On checks that can’t fail in current construction: agreed. I’ll remove the monotonic-time and forcing-window-positive checks.

  4. On unsqueeze logic: agreed. I’ll drop compatibility reshaping and validate canonical batched format only.

I’ll push a follow-up cleanup commit with these changes.
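
As a sketch of what "config-driven" could mean here (the attribute names are assumptions, not the actual config schema):

```python
def expected_forcing_window(config):
    """Derive the forcing window from configuration rather than hardcoding it.

    Mirrors how the window is constructed (past + future + 1), per the
    review discussion above; attribute names are illustrative.
    """
    return config.num_past_forcing_steps + config.num_future_forcing_steps + 1
```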

@sadamov added the enhancement (New feature or request) label Apr 13, 2026
@archit7-beep (Contributor)

@sadamov @AR10129

Hi! I took some time to go through this issue and the associated PR, along with the review discussion.

It looks like the original version had quite strict validation (shape assumptions, monotonic time checks, etc.), and the follow-up changes simplified things a lot to avoid brittle assumptions. That makes sense, but it also seems like we may have lost some invariant-based validation and failure-case coverage in the process.

In particular, there might still be value in validating a small set of configuration-driven invariants (e.g. AR-step alignment and forcing-window consistency) and keeping at least one realistic failure-mode test to ensure the preflight path is actually exercised.

Since there hasn’t been activity on this for a while, I’d be happy to pick this up and propose a small follow-up PR that:

  • keeps the validation minimal and config-driven (no hardcoded assumptions)
  • reintroduces a few robust invariants that can actually fail
  • adds one or two corresponding failure tests (a sketch follows below)

Let me know if that direction sounds reasonable, and I can open a PR !
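
A sketch of what one such failure-mode test could look like, reusing the illustrative validate_preflight_batch from the PR description above (all names and shapes are assumed):

```python
import pytest
import torch

def test_preflight_rejects_nonfinite_forcing():
    """A batch with NaN forcing values must fail preflight validation."""
    init_states = torch.zeros(1, 2, 10, 3)    # (batch, 2 init steps, grid, features)
    target_states = torch.zeros(1, 4, 10, 3)  # (batch, ar_steps, grid, features)
    forcing = torch.full((1, 4, 10, 5), float("nan"))  # deliberately invalid
    target_times = torch.arange(4).unsqueeze(0)        # strictly increasing

    batch = (init_states, target_states, forcing, target_times)
    with pytest.raises(ValueError, match="NaN"):
        validate_preflight_batch(batch, forcing_window_size=5)
```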

@sadamov (Collaborator) commented Apr 27, 2026

@AR10129 are you okay with @archit7-beep taking over here? I did not have time yet to review anything unfortunately, just triaging a bit.

@AR10129 (Author) commented Apr 27, 2026

Hey @archit7-beep, given your analysis and the direction you've proposed, I think it makes sense for you to take this forward from here. The points around config-driven invariants and proper failure-mode tests are well reasoned and I'd love to see them land.

I've been a bit occupied lately and don't want progress on this to stall because of my bandwidth. You clearly have a good grasp of what needs to be done, so please go ahead!

That said, I'd like to stay involved where I can, happy to review, share context on the existing implementation, or test things out as you go.

@sadamov flagging this so you're in the loop!

@sadamov self-requested a review May 4, 2026 07:30

@sadamov (Collaborator) left a comment

@archit7-beep @kshirajahere -- linking these two pieces of work because they overlap significantly.

#613 is proposing to replace the positional 4-tuple batch contract with a ForecastBatch(NamedTuple), end-to-end across WeatherDataset, ar_model.py, and tests. kshirajahere has an implementation locally and is close to opening a PR.

Once that lands, the validation logic in this PR would need to be rewritten -- _validate_preflight_batch currently unpacks a positional tuple and checks shape[1] directly, which would either break or become meaningless against a named batch type. The checks themselves (AR-step alignment, forcing window divisibility) are still worth keeping, but expressed against ForecastBatch fields.

Rather than merging this now and fixing it up later, would it make sense to stack this PR on top of #613's branch? That way:

  • #613 merges first and establishes the batch contract
  • this PR rebases onto it and rewrites _validate_preflight_batch against named fields
  • both land cleanly without a follow-up fixup commit
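
For concreteness, the named batch contract could look something like this (field names and shapes are assumed; the authoritative definition is in #613):

```python
from typing import NamedTuple
import torch

class ForecastBatch(NamedTuple):
    """Illustrative named batch type along the lines proposed in #613."""
    init_states: torch.Tensor    # e.g. (batch, 2, num_grid_nodes, d_state)
    target_states: torch.Tensor  # e.g. (batch, ar_steps, num_grid_nodes, d_state)
    forcing: torch.Tensor        # e.g. (batch, ar_steps, num_grid_nodes, d_forcing)
    target_times: torch.Tensor   # e.g. (batch, ar_steps)

# Validation can then address fields by name instead of position:
# if not torch.isfinite(batch.forcing).all(): ...
```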

@sadamov marked this pull request as draft May 5, 2026 07:47
@archit7-beep (Contributor)

@sadamov
One thought here — since the batch contract is in transition (tuple → ForecastBatch), would it make sense for the preflight validation to support both formats temporarily?

For example, using a small helper to access fields (by name rather than position) so the validation logic remains stable across both representations.

This might allow us to move forward without being fully blocked on #613, while still aligning with the intended direction.
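
A sketch of such a helper, assuming the legacy positional order (init_states, target_states, forcing, target_times):

```python
from typing import Any

# Assumed field order of the legacy positional 4-tuple batch
_TUPLE_FIELDS = ("init_states", "target_states", "forcing", "target_times")

def get_batch_field(batch: Any, name: str):
    """Access a batch field by name for both the legacy tuple and a NamedTuple."""
    if hasattr(batch, "_fields"):  # NamedTuple, e.g. the proposed ForecastBatch
        return getattr(batch, name)
    return batch[_TUPLE_FIELDS.index(name)]  # plain positional tuple
```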

@kshirajahere (Contributor)

@sadamov waiting on the merge of #208; Joel said no other PRs should be merged before that, so waiting 🙃

@sadamov (Collaborator) commented May 6, 2026

@archit7-beep @kshirajahere yeah, exactly. There is no rush or need for intermediate workarounds since we are all waiting for #208.


Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues:

Add --dry_run_data preflight mode in train_model for early dataset validation failures