Add dry-run data preflight mode in train_model#511
AR10129 wants to merge 2 commits into mllam:main
Conversation
Thanks for this! A few observations from reading through:
This is just to take some load off the reviewers! Hope you find this constructive! @AR10129 Pardon me if I missed something; I would be grateful to learn!
Thanks for the thorough review, these are fair points. Let me address each one:
I’ll push a follow-up cleanup commit with these changes.
Hi! I took some time to go through this issue and the associated PR, along with the review discussion. It looks like the original version had quite strict validation (shape assumptions, monotonic time checks, etc.), and the follow-up changes simplified things a lot to avoid brittle assumptions. That makes sense, but it also seems like we may have lost some invariant-based validation and failure-case coverage in the process. In particular, there might still be value in validating a small set of configuration-driven invariants (e.g. AR-step alignment and forcing-window consistency) and keeping at least one realistic failure-mode test to ensure the preflight path is actually exercised. Since there hasn’t been activity on this for a while, I’d be happy to pick this up and propose a small follow-up PR that:
Let me know if that direction sounds reasonable, and I can open a PR!
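To make the proposed follow-up concrete, the config-driven invariants and a realistic failure-mode test could look roughly like this. This is a minimal sketch: the names `validate_preflight_config`, `num_pred_steps`, `num_ar_steps`, and `forcing_window` are illustrative assumptions, not the actual codebase API.

```python
# Hypothetical sketch of config-driven invariant checks; all names here
# are illustrative, not taken from the actual codebase.
def validate_preflight_config(num_pred_steps, num_ar_steps, forcing_window):
    """Fail fast on configuration invariants before any training starts."""
    if num_ar_steps < 1 or forcing_window < 1:
        raise ValueError("num_ar_steps and forcing_window must be >= 1")
    # AR-step alignment: the prediction horizon must divide evenly
    # into autoregressive steps.
    if num_pred_steps % num_ar_steps != 0:
        raise ValueError(
            f"num_pred_steps={num_pred_steps} is not a multiple of "
            f"num_ar_steps={num_ar_steps}"
        )


# One realistic failure-mode test: the preflight must actively reject a
# misaligned configuration, not merely pass on valid input.
def test_preflight_rejects_misaligned_ar_steps():
    try:
        validate_preflight_config(num_pred_steps=7, num_ar_steps=3,
                                  forcing_window=2)
    except ValueError as err:
        assert "not a multiple" in str(err)
    else:
        raise AssertionError("preflight accepted a misaligned config")
```

Keeping at least one test of this shape ensures the preflight path is exercised on bad input, not only on the happy path.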
@AR10129 are you okay with @archit7-beep taking over here? I have not had time to review anything yet, unfortunately; just triaging a bit.
Hey @archit7-beep, given your analysis and the direction you've proposed, I think it makes sense for you to take this forward from here. The points around config-driven invariants and proper failure-mode tests are well reasoned, and I'd love to see them land. I've been a bit occupied lately and don't want progress on this to stall because of my bandwidth. You clearly have a good grasp of what needs to be done, so please go ahead! That said, I'd like to stay involved where I can: happy to review, share context on the existing implementation, or test things out as you go. @sadamov flagging this so you're in the loop!
sadamov left a comment
@archit7-beep @kshirajahere -- linking these two pieces of work because they overlap significantly.
#613 is proposing to replace the positional 4-tuple batch contract with a ForecastBatch(NamedTuple), end-to-end across WeatherDataset, ar_model.py, and tests. kshirajahere has an implementation locally and is close to opening a PR.
Once that lands, the validation logic in this PR would need to be rewritten -- _validate_preflight_batch currently unpacks a positional tuple and checks shape[1] directly, which would either break or become meaningless against a named batch type. The checks themselves (AR-step alignment, forcing window divisibility) are still worth keeping, but expressed against ForecastBatch fields.
Rather than merging this now and fixing it up later, would it make sense to stack this PR on top of #613's branch? That way:
#613 merges first and establishes the batch contract
this PR rebases onto it and rewrites _validate_preflight_batch against named fields
both land cleanly without a follow-up fixup commit
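As a rough illustration of rewriting the validation against named fields: the field names below are assumptions (the real ForecastBatch in #613 may differ), and plain lists stand in for tensors to keep the sketch self-contained.

```python
from typing import NamedTuple


# Sketch of a named batch type; field names are assumptions, since the
# real ForecastBatch from #613 is not shown here. Plain lists stand in
# for tensors to keep the example self-contained.
class ForecastBatch(NamedTuple):
    init_states: list
    target_states: list
    forcing: list
    target_times: list


def validate_preflight_batch(batch: ForecastBatch, forcing_window: int) -> None:
    # Forcing-window divisibility, expressed against a named field
    # rather than batch[2].shape[1].
    if len(batch.forcing) % forcing_window != 0:
        raise ValueError(
            f"forcing length {len(batch.forcing)} is not divisible by "
            f"forcing_window={forcing_window}"
        )
    # Strictly increasing target times, via batch.target_times not batch[3].
    t = batch.target_times
    if any(b <= a for a, b in zip(t, t[1:])):
        raise ValueError("target times must be strictly increasing")
```

The checks themselves (divisibility, monotonic times) carry over unchanged; only the field access moves from positional indexing to named attributes.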
@sadamov One option might be a small helper to access fields by name rather than position, so the validation logic remains stable across both representations. This might allow us to move forward without being fully blocked on #613, while still aligning with the intended direction.
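A helper along those lines could be as small as the sketch below. The name-to-index mapping is an assumption that only illustratively mirrors the current positional 4-tuple order.

```python
# Sketch of the proposed accessor: resolve a batch component by name when
# the batch is a named type, falling back to position for the current
# positional 4-tuple contract. The mapping below is an assumption.
_BATCH_FIELDS = {"init_states": 0, "target_states": 1,
                 "forcing": 2, "target_times": 3}


def batch_field(batch, name):
    """Return one batch component by field name, or by position for tuples."""
    if hasattr(batch, name):  # named batch type (e.g. a NamedTuple)
        return getattr(batch, name)
    return batch[_BATCH_FIELDS[name]]  # positional 4-tuple
```

Validation code written against `batch_field(batch, "forcing")` would then work unchanged whether #613 has landed or not.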
@sadamov Waiting on the merge of #208, as Joel said no other PRs should be merged before that, so waiting 🙃
@archit7-beep @kshirajahere Yeah, exactly. There is no rush or need for intermediate workarounds, since we are all waiting for #208.
Describe your changes
This PR adds an optional dry-run data preflight path in the training CLI to fail fast on dataset/configuration errors before model and trainer initialization.
The new flag validates one batch from the relevant dataloader(s) and checks batch structure, expected tensor dimensions, finite values, forcing-window consistency, and strictly increasing target times.
This reduces late pipeline failures and debugging time for invalid data/window settings.
No new runtime dependencies are introduced.
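The checks described above can be sketched roughly as follows. This is an illustration of the approach only, not the PR's actual code: `preflight_check`, the field order, and the plain-list stand-ins for tensors are all assumptions.

```python
import math


# Illustrative sketch of the dry-run preflight on a single batch; the
# function name and field order are assumptions, and plain lists stand
# in for tensors to keep the example self-contained.
def preflight_check(batch):
    # 1. Batch structure: the positional contract has four components.
    if len(batch) != 4:
        raise ValueError(f"expected 4 batch components, got {len(batch)}")
    init_states, target_states, forcing, target_times = batch
    # 2. Finite values: NaN/inf in inputs should fail before training starts.
    for name, values in [("init_states", init_states),
                         ("target_states", target_states),
                         ("forcing", forcing)]:
        if any(not math.isfinite(v) for v in values):
            raise ValueError(f"non-finite values in {name}")
    # 3. Strictly increasing target times.
    if any(b <= a for a, b in zip(target_times, target_times[1:])):
        raise ValueError("target times are not strictly increasing")
```

Running this on one batch per relevant dataloader before model and trainer initialization is what turns a late mid-epoch crash into an immediate, attributable error.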
Issue Link
closes #510
Type of change
Checklist before requesting a review
pull with --rebase option if possible).
Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee