enhance consolidate training run outputs into a single runs/ directory#580
enhance consolidate training run outputs into a single runs/ directory#580sudhansu-24 wants to merge 5 commits intomllam:mainfrom
Conversation
|
thanks @sadamov for clarifying. I’ll continue implementation/revisions on #580 and coordinate here. |
|
Thanks @sudhansu-24 for taking this forward! From my side, I’ve aligned all training artifacts so they are now scoped under I also updated the logger setup so that both WandbLogger and CustomMLFlowLogger use the same Currently, only one logger is active at a time (default is W&B), and MLflow artifacts are generated when explicitly running with If there are any preferences around structure or logging behavior from earlier work, I’m happy to incorporate them. Let me know if you’d like me to push any additional changes to the PR. |
|
Thanks @Shyam-Sunder-saini could you share the exact changes you want added beyond the current #580 state (especially around the a short checklist by file, or a commit/PR branch we can cherry-pick from? if you post that i will incorporate it quickly so we can finalize review. |
cd1a3ef to
dd2a1a2
Compare
|
Hi @sadamov noticed the duplicate label and i am a bit confused #580 was opened on Apr 4, and as per your coordination note on #293 I asked @Shyam-Sunder-saini and @techaadii to contribute their changes here so we could converge, now i found that there's been seperate PR #586 opened on Apr 9 by @Shyam-Sunder-saini despite me asking any changes they want on top of my PR so i could incoporate it in #580 itself. If you'd prefer to land via #586, let me know and I'll close this. Otherwise I'll wait for your review. |
|
@sudhansu-24 i fixed the labels, could you also fix the title? This is not a bug but an enhancement. you are no both assigned and we will continue with the implementation here in this PR. |
There was a problem hiding this comment.
Implementation is clean and minimal. Inline suggestions below. A few points can't anchor inline because the touched lines are outside the diff hunks, so noting them here:
- README.md wasn't touched and still references the old layout (#293 scoped this). Line 407 says
wandb/dryrun...(no longer produced); line 440 says checkpoints live insaved_models/(now underruns/<run-name>/checkpoints/). Worth a small README pass in this PR. setup_training_loggerinneural_lam/utils.pygained arun_dirparameter but the docstring Parameters section wasn't updated. Please add arun_dir : strentry describing it as the directory under which all run artifacts are written.tests/test_cli.py::test_wandb_logger_kwargsnow passesrun_dirbut doesn't assert it reachesWandbLogger. One extraassert kwargs["save_dir"] == "runs/my-run"next to the existing kwargs assertions closes the loop.
|
Follow-up on the README gap mentioned in the review: the SLURM example at lines 494-495 writes to |
|
thanks @sadamov addressed all four suggestions and the other three changes from review. |
|
@sudhansu-24 please fix the conflict then I can do the final review. |
# Conflicts: # neural_lam/utils.py
@sadamov done, ready for review. |
Describe your changes
Training and evaluation artifacts are written under a single directory
runs/<run-name>/:ModelCheckpointusesruns/<run-name>/checkpoints/,Trainer(default_root_dir=...)keeps Lightning CSV logs under that run instead of a top-levellightning_logs/, andWandbLogger/CustomMLFlowLoggerusesave_dir=run_dirso internal logger paths and code usingself.logger.save_dir(e.g. plots) stay under the run root. Checkpoints remain outside W&B’swandb/subtree so large files are not synced by default.Motivation: Issue #293 and maintainer feedback (W&B selective sync, common run root, MLflow temp images not in CWD).
Dependencies: None
Issue Link
closes #293
Type of change
Checklist before requesting a review
pullwith--rebaseoption if possible).Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee