enhance consolidate training run outputs into a single runs/ directory by sudhansu-24 · Pull Request #580 · mllam/neural-lam

sudhansu-24 · 2026-04-04T14:34:13Z

Describe your changes

Training and evaluation artifacts are written under a single directory runs/<run-name>/: ModelCheckpoint uses runs/<run-name>/checkpoints/, Trainer(default_root_dir=...) keeps Lightning CSV logs under that run instead of a top-level lightning_logs/, and WandbLogger / CustomMLFlowLogger use save_dir=run_dir so internal logger paths and code using self.logger.save_dir (e.g. plots) stay under the run root. Checkpoints remain outside W&B’s wandb/ subtree so large files are not synced by default.

Motivation: Issue #293 and maintainer feedback (W&B selective sync, common run root, MLflow temp images not in CWD).
Dependencies: None
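
A minimal sketch of the layout described above, using only the standard library. The helper name run_layout and the default value of RUNS_ROOT are assumptions made for illustration; they are not taken from the diff.

```python
from pathlib import Path

# Assumed default root; the real constant in the PR may be defined differently.
RUNS_ROOT = Path("runs")


def run_layout(run_name, runs_root=RUNS_ROOT):
    """Hypothetical helper: return the directories a single run writes to."""
    run_dir = Path(runs_root) / run_name
    return {
        "run_dir": run_dir,
        # Checkpoints sit next to, not inside, W&B's wandb/ subtree.
        "checkpoints": run_dir / "checkpoints",
    }


layout = run_layout("my-run")
# These paths would then be handed to Lightning roughly as follows:
#   ModelCheckpoint(dirpath=layout["checkpoints"], ...)
#   Trainer(default_root_dir=layout["run_dir"], ...)
#   WandbLogger(save_dir=layout["run_dir"], ...)
```

Deriving every path from one run_dir is what keeps logger output, CSV logs and checkpoints under the same run root.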

Issue Link

closes #293

Type of change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
I have performed a self-review of my code
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
I have updated the README to cover introduced code changes
I have added tests that prove my fix is effective or that my feature works
I have given the PR a name that clearly describes the change, written in imperative form (context).
[] I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

the code is readable
the code is well tested
the code is documented (including return types and parameters)
the code is easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change, in a section
reflecting type of change (add section where missing):
- added: when you have added new functionality
- changed: when default behaviour of the code has been changed
- fixes: when your contribution fixes a bug
- maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

PR is up to date with the base branch
the tests pass
(if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
Once the PR is ready to be merged, squash commits and merge the PR.

joeloskarsson · 2026-04-05T09:54:54Z

@sadamov assigning you here to decide later if this should be closed in favor of #297 or what is the best path forward with this.

sudhansu-24 · 2026-04-08T08:23:19Z

thanks @sadamov for clarifying.

I’ll continue implementation/revisions on #580 and coordinate here.
@Shyam-Sunder-saini @techaadii, if you have any pending changes or preferences from your earlier work that should be included, please share them and I will incorporate them into this PR so we can converge quickly.

Shyam-Sunder-saini · 2026-04-08T16:12:38Z

Thanks @sudhansu-24 for taking this forward!

From my side, I’ve aligned all training artifacts so they are now scoped under runs/<run-name>/, including checkpoints, Lightning logs, W&B, and MLflow outputs.

I also updated the logger setup so that both WandbLogger and CustomMLFlowLogger use the same save_dir (run directory). Additionally, MLflow now falls back to the run directory if MLFLOW_TRACKING_URI is not set.
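
The fallback described above could look roughly like the sketch below. The function name resolve_tracking_uri is hypothetical, and returning the run directory as a plain local path is an assumption about how the fallback is expressed.

```python
import os
from pathlib import Path


def resolve_tracking_uri(run_dir, env=None):
    """Hypothetical helper: choose the MLflow tracking location for a run."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI")
    if uri:
        # An explicitly configured tracking server or store always wins.
        return uri
    # Assumed fallback: keep MLflow output under the run directory.
    return str(Path(run_dir))
```

Passing env explicitly keeps the helper easy to test without modifying the process environment.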

Currently, only one logger is active at a time (default is W&B), and MLflow artifacts are generated when explicitly running with --logger mlflow.

If there are any preferences around structure or logging behavior from earlier work, I’m happy to incorporate them.

Let me know if you’d like me to push any additional changes to the PR.

sudhansu-24 · 2026-04-08T18:15:25Z

Thanks @Shyam-Sunder-saini

could you share the exact changes you want added beyond the current #580 state (especially around the MLFLOW_TRACKING_URI fallback) either as:

a short checklist by file, or a commit/PR branch we can cherry-pick from?

if you post that i will incorporate it quickly so we can finalize review.

sudhansu-24 · 2026-04-22T15:44:43Z

Hi @sadamov, I noticed the duplicate label and I am a bit confused.

#580 was opened on Apr 4, and as per your coordination note on #293 I asked @Shyam-Sunder-saini and @techaadii to contribute their changes here so we could converge. I have now found that a separate PR, #586, was opened on Apr 9 by @Shyam-Sunder-saini, even though I had asked for any changes they wanted on top of my PR so I could incorporate them into #580 itself.

If you'd prefer to land via #586, let me know and I'll close this. Otherwise I'll wait for your review.

sadamov · 2026-04-23T03:54:49Z

@sudhansu-24 I fixed the labels, could you also fix the title? This is not a bug but an enhancement. You are now both assigned and we will continue with the implementation here in this PR.

sudhansu-24 · 2026-04-23T07:40:40Z

@sadamov Thanks for the clarification!
Updated the PR title and will continue on #580 and address any remaining review feedback here.

sadamov

Implementation is clean and minimal. Inline suggestions below. A few points can't anchor inline because the touched lines are outside the diff hunks, so noting them here:

README.md wasn't touched and still references the old layout (#293 scoped this). Line 407 says wandb/dryrun... (no longer produced); line 440 says checkpoints live in saved_models/ (now under runs/<run-name>/checkpoints/). Worth a small README pass in this PR.
setup_training_logger in neural_lam/utils.py gained a run_dir parameter but the docstring Parameters section wasn't updated. Please add a run_dir : str entry describing it as the directory under which all run artifacts are written.
tests/test_cli.py::test_wandb_logger_kwargs now passes run_dir but doesn't assert it reaches WandbLogger. One extra assert kwargs["save_dir"] == "runs/my-run" next to the existing kwargs assertions closes the loop.
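
The extra assertion from the last point can be sketched in isolation as below. build_wandb_logger is a hypothetical stand-in for the W&B branch of setup_training_logger, and a MagicMock replaces WandbLogger; the real test in tests/test_cli.py may patch things differently.

```python
from unittest.mock import MagicMock


def build_wandb_logger(logger_cls, run_name, run_dir):
    """Hypothetical stand-in: forward run_dir to the logger as save_dir."""
    return logger_cls(name=run_name, save_dir=run_dir)


def test_wandb_logger_receives_run_dir():
    fake_logger_cls = MagicMock()  # stands in for WandbLogger
    build_wandb_logger(fake_logger_cls, "my-run", "runs/my-run")
    kwargs = fake_logger_cls.call_args.kwargs
    # The assertion suggested in the review: run_dir must reach save_dir.
    assert kwargs["save_dir"] == "runs/my-run"


test_wandb_logger_receives_run_dir()
```

Without this check the test would still pass if run_dir were accepted but silently dropped before reaching the logger.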

sadamov · 2026-04-24T09:58:33Z

Follow-up on the README gap mentioned in the review: the SLURM example at lines 494-495 writes to lightning_logs/. We should also adjust that to the new run folder:

#SBATCH --output=runs/slurm_logs/neurallam_out_%j.log
#SBATCH --error=runs/slurm_logs/neurallam_err_%j.log

sudhansu-24 · 2026-04-25T11:52:37Z

thanks @sadamov

addressed all four suggestions and the other three changes from review.
local tests/test_cli.py (7/7) and tests/test_training.py::test_training[dummydata] pass.

sadamov · 2026-04-27T07:56:55Z

@sudhansu-24 please fix the conflict then I can do the final review.

# Conflicts: # neural_lam/utils.py

sudhansu-24 · 2026-04-27T08:10:50Z

> @sudhansu-24 please fix the conflict then I can do the final review.

@sadamov done, ready for review.

sudhansu-24 mentioned this pull request Apr 4, 2026

Consolidate training run outputs into a single runs/<run-name>/ directory #297

Closed

21 tasks

joeloskarsson assigned sadamov Apr 5, 2026

sadamov mentioned this pull request Apr 8, 2026

Consolidate Training/Evaluation Run Outputs into a Single runs/ Directory #293

Open

fix consolidate training run outputs into a single runs/ directory

dd2a1a2

sudhansu-24 force-pushed the fix-run-outputs branch from cd1a3ef to dd2a1a2 April 12, 2026 15:57

sadamov added the bug label Apr 13, 2026

update tests for run_dir logger arg

ccf02cb

sadamov added the duplicate label Apr 17, 2026

sadamov self-requested a review April 17, 2026 19:30

sadamov mentioned this pull request Apr 23, 2026

Mlflow _Tracking _Uri Fix #586

Closed

Merge branch 'main' into fix-run-outputs

331a382

sadamov added the enhancement label and removed the duplicate and bug labels Apr 23, 2026

sudhansu-24 changed the title ~~fix consolidate training run outputs into a single runs/ directory~~ enhance consolidate training run outputs into a single runs/ directory Apr 23, 2026

sadamov reviewed Apr 24, 2026

Comment thread .gitignore Outdated

Comment thread neural_lam/custom_loggers.py Outdated

Comment thread neural_lam/custom_loggers.py

Comment thread neural_lam/train_model.py Outdated

after-review changes: save_dir, RUNS_ROOT, README and docstring updates

3bd16d8

sadamov self-requested a review April 27, 2026 07:56

Merge remote-tracking branch 'upstream/main' into fix-run-outputs

197868b

# Conflicts: # neural_lam/utils.py

sadamov mentioned this pull request Apr 28, 2026

Resource leak and temp file accumulation in log_image() #496

Open

21 tasks
