enhance consolidate training run outputs into a single runs/ directory #580

Open
sudhansu-24 wants to merge 5 commits into mllam:main from sudhansu-24:fix-run-outputs

Conversation

@sudhansu-24

Describe your changes

Training and evaluation artifacts are written under a single directory runs/<run-name>/: ModelCheckpoint uses runs/<run-name>/checkpoints/, Trainer(default_root_dir=...) keeps Lightning CSV logs under that run instead of a top-level lightning_logs/, and WandbLogger / CustomMLFlowLogger use save_dir=run_dir so internal logger paths and code using self.logger.save_dir (e.g. plots) stay under the run root. Checkpoints remain outside W&B’s wandb/ subtree so large files are not synced by default.
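
The layout described above can be sketched with plain paths. This is a minimal illustration, not the PR's code: `run_layout` and its dictionary keys are hypothetical names, and the `wandb/` entry assumes that W&B creates its subtree directly under the `save_dir` it is given.

```python
from pathlib import Path


def run_layout(run_name: str, root: str = "runs") -> dict[str, Path]:
    """Return the paths implied by the description for one run.

    Hypothetical helper for illustration only.
    """
    run_dir = Path(root) / run_name
    return {
        # Common root: the value that would be handed to
        # Trainer(default_root_dir=...) and to each logger's save_dir.
        "run_dir": run_dir,
        # ModelCheckpoint target, a sibling of wandb/ rather than a child.
        "checkpoints": run_dir / "checkpoints",
        # Assumed location of W&B's own subtree under save_dir.
        "wandb": run_dir / "wandb",
    }


layout = run_layout("my-run")
# Checkpoints sit beside the wandb/ subtree, not inside it.
assert layout["wandb"] not in layout["checkpoints"].parents
```

Keeping `checkpoints/` as a sibling of `wandb/` is what lets large checkpoint files stay out of the default W&B sync while still sharing the run root.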

Motivation: Issue #293 and maintainer feedback (W&B selective sync, common run root, MLflow temp images not in CWD).
Dependencies: None

Issue Link

closes #293

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug
    • maintenance: when your contribution relates to repo maintenance, e.g. CI/CD or documentation

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone. If it is not, propose it for a future milestone.
  • author has added an entry to the changelog (and designated the change as added, changed, fixed or maintenance)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@joeloskarsson
Collaborator

@sadamov assigning you here to decide later if this should be closed in favor of #297 or what is the best path forward with this.

@sudhansu-24
Author

thanks @sadamov for clarifying.

I’ll continue implementation/revisions on #580 and coordinate here.
@Shyam-Sunder-saini @techaadii, if you have any pending changes or preferences from your earlier work that should be included, please share them and I will incorporate them into this PR so we can converge quickly.

@Shyam-Sunder-saini

Thanks @sudhansu-24 for taking this forward!

From my side, I’ve aligned all training artifacts so they are now scoped under runs/<run-name>/, including checkpoints, Lightning logs, W&B, and MLflow outputs.

I also updated the logger setup so that both WandbLogger and CustomMLFlowLogger use the same save_dir (run directory). Additionally, MLflow now falls back to the run directory if MLFLOW_TRACKING_URI is not set.
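
The fallback described above could look roughly like the sketch below. It is an assumption about the behaviour, not the actual change: `resolve_tracking_uri` is a hypothetical helper and `mlruns` is an assumed subdirectory name. Only the `MLFLOW_TRACKING_URI` environment variable comes from the comment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tracking_uri(
    run_dir: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Pick an MLflow tracking location (hypothetical helper).

    Uses MLFLOW_TRACKING_URI when it is set and non-empty, otherwise a
    local directory under the run root.
    """
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI")
    if uri:
        return uri
    # Fallback: keep MLflow output inside the run directory.
    # "mlruns" is an assumed name, not taken from the PR.
    return (Path(run_dir) / "mlruns").as_posix()
```

Passing the environment as an argument keeps the helper easy to test without modifying `os.environ`.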

Currently, only one logger is active at a time (default is W&B), and MLflow artifacts are generated when explicitly running with --logger mlflow.

If there are any preferences around structure or logging behavior from earlier work, I’m happy to incorporate them.

Let me know if you’d like me to push any additional changes to the PR.

@sudhansu-24
Author

sudhansu-24 commented Apr 8, 2026

Thanks @Shyam-Sunder-saini

could you share the exact changes you want added beyond the current #580 state (especially around the MLFLOW_TRACKING_URI fallback) either as:

a short checklist by file, or a commit/PR branch we can cherry-pick from?

if you post that i will incorporate it quickly so we can finalize review.

@sadamov added the bug label Apr 13, 2026
@sadamov added the duplicate label Apr 17, 2026
@sadamov self-requested a review April 17, 2026 19:30
@sudhansu-24
Author

Hi @sadamov, I noticed the duplicate label and I am a bit confused.

#580 was opened on Apr 4, and as per your coordination note on #293 I asked @Shyam-Sunder-saini and @techaadii to contribute their changes here so we could converge. I have now found that a separate PR, #586, was opened on Apr 9 by @Shyam-Sunder-saini, even though I had asked for any changes they wanted on top of my PR so I could incorporate them in #580 itself.

If you'd prefer to land via #586, let me know and I'll close this. Otherwise I'll wait for your review.

@sadamov added the enhancement label and removed the duplicate and bug labels Apr 23, 2026
@sadamov
Collaborator

sadamov commented Apr 23, 2026

@sudhansu-24 I fixed the labels, could you also fix the title? This is not a bug but an enhancement. You are now both assigned and we will continue with the implementation here in this PR.

@sudhansu-24 changed the title from "fix consolidate training run outputs into a single runs/ directory" to "enhance consolidate training run outputs into a single runs/ directory" Apr 23, 2026
@sudhansu-24
Author

@sadamov Thanks for the clarification!
I have updated the PR title and will continue on #580, addressing any remaining review feedback here.

Collaborator

@sadamov left a comment

Implementation is clean and minimal. Inline suggestions below. A few points can't anchor inline because the touched lines are outside the diff hunks, so noting them here:

  • README.md wasn't touched and still references the old layout (#293 scoped this). Line 407 says wandb/dryrun... (no longer produced); line 440 says checkpoints live in saved_models/ (now under runs/<run-name>/checkpoints/). Worth a small README pass in this PR.
  • setup_training_logger in neural_lam/utils.py gained a run_dir parameter but the docstring Parameters section wasn't updated. Please add a run_dir : str entry describing it as the directory under which all run artifacts are written.
  • tests/test_cli.py::test_wandb_logger_kwargs now passes run_dir but doesn't assert it reaches WandbLogger. One extra assert kwargs["save_dir"] == "runs/my-run" next to the existing kwargs assertions closes the loop.
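
The docstring entry and the extra assertion suggested above can be sketched together. This is a stand-in under stated assumptions: `make_wandb_logger`, its `logger_cls` and `project` parameters, and the test name are hypothetical and do not reflect the real `setup_training_logger` signature. Only `run_dir`, `save_dir`, and the asserted value come from the review.

```python
from unittest.mock import MagicMock


def make_wandb_logger(logger_cls, run_dir: str, project: str):
    """Stand-in for the logger setup (hypothetical, not neural_lam's code).

    Parameters
    ----------
    logger_cls : callable
        Logger class or factory to instantiate.
    run_dir : str
        Directory under which all run artifacts are written.
    project : str
        Project name forwarded to the logger.
    """
    return logger_cls(project=project, save_dir=run_dir)


def test_run_dir_reaches_logger():
    # A mock stands in for WandbLogger so no W&B session is started.
    fake_cls = MagicMock()
    make_wandb_logger(fake_cls, run_dir="runs/my-run", project="demo")
    # call_args.kwargs holds the keyword arguments of the last call.
    kwargs = fake_cls.call_args.kwargs
    assert kwargs["save_dir"] == "runs/my-run"


test_run_dir_reaches_logger()
```

Asserting on the captured keyword arguments checks the wiring from `run_dir` to `save_dir` without touching the filesystem or network.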

Comment thread .gitignore Outdated
Comment thread neural_lam/custom_loggers.py Outdated
Comment thread neural_lam/custom_loggers.py
Comment thread neural_lam/train_model.py Outdated
@sadamov
Collaborator

sadamov commented Apr 24, 2026

Follow-up on the README gap mentioned in the review: the SLURM example at lines 494-495 writes to lightning_logs/; we should also adjust that to the new run folder:

#SBATCH --output=runs/slurm_logs/neurallam_out_%j.log
#SBATCH --error=runs/slurm_logs/neurallam_err_%j.log
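
One practical note on these paths: as far as I know, SLURM does not create missing directories for `--output` and `--error`, so `runs/slurm_logs/` would need to exist before the job is submitted. A minimal sketch of a pre-submit step, where `ensure_slurm_log_dir` is a hypothetical helper and not part of the repository:

```python
from pathlib import Path


def ensure_slurm_log_dir(path: str = "runs/slurm_logs") -> Path:
    """Create the SLURM log directory if it is missing (hypothetical helper)."""
    log_dir = Path(path)
    # parents=True also creates runs/; exist_ok=True makes repeated calls safe.
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```

Calling `ensure_slurm_log_dir()` from the submission directory before `sbatch` would be enough. A plain `mkdir -p runs/slurm_logs` does the same job.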

@sudhansu-24
Author

sudhansu-24 commented Apr 25, 2026

thanks @sadamov

Addressed all four suggestions and the other three changes from the review.
Locally, tests/test_cli.py (7/7) and tests/test_training.py::test_training[dummydata] pass.

@sadamov self-requested a review April 27, 2026 07:56
@sadamov
Collaborator

sadamov commented Apr 27, 2026

@sudhansu-24 please fix the conflict then I can do the final review.

@sudhansu-24
Author

> @sudhansu-24 please fix the conflict then I can do the final review.

@sadamov done, ready for review.

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Consolidate Training/Evaluation Run Outputs into a Single runs/ Directory

4 participants