Docs/hello world danra notebook (#69) #202
Conversation
@sadamov @joeloskarsson
Thanks @Jayant-kernel for your work on this. I am having a look at the notebook right now. Could you use the original PR template in the meantime, please? The current one seems to be missing some bullet points.
This is certainly going in the right direction, thanks again. What I really like is that the full notebook takes ~5 mins on CPU. But the data fetching and the model training fail with the current default parameters. Just wanted to emphasize again that in our org it's fine to be slower but more accurate, and that PRs are tested by human hands before submission.
Three general points:
- have you already considered how you would keep this notebook up to date? So that for each release of neural-lam or mllam-data-prep the code cells still work (e.g. integrate notebook execution into a test)?
  a. one such change will be the introduction of weather-model-graphs for graph creation: #184
- it would be great if you could visualise the input data, the graph, and the model predictions, similar to this notebook here: https://github.com/joeloskarsson/neural-lam-dev/blob/research/docs/reproduce_paper_sample.md
- please fix the pre-commit errors
Here are some remarks from my initial walkthrough:
- Before the cell that runs "import os", advise the user to choose a pre-installed Python version, according to the pyproject.toml file, that provides ipykernel
- Since the notebook is in ./docs/notebooks, you first have to cd into the repo root (ROOT); otherwise the cells will not run from ROOT
- convert uv install into a code cell
- The mllam-data-prep CLI expects the config as a positional argument, not --config
- for the model training/eval you need to set --val_steps_to_log 1, because the default setting would validate ar_steps > 1, which we are not training on. You also need to set --num_workers to a higher value like 8 instead of 0, because by default neural-lam uses persistent workers and 0 is not allowed.
- you can also link this paper, which actually describes the research with DANRA: https://arxiv.org/abs/2504.09340
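A training invocation with those flags might look like this (a sketch only: the module path and config argument are assumptions based on this thread, not verified defaults):

```shell
# Hypothetical training command; --val_steps_to_log and --num_workers are the
# flags discussed above, everything else is a placeholder.
python -m neural_lam.train_model \
    --config_path config.yaml \
    --val_steps_to_log 1 \
    --num_workers 8
```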
Now about the next steps: this issue is scheduled for version v0.7.0: https://github.com/mllam/neural-lam/milestone/12. That means we can merge it only after v0.6.0 is released. We discuss these milestones every month in the dev meeting, and they can be adjusted.
The work on this could be part of a GSoC proposal for the project about structured docs.
@sadamov Changes made.
I also added a note about future maintainability (notebook execution test in CI) and the upcoming weather-model-graphs change. Please take another look when you have time, happy to make more changes.
Nice work @Jayant-kernel! One small suggestion: it might be helpful to add a note in the notebook about the minimum required version of
…edictions, version note, fix graph viz (mllam#202)
Hey @sadamov, another round of fixes: the checkpoint cell now picks the most recently modified .ckpt (not an arbitrary first one).
I just ran this notebook on my 7 year-old laptop and it works perfectly! 🤯 this is great!
Since it is a bit tricky to review notebooks, I pushed some smaller changes directly into your branch. Please have a look. We should keep the notebook rendered, so that other users can refer to the expected output.
there are currently two major open points that I would like to see addressed:
- The plotting
  a. the input data should be a 2D horizontal plot of at least one of the 4 state features (atmospheric variables u100m, v100m, r2m, t2m) and not a 1D line plot
  b. the graph should be plotted and visualised with neural_lam/plot_graph.py
  c. the prediction plots fail to be displayed (and saved to disk?) please double-check the paths
- CI/CD integration: in some form or another we should make sure that this notebook stays up to date with the codebase. Unlike tests, you are allowed to break the notebook cells, but we need to notice when they are broken and fix them in each future PR. Maybe it could become a new entry in the PR template? Not sure, open to your suggestions.
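A minimal sketch of the 2D plot requested in point a, using synthetic data in place of the zarr datastore (the variable name and grid are illustrative only, not the notebook's actual arrays):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the cell also runs in CI
import matplotlib.pyplot as plt

# Synthetic stand-in for one state feature (here t2m) on an x/y grid; in the
# notebook these arrays would come from the zarr datastore instead.
x = np.linspace(0.0, 1.0, 50)
y = np.linspace(0.0, 1.0, 40)
rng = np.random.default_rng(0)
t2m = rng.normal(280.0, 5.0, size=(y.size, x.size))

fig, ax = plt.subplots()
mesh = ax.pcolormesh(x, y, t2m, shading="auto")
fig.colorbar(mesh, ax=ax, label="t2m [K]")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("2D horizontal plot of one state feature")
fig.savefig("t2m_map.png")
```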
Thanks for the review @sadamov! All requested changes are pushed. Formatting: updated the PR template & fixed the pre-commit issues.
@Jayant-kernel @sadamov I have been following the PR, it looks great! The checkpoint-loading error happens because PyTorch 2.6 changed the default of torch.load's weights_only argument to True. Though the cleaner fix is probably in train_model.py's checkpoint loading code rather than the notebook itself; worth checking if there's already an issue open for this.
@Mani212005 were you actually able to run the notebook? For me the file was corrupted and I couldn't even open it. I pushed a quick fix to remove the artifacts that caused the not-opening issue. Now, however, I see a lot of other characters that were introduced in one of the recent commits, e.g.
@sadamov I diffed the notebook history locally. What I can confirm from the file itself is:
Reverting back to e1bb57a is on the safer side. And afaik the PyTorch 2.6 checkpoint-loading error is a separate issue.
@sadamov I saw the notebook source directly from the GitHub diff, not by running it locally. The PyTorch error I flagged was visible in the captured cell outputs embedded in the .ipynb JSON from a previous run.
Force-pushed badfe4b to 44196b2:
- Revert to e1bb57a baseline to remove LLM-introduced encoding artifacts
- Replace 1D line plot in cell 12 with proper 2D pcolormesh map of state features using x/y grid coordinates from the zarr datastore
- Add PyTorch 2.6 fix cell before eval: registers argparse.Namespace as a safe global so torch.load does not fail with weights_only=True
- Clear stale error outputs from eval and prediction display cells
Force-pushed 44196b2 to 42ad106.
@sadamov please re-review
sadamov left a comment
The notebook is working again. But previous comments have still not been addressed. For example, the plot of the graph object is missing and the plot of the prediction fails. Please also address the comment about torch > 2.6 from above.
Once you are sure you have addressed all open remarks, please update the changelog.
Hi, I would like to work on this issue. Could you please assign it to me?
@yukthagangadhari5 @Jayant-kernel is already working on this. But maybe he would appreciate some help on the visualizations (input data, graphs, predictions). Best to ask them directly here.
Thanks for the suggestion! I'd be happy to help with the visualizations. @Jayant-kernel would you like help adding plots for the input data, graphs, or model predictions in the notebook? Let me know what would be most useful.
Hi @Jayant-kernel, I am interested in picking up this issue as part of the documentation generation project. I have the required skills in Python and Shell. Before I get started, are there any specific documentation frameworks (like Sphinx or pdoc) the team prefers, or should I start by doing some research and proposing a solution? I would love to be assigned to this!
… notebook
- Add cell that runs neural_lam.plot_graph (--datastore_config_path, --graph 1level, --save graph_viz.html) to produce an interactive Plotly HTML visualisation of the message-passing graph (addresses reviewer request in mllam#202)
- Add cell that displays the saved graph_viz.html inline via IFrame
- Replace broken PNG-glob prediction display with a cell that loads the example_pred_*.pt / example_target_*.pt tensors saved by the wandb logger and renders a side-by-side Ground Truth vs Prediction pcolormesh plot inline
- Add CHANGELOG entry under [unreleased] / Added for PR mllam#202

Fixes remaining open points from sadamov's review on PR mllam#202.
Hi @sadamov, addressed all remaining open points from your last review:
- Graph visualisation: added two cells after the graph-creation step; the first runs neural_lam.plot_graph with --datastore_config_path, --graph 1level, and --save graph_viz.html
- Prediction display: loads the example_pred_*.pt / example_target_*.pt tensors saved by the wandb logger under wandb/
- CHANGELOG: entry added under [unreleased]
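For reference, the described plot_graph invocation would look something like this (the datastore config filename is a placeholder, not a verified path):

```shell
# Render the 1level message-passing graph to an interactive Plotly HTML file;
# danra.datastore.yaml stands in for the notebook's actual datastore config.
python -m neural_lam.plot_graph \
    --datastore_config_path danra.datastore.yaml \
    --graph 1level \
    --save graph_viz.html
```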
sadamov left a comment
Thanks for adding these plots, here are some issues I had when running the latest notebook additions:
- cell 13: could we retrieve the units directly from the xarray dataset here, because the coords do have units attrs. That would make it even more instructive. (e.g. x_units = ds.coords["x"].attrs.get("units", ""))
- cell 19: for me the graph html does not render in vscode (maybe related to security settings), could we use this instead: with open("graph_viz.html") as f: display(HTML(f.read()))
- cell 26: the eval fails because of the way torch loads checkpoints, e.g. with torch=2.10.0 (_pickle.UnpicklingError: Weights only load failed. This file can still be loaded; to do so you have two options, do those steps only if you trust the source of the checkpoint.). This issue will soon be addressed in #240, but until then we need another way to make it work here in this notebook. I can see the issue: cell 24 already registers the safe globals for the notebook kernel, but the eval command in cell 25 runs as a subprocess (!{sys.executable} -m ...), so the fix doesn't carry over. You can use a sitecustomize.py, for example.
- cell 27: is currently broken even with the cell 26 fix, failing with RecursionError: maximum recursion depth exceeded. But instead of fixing it I propose to stay closer to https://github.com/joeloskarsson/neural-lam-dev/blob/research/docs/reproduce_paper_sample.md for the final eval and visualizations. So only generate one example_pred, save everything as zarr, visualize the test_rmse.pdf directly from the output, and also visualize the predictions based on plots generated during eval. Then there is no need to do the manual .pt loading and such. Let me know if you think some approaches in the notebook linked here could be improved, very open to suggestions.
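The sitecustomize.py route mentioned for cell 26 could be sketched as a single notebook cell like this (the file location and environment-variable handling are assumptions, not the project's actual fix):

```python
import os
import pathlib

# Python's site module auto-imports sitecustomize from any sys.path entry, so
# writing this file and putting its directory on PYTHONPATH applies the
# safe-globals registration to subprocesses started with !{sys.executable}.
fix = """\
import argparse
try:
    import torch
    torch.serialization.add_safe_globals([argparse.Namespace])
except ImportError:
    pass
"""
pathlib.Path("sitecustomize.py").write_text(fix)
os.environ["PYTHONPATH"] = (
    os.getcwd() + os.pathsep + os.environ.get("PYTHONPATH", "")
)
```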
and two more general remarks:
- cell 8: when we use !{sys.executable} the first time, could you add a comment saying something along the lines of: "# sys.executable ensures we use the same Python as this notebook's kernel"
- for the future maintainability we were talking about, I see diverging approaches. What do you think is the best way forward?
  - automate pre-commits and github workflows to nbmake the full notebook and ensure its compatibility
  - add a note to the PR template that tells authors to check if their changes break the notebooks
  - do nothing and fix the notebooks when broken. The argument here is that breaking changes to the core functionalities of the repo are rare.
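The first maintainability option could be wired up with the nbmake pytest plugin; a sketch (the notebook path is assumed):

```shell
# nbmake executes notebooks under pytest, failing the run if any cell errors.
pip install nbmake
pytest --nbmake docs/notebooks/hello_world_danra.ipynb
```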
@Jayant-kernel doing some housekeeping: there are many people interested in contributing to this PR. Are you still working on it, or is it okay if somebody else takes over?
@sadamov I can help integrate this into a new PR if @Jayant-kernel is not active.
Superseded by #577
Describe your changes
This PR adds a brief onboarding Jupyter notebook (hello_world_danra.ipynb) to address #69.
It uses the existing DANRA 100m winds example configuration to walk new users through:
- mllam-data-prep preprocessing
- (1level) graph generation

This is a draft to get feedback on the notebook content, defaults (e.g., 1level vs hierarchical), and location before finalizing.

Issue Link
Addresses #69
Type of change
Checklist before requesting a review
Checklist for reviewers
Author checklist after completed review
Checklist for assignee