
First model configuration with inference run from inference artifact#10

Merged
khintz merged 27 commits into dmidk:main from leifdenby:feat/forecast-inference-dataset-creation
Nov 6, 2025

Conversation

@leifdenby (Contributor) commented May 14, 2025

This PR contains changes to implement the first model configuration that, from an inference artifact, is able to produce a gridded zarr forecast dataset. In detail, the modified `entry.sh` script, which serves as the container image entrypoint, does the following:

  1. creates an inference DANRA datastore config and neural-lam config from the configurations in the inference artifact
  2. calls mllam-data-prep to create an inference dataset with the inference datastore config
  3. calls neural-lam to produce an inference output dataset; this is the transformed data structure similar to the training datasets (i.e. stacked spatial coordinates, variables stacked along feature coordinates, etc.)
  4. calls mllam-data-prep to invert the inference dataset structure back to a gridded forecast zarr dataset with separate variables

Steps 3 and 4 require upstream changes to mllam-data-prep and neural-lam, which are detailed in the model configuration README. I suggest we merge this model configuration now; I can then work on getting these upstream changes in afterwards, and update the model configuration pyproject.toml file to point to the main branches of neural-lam and mllam-data-prep, rather than the development branches it currently points to:
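Step 4 (inverting the stacked inference output back to a gridded dataset) can be sketched in plain xarray on synthetic data. The dimension and coordinate names below mirror those in the execution log further down, but the sizes and feature set are made up for illustration; this is a conceptual sketch, not the actual mllam-data-prep implementation:

```python
import numpy as np
import xarray as xr

# Miniature stand-in for the stacked inference output of step 3: one array
# with a stacked spatial coordinate (grid_index) and variables stacked along
# a feature coordinate. Sizes and feature names are illustrative only.
nx, ny, nt = 3, 2, 4
features = ["t2m", "u10m"]
x = np.arange(nx, dtype=float)
y = np.arange(ny, dtype=float)

da = xr.DataArray(
    np.random.rand(nt, nx * ny, len(features)),
    dims=("elapsed_forecast_duration", "grid_index", "state_feature"),
    coords={
        "x": ("grid_index", np.repeat(x, ny)),
        "y": ("grid_index", np.tile(y, nx)),
        "state_feature": features,
    },
)

# Step 4 conceptually: rebuild the (x, y) grid from the stacked index and
# split the feature dimension into separate data variables.
ds_gridded = (
    da.set_index(grid_index=("x", "y"))
    .unstack("grid_index")
    .to_dataset(dim="state_feature")
)
print(list(ds_gridded.data_vars))  # ['t2m', 'u10m']
print(ds_gridded["t2m"].dims)      # ('elapsed_forecast_duration', 'x', 'y')
```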

```toml
# from pyproject.toml
mllam-data-prep = { git = "https://github.com/leifdenby/mllam-data-prep", rev = "feat/inference-cli-args" }
neural-lam = { git = "https://github.com/leifdenby/neural-lam", rev = "dev/first-inference-image" }
```
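Once the upstream changes land, the follow-up would presumably drop the `rev` pins and point at the upstream repositories, along these lines (hypothetical; the exact form depends on how the changes are merged):

```toml
# from pyproject.toml, after upstreaming (hypothetical)
mllam-data-prep = { git = "https://github.com/mllam/mllam-data-prep" }
neural-lam = { git = "https://github.com/mllam/neural-lam" }
```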
[Screenshot 2025-09-26 at 09:20:28]
Full execution output
 ✝  mlwm-deployment/configurations/surface-dummy-model_DINI   feat/forecast-inference-dataset-creation±  ./entry.sh
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
2025-09-26 09:14:25.912 | DEBUG    | __main__:_prepare_inference_dataset_zarr:197 - Opened stats dataset:  Size: 416B
Dimensions:                        (state_feature: 5, static_feature: 2)
Coordinates:
    static_feature_source_dataset  (static_feature) object 16B ...
  * state_feature                  (state_feature) object 40B 'pres_seasurfac...
  * static_feature                 (static_feature) object 16B 'lsm' 'orography'
    static_feature_long_name       (static_feature) object 16B ...
    state_feature_long_name        (state_feature) object 40B ...
    state_feature_source_dataset   (state_feature) object 40B ...
    static_feature_units           (static_feature) object 16B ...
    state_feature_units            (state_feature) object 40B ...
Data variables:
    state__train__diff_std         (state_feature) float64 40B ...
    state__train__diff_mean        (state_feature) float64 40B ...
    static__train__std             (static_feature) float64 16B ...
    state__train__mean             (state_feature) float64 40B ...
    state__train__std              (state_feature) float64 40B ...
    static__train__mean            (static_feature) float64 16B ...
Attributes:
    schema_version:   v0.5.0
    dataset_version:  v0.1.0
    created_on:       2025-05-15T16:58:37
    created_with:     mllam-data-prep (https://github.com/mllam/mllam-data-prep)
    mdp_version:      v0.6.0
    creation_config:  dataset-version: v0.1.0\nextra:\n  projection:\n    cla...
2025-09-26 09:14:25.912 | DEBUG    | __main__:_prepare_inference_dataset_zarr:199 - Loading training datastore config from inference_artifact/configs/danra.datastore.yaml
2025-09-26 09:14:25.916 | INFO     | __main__:_create_inference_datastore_config:100 - Overwriting input path for danra_surface with https://object-store.os-api.cci1.ecmwf.int/danra/v0.6.0dev1/single_levels.zarr/ previously https://object-store.os-api.cci1.ecmwf.int/mllam-testdata/danra_cropped/v0.2.0/single_levels.zarr
2025-09-26 09:14:25.916 | INFO     | __main__:_create_inference_datastore_config:100 - Overwriting input path for danra_static with https://object-store.os-api.cci1.ecmwf.int/danra/v0.5.0/single_levels.zarr/ previously https://object-store.os-api.cci1.ecmwf.int/mllam-testdata/danra_cropped/v0.2.0/single_levels.zarr
2025-09-26 09:14:25.917 | INFO     | __main__:_create_inference_datastore_config:149 - Replaced time dimension with ['analysis_time', 'elapsed_forecast_duration'] for state
2025-09-26 09:14:25.917 | INFO     | __main__:_create_inference_datastore_config:149 - Replaced time dimension with ['analysis_time', 'elapsed_forecast_duration'] for forcing
2025-09-26 09:14:25.917 | INFO     | mllam_data_prep.create_dataset:create_dataset:169 - Loading dataset danra_surface from https://object-store.os-api.cci1.ecmwf.int/danra/v0.6.0dev1/single_levels.zarr/
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/ops/loading.py:21: FutureWarning: In a future version, xarray will not decode the variable 'elapsed_forecast_duration' into a timedelta64 dtype based on the presence of a timedelta-like 'units' attribute by default. Instead it will rely on the presence of a timedelta64 'dtype' attribute, which is now xarray's default way of encoding timedelta64 values.
To continue decoding into a timedelta64 dtype, either set `decode_timedelta=True` when opening this dataset, or add the attribute `dtype='timedelta64[ns]'` to this variable on disk.
To opt-in to future behavior, set `decode_timedelta=False`.
  ds = xr.open_zarr(fp)
2025-09-26 09:14:26.818 | INFO     | mllam_data_prep.create_dataset:create_dataset:183 - Extracting selected variables from dataset danra_surface
2025-09-26 09:14:26.823 | INFO     | mllam_data_prep.create_dataset:create_dataset:229 - Mapping dimensions and variables for dataset danra_surface to state
2025-09-26 09:14:28.346 | INFO     | mllam_data_prep.create_dataset:create_dataset:169 - Loading dataset danra_static from https://object-store.os-api.cci1.ecmwf.int/danra/v0.5.0/single_levels.zarr/
2025-09-26 09:14:28.884 | INFO     | mllam_data_prep.create_dataset:create_dataset:183 - Extracting selected variables from dataset danra_static
2025-09-26 09:14:28.885 | INFO     | mllam_data_prep.create_dataset:create_dataset:229 - Mapping dimensions and variables for dataset danra_static to static
2025-09-26 09:14:29.580 | INFO     | mllam_data_prep.create_dataset:_merge_dataarrays_by_target:76 - Merging dataarrays for target variable `state`
2025-09-26 09:14:29.588 | INFO     | mllam_data_prep.create_dataset:_merge_dataarrays_by_target:76 - Merging dataarrays for target variable `static`
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:109: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge(dataarrays, join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:109: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge(dataarrays, join="exact")
2025-09-26 09:14:31.075 | INFO     | mllam_data_prep.create_dataset:create_dataset:262 - Chunking dataset with {'analysis_time': 1}
2025-09-26 09:14:31.229 | INFO     | mllam_data_prep.create_dataset:create_dataset:270 - Setting splitting information to define `['train', 'val', 'test']` splits along dimension `time`
2025-09-26 09:14:31.238 | INFO     | mllam_data_prep.create_dataset:create_dataset:305 - Adding pre-computed statistics to dataset
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:307: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge([ds, ds_stats], join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:307: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge([ds, ds_stats], join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:307: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge([ds, ds_stats], join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:307: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge([ds, ds_stats], join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:307: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge([ds, ds_stats], join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/create_dataset.py:307: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge([ds, ds_stats], join="exact")
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/zarr/core/dtype/npy/string.py:248: UnstableSpecificationWarning: The data type (FixedLengthUTF32(length=9, endianness='little')) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/zarr/core/dtype/npy/string.py:248: UnstableSpecificationWarning: The data type (FixedLengthUTF32(length=5, endianness='little')) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/zarr/core/dtype/npy/string.py:248: UnstableSpecificationWarning: The data type (FixedLengthUTF32(length=15, endianness='little')) does not have a Zarr V3 specification. That means that the representation of arrays saved with this data type may change without warning in a future version of Zarr Python. Arrays stored with this data type may be unreadable by other Zarr libraries. Use this data type at your own risk! Check https://github.com/zarr-developers/zarr-extensions/tree/main/data-types for the status of data type specifications for Zarr V3.
  v3_unstable_dtype_warning(self)
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/zarr/api/asynchronous.py:233: ZarrUserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
2025-09-26 09:14:37.051 | INFO     | __main__:_prepare_inference_dataset_zarr:225 - Saved inference dataset to inference_workdir/danra.datastore.zarr
2025-09-26 09:14:37.081 | INFO     | __main__:_create_inference_config:246 - Saved inference config to inference_workdir/config.yaml
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
The loaded datastore contains the following features:
 state   : pres_seasurface r2m t2m u10m v10m
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/neural_lam/datastore/mdp.py:214: UserWarning: no forcing data found in datastore
  warnings.warn("no forcing data found in datastore")
 static  : lsm orography
With the following splits (over time):
 train   : 1549281600000000000 to 1549281600000000000
 val     : 1549281600000000000 to 1549281600000000000
 test    : 1549281600000000000 to 1549303200000000000
Writing graph components to inference_workdir/graph/multiscale
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch_geometric/utils/convert.py:249: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_new.cpp:256.)
  data[key] = torch.tensor(value)
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Seed set to 42
The loaded datastore contains the following features:
 state   : pres_seasurface r2m t2m u10m v10m
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/neural_lam/datastore/mdp.py:214: UserWarning: no forcing data found in datastore
  warnings.warn("no forcing data found in datastore")
 static  : lsm orography
With the following splits (over time):
 train   : 1549281600000000000 to 1549281600000000000
 val     : 1549281600000000000 to 1549281600000000000
 test    : 1549281600000000000 to 1549303200000000000
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/neural_lam/datastore/mdp.py:214: UserWarning: no forcing data found in datastore
  warnings.warn("no forcing data found in datastore")
Loaded graph with 523770 nodes (464721 grid, 59049 mesh)
Edges in subgraphs: m2m=527096, g2m=875571, m2g=1858884
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
wandb: Currently logged in as: leifdenby (mllam) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.22.0
wandb: Run data is saved locally in ./wandb/run-20250926_091515-z1cwfia2
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run eval-test-graph_lam-4x2-09_26_09-6510
wandb: ⭐️ View project at https://wandb.ai/mllam/neural_lam
wandb: 🚀 View run at https://wandb.ai/mllam/neural_lam/runs/z1cwfia2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W926 09:15:16.635975000 ProcessGroupGloo.cpp:545] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/neural_lam/datastore/mdp.py:294: UserWarning: no forcing data found in datastore
  warnings.warn("no forcing data found in datastore")
Restoring states from the checkpoint path at ./inference_artifact/checkpoint.pkl
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:445: The dirpath has changed from '/Users/B280936/git-repos/mllam/neural-lam/saved_models/train-graph_lam-4x2-05_15_17-2301' to '/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/saved_models/eval-test-graph_lam-4x2-09_26_09-6510', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
Loaded model weights from the checkpoint at ./inference_artifact/checkpoint.pkl
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Testing DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]
…:128: RuntimeWarning: 'mllam_data_prep.recreate_inputs' found in sys.modules after import of package 'mllam_data_prep', but prior to execution of 'mllam_data_prep.recreate_inputs'; this may result in unpredictable behaviour
2025-09-26 09:15:43.393 | WARNING  | __main__:recreate_inputs:127 - Target output variable static for input dataset danra_static not found in dataset, skipping
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/mllam_data_prep/recreate_inputs.py:85: FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlapping variables with the same name. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set compat explicitly.
  ds = xr.merge(dataarrays, join="exact")
2025-09-26 09:15:43.441 | INFO     | __main__:main:321 - Saving input dataset danra_surface to ./inference_workdir/outputs/danra_surface.zarr with chunks={}
/Users/B280936/git-repos/mllam/mlwm-deployment/configurations/surface-dummy-model_DINI/.venv/lib/python3.11/site-packages/zarr/api/asynchronous.py:233: ZarrUserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
Renaming ./inference_workdir/outputs/danra_surface.zarr to ./inference_workdir/outputs/single_levels.zarr
 ✝  mlwm-deployment/configurations/surface-dummy-model_DINI   feat/forecast-inference-dataset-creation±  uvx zarrdump ./inference_workdir/outputs/single_levels.zarr
Installed 17 packages in 153ms
 Size: 56MB
Dimensions:                    (elapsed_forecast_duration: 6, x: 789, y: 589)
Coordinates:
    analysis_time              datetime64[ns] 8B ...
  * elapsed_forecast_duration  (elapsed_forecast_duration) timedelta64[ns] 48B ...
    time                       (elapsed_forecast_duration) datetime64[ns] 48B ...
  * x                          (x) float64 6kB -1.999e+06 ... -2.925e+04
  * y                          (y) float64 5kB -6.095e+05 ... 8.605e+05
Data variables:
    pres_seasurface            (elapsed_forecast_duration, x, y) float32 11MB ...
    r2m                        (elapsed_forecast_duration, x, y) float32 11MB ...
    t2m                        (elapsed_forecast_duration, x, y) float32 11MB ...
    u10m                       (elapsed_forecast_duration, x, y) float32 11MB ...
    v10m                       (elapsed_forecast_duration, x, y) float32 11MB ...
Attributes:
    recreated_from:       ./inference_workdir/outputs/inference_output.zarr
    recreation_config:    dataset-version: v0.1.0\nextra:\n  projection:\n   ...
    source_dataset_name:  danra_surface
    created_by:           mllam_data_prep.recreate_inputs
    created_on:           2025-09-26T07:15:43.441570+00:00
    mdp-version:          0.6.1
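The train/val/test split boundaries in the output above are printed as raw nanosecond epoch integers. A quick numpy conversion (values copied from the log) makes them readable and shows the test split covers a 6-hour window:

```python
import numpy as np

# Split boundaries as printed in the log above (nanoseconds since epoch)
t_start_ns = 1549281600000000000
t_end_ns = 1549303200000000000

t_start = np.datetime64(t_start_ns, "ns")
t_end = np.datetime64(t_end_ns, "ns")
print(t_start)  # 2019-02-04T12:00:00.000000000
print(t_end)    # 2019-02-04T18:00:00.000000000
print((t_end - t_start) / np.timedelta64(1, "h"))  # 6.0
```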

@leifdenby leifdenby changed the title cli for building forecast inference datasets First inference run from inference artifact Sep 26, 2025
@leifdenby leifdenby changed the title First inference run from inference artifact First model configuration with inference run from inference artifact Sep 26, 2025
@leifdenby leifdenby marked this pull request as ready for review September 26, 2025 07:22
@leifdenby leifdenby requested a review from khintz September 26, 2025 07:22
@khintz (Contributor) commented Sep 26, 2025

> Steps 3. and 4. require upstream changes to mllam-data-prep and neural-lam, which are detailed in the model configuration README. I suggest we merge this model configuration now and then I can work on getting these upstream changes in after, and then update the model configuration pyproject.toml file to point to main branches of neural-lam and mllam-data-prep, rather than the development branches that it currently points to

Agree. Will review with this in mind.

@khintz (Contributor) left a review

Minor things and some questions.
Great work!

```dockerfile
COPY pyproject.toml .
COPY *.yaml ./
COPY entry.sh ./
COPY src/ ./src
```
@khintz (Contributor):
Do you expect we will have more under src/ in the longer term? Just wondering if it would give a better overview to just have create_inference_dataset.py in the root of each configuration?

@leifdenby (Author):

We could do that, yes; maybe that is better. It was just that using src/ follows the convention that Python scripts for a package (here called surface-dummy-model_DINI) should reside in a subdirectory, named src/ by default.

Comment on lines +3 to +12
The model configuration in this directory is a dummy model that was trained on
surface variables from DANRA, only 10 days of data and only trained 10
epochs. It is intended only as a demonstration of the inference pipeline and is
expected to give very poor results.

## Upstream package change requirements

Relative to the `main` branch on both github.com/mllam/mllam-data-prep and
github.com/mllam/neural-lam, a number of pieces of functionality are currently
required to run this configuration:
@khintz (Contributor):

In general, is there a reason for the line breaks here?

@leifdenby (Author):

I just find it easier to read; I can remove them if you prefer :) I think this might be a pre-commit markdown default too, but I'm not sure...

Comment thread: configurations/surface-dummy-model_DINI/entry.sh (Outdated)
# b) include the statistics from the training dataset and
# c) set the dimensions in the configuration to have `analysis_time` and
# `elapsed_forecast_duration` instead of just `time`.
uv run python src/create_inference_dataset.py
@khintz (Contributor):

I liked that it was easy to see the command needed to create the inference data, but I am also sure you have a good reason for creating the Python script instead. Is it to handle configs and chunking?

@leifdenby (Author) commented Nov 5, 2025

Yes, there are quite a few steps to this, so I thought it was best to have it in an isolated Python script. I added env vars to make it clear what this script depends on in leifdenby@5f56aef#diff-e44c8d78f75e8f0f19d6f563bb5cfa93328cd98925a52573ed2706e0a370e8a7R10-R86 and leifdenby@62bd766#diff-e44c8d78f75e8f0f19d6f563bb5cfa93328cd98925a52573ed2706e0a370e8a7R89

```python
danra_surface=f"{S3_BUCKET_URL}/v0.6.0dev1/single_levels.zarr/",
danra_static=f"{S3_BUCKET_URL}/v0.5.0/single_levels.zarr/",
)
ANALYSIS_TIME = "2019-02-04T12:00"
```
@khintz (Contributor):

This will work for the POC, but we should think about adding this as an argument in "non-POCs". Maybe add a comment.

@leifdenby (Author):

Yes, I made this an argument that can be changed at runtime in leifdenby@5f56aef
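A minimal sketch of making the analysis time overridable at runtime; the default is taken from the snippet above, but using an `ANALYSIS_TIME` environment variable as the override mechanism is an assumption for illustration (the actual change is in the linked commit):

```python
import os

# Default copied from the reviewed snippet; the ANALYSIS_TIME env var as the
# runtime override mechanism is an illustrative assumption.
ANALYSIS_TIME = os.environ.get("ANALYSIS_TIME", "2019-02-04T12:00")
print(ANALYSIS_TIME)
```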

@leifdenby (Author) commented:

Thanks for your review @khintz! After disabling W&B I have also been able to build an image and run inference with the DANRA forecast data I put on EWC, on super-juice, inside a container 🥳

I'm working on some further improvements to this PR to address your comments, and then I will reply to them.

@leifdenby (Author) commented Sep 30, 2025

I'm getting there, but here are some further things I need to fix:

  • the projection info needs to be updated from the training datastore config. We shouldn't really have hardcoded the projection info in the datastore config; rather, it should be read from the source data, but that is what we have for now
  • there is a bug in neural-lam doing ds.sel(time=ds.splits.t_start.item()) where t_start.item() becomes an int if it is a np.datetime64[ns] rather than a np.datetime64[s]
  • there is no python3.11 manylinux wheel for torch on ohm.dmi.dk, and so because zarr requires python>=3.11 we have to use zarr2
  • write dev notes for ohm (on using venv in squashed with uv, and .env)
  • fix warning in mllam-data-prep feature for merging stats with the training dataset when long names differ
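The t_start.item() bug in the second bullet can be reproduced in isolation; this is standard numpy behaviour, since datetime.datetime cannot represent nanosecond precision and datetime64[ns] scalars therefore fall back to a raw integer:

```python
import datetime
import numpy as np

t_s = np.datetime64("2019-02-04T12:00", "s")
t_ns = np.datetime64("2019-02-04T12:00", "ns")

# .item() on second precision round-trips to a datetime.datetime ...
assert isinstance(t_s.item(), datetime.datetime)
# ... but on nanosecond precision it degrades to a plain int, which then
# silently changes the meaning of ds.sel(time=...)
assert isinstance(t_ns.item(), int)
print(t_ns.item())  # 1549281600000000000
```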

- expose container build program with env var so we can use docker on
  DGX Spark
- use nvidia docker registry for ARM nvidia image on ARM platforms
- install with pip directly into the system python site-packages (with
  overwrite) while setting a torch version constraint (otherwise the
  dependency install overwrites the pytorch install)
- optionally use pip directly inside container (rather than via uv) to
  ensure we use system python without venv
@khintz (Contributor) commented Nov 6, 2025

With Spark, should we ignore the issues on ohm for now?

@khintz (Contributor) commented Nov 6, 2025

As agreed, we merge now and continue from main.

@khintz khintz merged commit c43d522 into dmidk:main Nov 6, 2025
4 checks passed
