7 changes: 7 additions & 0 deletions .gitignore
@@ -4,3 +4,10 @@ __pycache__/
build/
dist/
.venv/
.ipynb_checkpoints/
src/mlcast/modules/.ipynb_checkpoints/
src/mlcast/models/ldcast/context/.ipynb_checkpoints/
src/mlcast/models/ldcast/diffusion/.ipynb_checkpoints/
src/mlcast/models/ldcast/.ipynb_checkpoints/
src/mlcast/models/ldcast/autoenc/.ipynb_checkpoints/
src/mlcast/models/ldcast/blocks/.ipynb_checkpoints/
89 changes: 24 additions & 65 deletions README.md
@@ -1,83 +1,42 @@
# mlcast
# MLCast implementation of LDCast

<!-- SPDX-License-Identifier: Apache-2.0 OR BSD-3-Clause -->
See the main branch https://github.com/mlcast-community/mlcast for context.

The MLCast Community is a collaborative effort bringing together meteorological services, research institutions, and academia across Europe to develop a unified Python package for AI-based nowcasting. This is an initiative of the E-AI WG6 (Nowcasting) of EUMETNET.
## Code structure

This repo contains the `mlcast` package for machine learning-based weather nowcasting.
There is one main `LDCast` class, subclassing the `NowcastingModelBase` class. There are three main nets in LDCast:
- the autoencoder
- the conditioner
- the denoiser

## Project Status
Each group of nets that must be trained at once gets its own subclass of `NowcastingLightningModule`. This gives two subclasses here (see the sketch below):
- the autoencoder (encoder + decoder) is trained on its own, so there is one subclass of `NowcastingLightningModule` called `Autoencoder`
- the conditioner and the denoiser are trained together, so they are combined into one neural network (the `LatentDiffusionNet` class), whose training is handled by the `LatentDiffusion` subclass of `NowcastingLightningModule`
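
A minimal sketch of how these pieces compose, following `docs/ldcast.md` (import paths are taken from the docs and may change while the package is under development):

```python
import torch
import pytorch_lightning as pl  # torch and pl are referenced by the `as_class` entries in config.yaml
from omegaconf import OmegaConf

from mlcast.models.ldcast.ldcast import LDCast

# The config uses a custom resolver; see docs/ldcast.md for details.
OmegaConf.register_new_resolver("as_class", lambda class_name: eval(class_name))
config = OmegaConf.load("config.yaml")

# LDCast (a NowcastingModelBase) composes the Autoencoder, the
# LatentDiffusion module (conditioner + denoiser) and a sampler.
ldcast = LDCast.from_config(config)
```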

⚠️ **Under Development** - This package is currently in early development stages and not usable by end users. The API and functionality are subject to change.
## Documentation

## Installation
```bash
# Install from pypi
pip install mlcast
```
See the `docs` folder for documentation on the main `LDCast` class, the autoencoder, and the latent diffusion part.

or
```bash
# Install from source
git clone https://github.com/mlcast-community/mlcast
cd mlcast
uv pip install -e .
## TO DO

# For development
uv pip install -e ".[dev]"
```
Reorganize the `LatentDiffusion` class? At the moment, `LatentDiffusionNet.forward` is never called during inference, because the inference process is quite different from training (see `docs/ldm.md`). It might be a bit clearer to implement explicitly distinct training and inference step methods in the `LatentDiffusion` class (that said, `AutoencoderKLNet.forward` is never called during inference either).

## Project Structure
The `timesteps` variable sometimes refers to the timesteps of the diffusion process (= 1000 during training) and sometimes to the nowcasting timesteps (where each timestep is 5 minutes). It would be better to use different names.

```
mlcast/
├── src/mlcast/ # Main package source code
│ ├── __init__.py # Package initialization and version
│ ├── data/ # Data loading and preprocessing
│ │ ├── zarr_datamodule.py # PyTorch Lightning data module for Zarr
│ │ └── zarr_dataset.py # PyTorch dataset for Zarr arrays
│ ├── models/ # Lightning model implementations
│ │ └── base.py # Abstract base classes for nowcasting models
│ └── modules/ # Pure PyTorch neural network modules
│ └── convgru_modules.py # ConvGRU encoder-decoder modules
├── examples/ # Example scripts and notebooks
│ └── scripts/
│ └── simple_train.py # Basic training example
├── pyproject.toml # Project metadata and dependencies
├── LICENSE # Apache 2.0 license
└── README.md # This file
```
We might integrate this code within the Hugging Face Diffusers library.

## Development
What mainly remains is to write the code of the main `LDCast` class (in `ldcast.py`).

This project uses `uv` for dependency management. To set up the development environment:
It would be nice to rewrite the PLMS sampler, as it is a little messy.

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
Implement parametrizations other than `'eps'`.

# Install dependencies
uv sync
Use `ZarrDataModule` and `ZarrDataset`!

# Run pre-commit hooks
uv run pre-commit install
```
Add the computation of the EMA loss during LDM training, and change the `LDCast.predict` method so that the EMA weights are automatically used during inference.

## Contributing
Add the input and output shapes of the nets in the code (and in the docs).

Please feel free to raise issues or PRs if you have any suggestions or questions.
Understand which parameters can be changed freely, and which have to be adapted when others change.

## Links to presentations for discussion about the API

- [2025/02/04 first design discussions](https://docs.google.com/presentation/d/1oWmnyxOfUMWgeQi0XyX4fX9YDMX1vl6h/edit?usp=drive_link&rtpof=true&sd=true)

## License

This project is dual-licensed under either:

* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* BSD 3-Clause License ([LICENSE-BSD](LICENSE-BSD) or https://opensource.org/licenses/BSD-3-Clause)

at your option.

See [LICENSE](LICENSE) for more details.
Make the implementation of the `AutoencoderDataset` more efficient? (see `docs/autoencoder.md`)
94 changes: 94 additions & 0 deletions config.yaml
@@ -0,0 +1,94 @@
model:
  autoencoder:
    optimizer_class: "${as_class: 'torch.optim.AdamW'}"
    optimizer_kwargs:
      lr: 0.001
      betas: [0.5, 0.9]
      weight_decay: 0.001
    lr_scheduler:
      class: "${as_class: 'torch.optim.lr_scheduler.ReduceLROnPlateau'}"
      kwargs:
        patience: 3
        factor: 0.25
      extra:
        monitor: 'val/rec_loss'
        frequency: 1
        interval: 'epoch'
    antialiaser:
      use: True
      kwargs: {}
    encoder: {}
    decoder: {}
    net_kwargs:
      hidden_width: &autoencoder_hidden_width 32
    loss:
      kl_weight: 0.01
    trainer:
      max_epochs: 200
      accelerator: 'gpu'
      log_every_n_steps: 5
      callbacks: "${as_class: '[pl.callbacks.EarlyStopping(\"val/loss_epoch\", patience=6, verbose=True, check_finite=False)]'}"
      strategy: 'ddp'
      num_nodes: 1
      sync_batchnorm: True
    dataloader:
      batch_size: 1
      num_workers: 0
      persistent_workers: False

  ldm:
    conditioner:
      autoencoder_dim: *autoencoder_hidden_width
      output_patches: &output_patches 5
      cascade_depth: 3
      embed_dim: 128
      analysis_depth: 4
    denoiser:
      in_channels: *autoencoder_hidden_width
      model_channels: 256
      out_channels: *autoencoder_hidden_width
      num_res_blocks: 2
      attention_resolutions: [1, 2]
      dims: 3
      channel_mult: [1, 2, 4]
      num_heads: 8
      num_timesteps: *output_patches
      context_ch: [128, 256, 512]  # should be equal to conditioner.cascade_dims?
    ema:
      use: True
      kwargs:
        store_device: 'cuda'
    optimizer_class: "${as_class: 'torch.optim.AdamW'}"
    optimizer_kwargs:
      lr: 0.0001
      betas: [0.5, 0.9]
      weight_decay: 0.001
    lr_scheduler:
      class: "${as_class: 'torch.optim.lr_scheduler.ReduceLROnPlateau'}"
      kwargs:
        patience: 3
        factor: 0.25
      extra:
        monitor: 'val/loss'  # is actually the EMA loss, since the EMA weights are used for validation
        frequency: 1
        interval: 'epoch'
    scheduler: {}  # diffusion scheduler
    trainer:
      max_epochs: 200
      accelerator: 'gpu'
      log_every_n_steps: 5
      callbacks: "${as_class: '[pl.callbacks.EarlyStopping(\"val/loss_epoch\", patience=6, verbose=True, check_finite=False)]'}"
      strategy: 'ddp'
      num_nodes: 1
      sync_batchnorm: True
    dataloader:
      batch_size: 1
      num_workers: 0
      persistent_workers: False

sampled_radar_dataset:
  zarr_path: '/scratch/martinbo/MLCast/radklim.zarr'
  csv_path: '/scratch/martinbo/MLCast/LDCastTraining/indexes_radklim/sampled_datacubes_2001-01-01-2001-01-01_24x256x256_3x16x16_1500000.csv'
  steps: 24
  augment: False
  data_var: 'RR'
80 changes: 80 additions & 0 deletions docs/autoencoder.md
@@ -0,0 +1,80 @@
# Autoencoder documentation

1. [Autoencoder class](#autoencoder-class)
2. [Tensor shapes](#tensor-shapes)
3. [Encoding and decoding](#encoding-and-decoding)
4. [Loading original weights](#loading-original-weights)
5. [Antialiasing](#antialiasing)
6. [Autoencoder training dataset](#autoencoder-training-dataset)
7. [Background on variational autoencoders](#background-on-variational-autoencoders)

## Autoencoder class

The `Autoencoder` class is a subclass of `NowcastingLightningModule` and takes two main arguments:
- the `net` (an instance of `AutoencoderKLNet` for LDCast), which is the neural network of the autoencoder, containing the encoder and the decoder
- the `loss` (an instance of `AutoencoderLoss` for LDCast)

Options for the optimizer and the learning rate scheduler can be passed as well.

An instance can be created from a `dict` containing the configuration, based on the architecture of LDCast's autoencoder:
```python
from mlcast.models.ldcast.autoenc.autoencoder import Autoencoder
autoencoder = Autoencoder.from_config(config)
```

## Tensor shapes

The autoencoder encodes sequences of radar images (not image by image). The number of radar images encoded at once is given by `autoenc_time_ratio` and was set to 4 in the original code (and kept here). `Conv3d` layers are used for the encoding, so input tensors have shape
```
(batch_size, n_channels, autoenc_time_ratio, height, width)
```
`n_channels` is always 1 for radar images.

In latent space, the tensors have shape `(batch_size, 32, n, 64, 64)`, where 32 is the `hidden_width` of the autoencoder and `n` is the number of consecutive encoded radar images divided by `autoenc_time_ratio`. **I should still clarify which of these parameters can be changed freely, and how they affect other shapes. Can `autoencoder.net` encode e.g. 8 images at once (in which case `n` would be 2)?**
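
As a quick sanity check, here is a hedged sketch (it assumes `hidden_width=32`, a spatial downsampling factor of 4, and that `encode` returns the latent mean by default, as described below):

```python
import torch

inputs = torch.randn(1, 1, 4, 256, 256)  # one sequence of 4 radar images
latent = autoencoder.net.encode(inputs)
print(latent.shape)  # expected: torch.Size([1, 32, 1, 64, 64]), i.e. n = 1
```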


## Encoding and decoding

Doing the following
```python
import torch
inputs = torch.randn(1, 1, 4, 256, 256, device = 'cuda') # fake sample
autoencoder(inputs)
```
is equivalent to `autoencoder.net(inputs)` and computes the whole forward pass through the `net` (encoding + decoding). To encode only, one needs to do
```python
autoencoder.net.encode(inputs)
```
If `encoded` is an encoded sample, it can be decoded as
```python
autoencoder.net.decode(encoded)
```
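
Putting the two together, a round trip should reconstruct (an approximation of) the input. A minimal sketch under the API above:

```python
encoded = autoencoder.net.encode(inputs)
reconstructed = autoencoder.net.decode(encoded)
# assumption: decoding restores the original input shape
assert reconstructed.shape == inputs.shape
```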

## Loading original weights

The original weights can be loaded directly as
```python
autoenc_weights_fn = '/path/to/original/autoencoder/weights'
autoencoder.net.load_state_dict(torch.load(autoenc_weights_fn))
```

## Antialiasing

As in the original code, antialiasing is applied by default (by an `Antialiaser` object) to the inputs before they are fed to the `net`.

## Autoencoder training dataset

Gabriele's code produces a dataset whose samples are sequences of `steps` images (`steps` is usually set to 24, giving 4 input images and 20 ground-truth images).

But the autoencoder needs samples that are sequences of only 4 images, so each sample of the `SampledRadarDataset` needs to be divided into 6 samples. This is done by the `AutoencoderDataset`. Its samples are tuples `(x, y)` where `y = x`, since we want the autoencoder to reconstruct the sequences.

**The current implementation of this class is not the most efficient: when iterating over the `AutoencoderDataset`, each sample of the `SampledRadarDataset` is loaded 6 times.**
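
For illustration, here is a hypothetical sketch of the windowing idea (this is **not** the actual `AutoencoderDataset` implementation, and the axis layout of the base samples is an assumption):

```python
from torch.utils.data import Dataset

class WindowedDataset(Dataset):  # hypothetical name
    """Splits each `steps`-long sequence into non-overlapping `window`-step samples."""

    def __init__(self, base_dataset, steps=24, window=4):
        self.base = base_dataset
        self.window = window
        self.per_sample = steps // window  # 6 windows when steps=24

    def __len__(self):
        return len(self.base) * self.per_sample

    def __getitem__(self, idx):
        base_idx, win = divmod(idx, self.per_sample)
        seq = self.base[base_idx]  # the full sequence is loaded each time
        start = win * self.window
        # assumption: time is the third-to-last axis, as in (channels, time, height, width)
        x = seq[..., start:start + self.window, :, :]
        return x, x  # y = x: the autoencoder reconstructs its input
```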

## Background on variational autoencoders

The autoencoder used in LDCast is a variational autoencoder. Here is some background on that kind of autoencoder.

Source: https://medium.com/@jpark7/finally-a-clear-derivation-of-the-vae-kl-loss-4cb38d2e47b3.

Variational autoencoders encode the data as a normal distribution in latent space: each sample is represented by the mean and the standard deviation of that distribution. Decoding produces a new sample that resembles the original but is not exactly the same. The degree to which the decoded samples are forced to resemble the originals is tuned by the `kl_weight` parameter of the KL loss.
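
For reference, the standard VAE objective (which the loss here presumably follows) combines a reconstruction term with the KL term weighted by `kl_weight`; for a diagonal Gaussian posterior with mean $\mu$ and standard deviation $\sigma$, the KL term has a closed form:

```math
\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathtt{kl\_weight}\cdot D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\Vert\,\mathcal{N}(0, I)\big),
\qquad
D_{\mathrm{KL}} = -\tfrac{1}{2}\sum_{j}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```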

When using the encoded sample (for example to produce a condition with the conditioner), only the mean is used. In the original code, `autoencoder.net.encode` returned a tuple `(mean, log_var)`, so one had to select the mean with `autoencoder.net.encode(x)[0]`, which was not very clear. I replaced this by adding a `return_log_var` keyword to `autoencoder.net.encode`.
57 changes: 57 additions & 0 deletions docs/ldcast.md
@@ -0,0 +1,57 @@
# Main LDCast class documentation

1. [LDCast class](#ldcast-class)
2. [Inference](#inference)
3. [Loading/saving weights](#loadingsaving-weights)
4. [Training](#training)

## LDCast class

The `LDCast` class is a subclass of `NowcastingModelBase` and takes three arguments:
- the `ldm` (typically, an instance of `LatentDiffusion`)
- the `autoencoder` (typically, an instance of `Autoencoder`)
- the `sampler`

An instance can be created from a `dict` containing the configuration, based on the architecture of LDCast:
```python
from mlcast.models.ldcast.ldcast import LDCast
ldcast = LDCast.from_config(config)
```
A config very close to the one used in the original code is provided in `config.yaml`. It should be loaded as
```python
from omegaconf import OmegaConf
OmegaConf.register_new_resolver("as_class", lambda class_name: eval(class_name))
config = OmegaConf.load('config.yaml')
```
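
The `as_class` resolver simply `eval`s the string, so any names used inside the config (such as `torch` or `pl`) must be in scope in the loading script. A hedged sketch (the access path assumes the nesting used in `config.yaml`):

```python
import torch
import pytorch_lightning as pl  # referenced by the callbacks entries in config.yaml
from omegaconf import OmegaConf

OmegaConf.register_new_resolver("as_class", lambda class_name: eval(class_name))
config = OmegaConf.load('config.yaml')

# Resolution happens on access: this returns the class torch.optim.AdamW.
opt_cls = config.model.autoencoder.optimizer_class
```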

## Inference

Predictions can be produced with
```python
import torch
inputs = torch.randn(1, 1, 4, 256, 256, device = 'cuda') # fake data
ldcast.predict(inputs)
```
**Do not use this for the moment: the EMA weights (if used) are not yet automatically applied during inference.**

## Loading/saving weights
To load from a folder containing the weights of the autoencoder, the denoiser, and the conditioner in separate files (and possibly EMA weights):
```python
ldcast.load('/path/to/folder')
```
To save in a folder:
```python
ldcast.save('/path/to/folder')
```

## Training

If `sampled_radar_dataset` is a `SampledRadarDataset` built with Gabriele's code (https://github.com/DSIP-FBK/ConvGRU-Ensemble/blob/main/convgru_ensemble/datamodule.py), the autoencoder can be trained with
```python
ldcast.fit_autoencoder(sampled_radar_dataset)
```
and the LDM can be trained with
```python
ldcast.fit_ldm(sampled_radar_dataset)
```
Keyword arguments can be passed to the trainer and the dataloader through the `trainer_kwargs` and `dataloader_kwargs` keywords.
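
For example (a hedged sketch: the accepted keys mirror the usual Lightning `Trainer` and PyTorch `DataLoader` arguments):

```python
ldcast.fit_autoencoder(
    sampled_radar_dataset,
    trainer_kwargs={'max_epochs': 50, 'accelerator': 'gpu'},
    dataloader_kwargs={'batch_size': 4, 'num_workers': 8},
)
```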