Conversation

@LTMeyer (Collaborator) commented Apr 7, 2025

Context

This PR is related to #6 (comment).
We want to be able to ship the model and tokenizers as standalone classes to simplify the readability, maintenance, and distribution of the code. Currently, the only way to load a trained tokenizer is the load_tokenizer method, which relies on torch.package. It loads all the dependencies stored with the trained tokenizer and hides the core classes and their logic.
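
For reference, the current loading path looks roughly like the following sketch; the archive path and pickle names are placeholders, not the actual internals of load_tokenizer:

```python
# Minimal sketch of the torch.package-based loading this PR moves away from.
# "tokenizer_checkpoint.pt", "tokenizer", and "model.pkl" are illustrative.
from torch.package import PackageImporter

importer = PackageImporter("tokenizer_checkpoint.pt")
# All dependencies are resolved from inside the archive, so the core
# classes never have to exist in the installed code base.
tokenizer = importer.load_pickle("tokenizer", "model.pkl")
```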

This PR rewrites the image tokenizer saved at /mnt/ceph/users/polymathic/mmoma/outputs/multisurvey/a88h9lef/checkpoints/HSC+DECaLSCodec-epoch=02.ckpt.pt.

Process

Code Generation

The process to rewrite the code is:

  • extract the original code with inspect (see Beta testers program todo #6 (comment) for more details); a sketch of this step follows the list
  • remove unnecessary parts of the extracted classes (e.g. PyTorch Lightning attributes related to training)
  • factorize classes (use torch.nn.Module subclasses instead of functions to implement models)
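
As a rough illustration of the extraction step, assuming a hypothetical module path and class name inside the archive:

```python
# Hedged sketch of extracting a class's source from a torch.package
# archive. The archive path, module path, and class name are hypothetical.
import inspect
from torch.package import PackageImporter

importer = PackageImporter("tokenizer_checkpoint.pt")  # placeholder path
# torch.package keeps module sources inside the archive, which is what
# lets inspect recover them; if inspect cannot locate a source, the file
# can be read from the archive directly.
ImageCodec = importer.import_module("mmoma.codecs.image").ImageCodec  # hypothetical names
with open("image_codec_extracted.py", "w") as f:
    f.write(inspect.getsource(ImageCodec))
```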

Model Weights Retrieval

Based on the original state dict, we can remap the keys to generate a new state dict that is compatible with the rewritten model.
I have checked that both models produce the same output for the same random input; a sketch of this check follows.
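
A minimal sketch of that verification, assuming a hypothetical key prefix, import path, and input shape:

```python
# Illustrative check that the remapped weights reproduce the original
# model. Checkpoint path, pickle names, key prefix, and input shape are
# all placeholders.
import torch
from torch.package import PackageImporter

from aion_codecs import ImageCodec  # hypothetical import of the rewritten class

original = PackageImporter("original_checkpoint.ckpt.pt").load_pickle("tokenizer", "model.pkl")

# Remap state-dict keys to the layout expected by the rewritten class.
old_state = original.state_dict()
new_state = {k.removeprefix("model."): v for k, v in old_state.items()}  # example rename

rewritten = ImageCodec()
rewritten.load_state_dict(new_state)

original.eval()
rewritten.eval()
x = torch.randn(1, 3, 96, 96)  # random input; shape is illustrative
with torch.no_grad():
    assert torch.allclose(original(x), rewritten(x), atol=1e-6)
```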

Questions

To decide whether we want to merge this PR and proceed with the remaining tokenizers (39, according to @EiffL), we should answer the following questions in addition to reviewing the code.

  • The process described above involves partially rewriting many classes that are present in https://github.com/PolymathicAI/MMOMA. Is this something we want in the long term? Would we have only one repo, with AION superseding MMOMA? Alternatively, we could keep one internal repo for experimentation and a public one for releases.
  • The current process removes the methods related to training and PyTorch Lightning. Consequently, it is not straightforward to train the tokenizers from the current code. Is this something we can afford to drop for the release? As an aside, a recommended long-term practice is to separate the pure PyTorch classes from the PyTorch Lightning wrappers used for training.

@LTMeyer LTMeyer requested review from EiffL, al-jshen and lhparker1 April 7, 2025 20:26
@LTMeyer LTMeyer requested a review from swagnercarena April 8, 2025 11:52
@EiffL (Contributor) commented Apr 8, 2025

excellent!

To your 2 questions... I think we would only need to do the same exercise for a few additional tokenizers:

  • spectrum tokenizer
  • "catalog" tokenizer
  • "scalar-field" tokenizer
  • scalar tokenizer (there are 3 variants I think: linear, log scale, fixed scale)
    These are the base classes; most scalar modalities then use one of these scalar tokenizer variants.

Regarding whether it's a problem not to have the training code and such, the answer is no. We can release the training code, and everything that comes with it, separately. It's also not fundamentally a problem that MMOMA continues to evolve separately. When the codecs for AION-2 are ready, we can do a bit of code cleaning so that they are easier to export correctly.

@EiffL EiffL requested a review from Copilot April 8, 2025 14:58

Copilot AI left a comment

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

EiffL and others added 2 commits April 8, 2025 11:10
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@EiffL
Copy link
Contributor

EiffL commented Apr 8, 2025

lol, thanks Copilot, very in-depth review you did there

@LTMeyer LTMeyer mentioned this pull request Apr 9, 2025
@LTMeyer (Collaborator, Author) commented May 14, 2025

I've checked locally that we can reproduce the same encoded output for a batch from legacysurvey.

I would like to upload the model to HF (privately to start with). I could then fetch the weights directly from HF in the tests; otherwise I would need to ship the 200MB checkpoint with the repository.
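
A possible workflow for that, assuming the huggingface_hub client and a placeholder repo id:

```python
# Possible workflow for hosting the checkpoint privately on the
# Hugging Face Hub; the repo id and filenames are placeholders.
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
api.create_repo("polymathic-ai/aion-image-codec", private=True, exist_ok=True)
api.upload_file(
    path_or_fileobj="image_codec.ckpt.pt",
    path_in_repo="image_codec.ckpt.pt",
    repo_id="polymathic-ai/aion-image-codec",
)

# Tests can then fetch the weights instead of shipping the ~200MB file:
ckpt_path = hf_hub_download("polymathic-ai/aion-image-codec", "image_codec.ckpt.pt")
```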

@LTMeyer LTMeyer merged commit e9996f8 into main May 22, 2025
2 checks passed
@LTMeyer LTMeyer deleted the add_tokenizers branch May 22, 2025 13:44