Conversation

@LTMeyer (Collaborator) commented Apr 7, 2025

Context

This PR is related to #6 (comment).
We want to be able to ship the model and tokenizers as standalone classes to simplify the readability, maintenance, and distribution of the code. Currently, the only way to load a trained tokenizer is the load_tokenizer method, which relies on torch.package. It loads all the dependencies stored with the trained tokenizer and hides the core classes and their logic.
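
For reference, the current loading path looks roughly like the following sketch; the archive path and pickle names are placeholders, not the actual internals of load_tokenizer:

```python
# Minimal sketch of the torch.package-based loading this PR moves away from.
# "tokenizer_checkpoint.pt", "tokenizer", and "model.pkl" are illustrative.
from torch.package import PackageImporter

importer = PackageImporter("tokenizer_checkpoint.pt")
# All dependencies are resolved from inside the archive, so the core
# classes never have to exist in the installed code base.
tokenizer = importer.load_pickle("tokenizer", "model.pkl")
```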

This PR rewrites the image tokenizer saved at /mnt/ceph/users/polymathic/mmoma/outputs/multisurvey/a88h9lef/checkpoints/HSC+DECaLSCodec-epoch=02.ckpt.pt.

Process

Code Generation

The process to rewrite the code is:

  • extract the original code with inspect (see Beta testers program todo #6 (comment) for more details); a sketch of this step follows the list
  • remove unnecessary parts of the extracted classes (e.g. PyTorch Lightning attributes related to training)
  • factorize classes (use torch.nn.Module subclasses instead of functions to implement models)
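
As a rough illustration of the extraction step, assuming a hypothetical module path and class name inside the archive:

```python
# Hedged sketch of extracting a class's source from a torch.package
# archive. The archive path, module path, and class name are hypothetical.
import inspect
from torch.package import PackageImporter

importer = PackageImporter("tokenizer_checkpoint.pt")  # placeholder path
# torch.package keeps module sources inside the archive, which is what
# lets inspect recover them; if inspect cannot locate a source, the file
# can be read from the archive directly.
ImageCodec = importer.import_module("mmoma.codecs.image").ImageCodec  # hypothetical names
with open("image_codec_extracted.py", "w") as f:
    f.write(inspect.getsource(ImageCodec))
```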

Model Weights Retrieval

Based on the original state dict, we can remap the keys to generate a new state dict that is compatible with the rewritten model.
I have checked that both models produce the same output for the same random input; a sketch of this check follows.
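
A minimal sketch of that verification, assuming a hypothetical key prefix, import path, and input shape:

```python
# Illustrative check that the remapped weights reproduce the original
# model. Checkpoint path, pickle names, key prefix, and input shape are
# all placeholders.
import torch
from torch.package import PackageImporter

from aion_codecs import ImageCodec  # hypothetical import of the rewritten class

original = PackageImporter("original_checkpoint.ckpt.pt").load_pickle("tokenizer", "model.pkl")

# Remap state-dict keys to the layout expected by the rewritten class.
old_state = original.state_dict()
new_state = {k.removeprefix("model."): v for k, v in old_state.items()}  # example rename

rewritten = ImageCodec()
rewritten.load_state_dict(new_state)

original.eval()
rewritten.eval()
x = torch.randn(1, 3, 96, 96)  # random input; shape is illustrative
with torch.no_grad():
    assert torch.allclose(original(x), rewritten(x), atol=1e-6)
```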

Questions

To decide whether we want to merge this PR and proceed with the remaining tokenizers (39, according to @EiffL), we should answer the following questions in addition to reviewing the code.

  • The process described above involves partially rewriting many classes that are present in https://github.com/PolymathicAI/MMOMA. Is this something we want in the long term? Would we have only one repo, with AION superseding MMOMA? Alternatively, we could keep one internal repo for experimentation and a public one for releases.
  • The current process removes the methods related to training and PyTorch Lightning. Consequently, it is not straightforward to train the tokenizers from the current code. Is this something we can afford to drop for the release? As an aside, a recommended long-term practice is to separate the pure PyTorch classes from the PyTorch Lightning wrappers used for training.

@LTMeyer LTMeyer requested review from EiffL, al-jshen and lhparker1 April 7, 2025 20:26
@LTMeyer LTMeyer requested a review from swagnercarena April 8, 2025 11:52
@EiffL (Contributor) commented Apr 8, 2025

excellent!

To your 2 questions... I think we would only need to do the same exercise for a few additional tokenizers:

  • spectrum tokenizer
  • "catalog" tokenizer
  • "scalar-field" tokenizer
  • scalar tokenizer (there are 3 variants I think: linear, log scale, fixed scale)
    These are the base classes; most scalar modalities then use one of these scalar tokenizer variants.

Regarding whether it's a problem not to have the training code and such, the answer is no. We can release the training code, and everything that comes with it, separately. It's also not fundamentally a problem that MMOMA continues to evolve separately. When the codecs for AION-2 are ready, we can do a bit of code cleaning so that they are easier to export correctly.

@EiffL EiffL requested a review from Copilot April 8, 2025 14:58

Copilot AI left a comment

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

EiffL and others added 2 commits April 8, 2025 11:10
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@EiffL
Copy link
Contributor

EiffL commented Apr 8, 2025

lol, thanks Copilot, very in-depth review you did there

@LTMeyer LTMeyer mentioned this pull request Apr 9, 2025
@LTMeyer (Collaborator, Author) commented May 14, 2025

I've checked locally that we can reproduce the same encoded output for a batch from legacysurvey.

I would like to upload the model to HF (privately to start with). I could then fetch the weights directly from HF in the tests; otherwise I would need to ship the 200MB checkpoint with the repository.
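
A possible workflow for that, assuming the huggingface_hub client and a placeholder repo id:

```python
# Possible workflow for hosting the checkpoint privately on the
# Hugging Face Hub; the repo id and filenames are placeholders.
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
api.create_repo("polymathic-ai/aion-image-codec", private=True, exist_ok=True)
api.upload_file(
    path_or_fileobj="image_codec.ckpt.pt",
    path_in_repo="image_codec.ckpt.pt",
    repo_id="polymathic-ai/aion-image-codec",
)

# Tests can then fetch the weights instead of shipping the ~200MB file:
ckpt_path = hf_hub_download("polymathic-ai/aion-image-codec", "image_codec.ckpt.pt")
```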

@LTMeyer LTMeyer merged commit e9996f8 into main May 22, 2025
2 checks passed
@LTMeyer LTMeyer deleted the add_tokenizers branch May 22, 2025 13:44