Small Lessons, Big Learner: Fine-Tuning of the ESM-2 Protein Language Model Through Knowledge Distillation
Authors: Edir Sebastian Vidal Castro, Florencia De Lillo, Kacper Maciejewski, Rodrigo Gallegos Dextre
Supervisor: Jonathan Funk
Repository for the final project of the 02456 Deep Learning course (Fall 2024) at the Technical University of Denmark.
PLM_PROJECT/
│
├── bin/ # Playground directory for trial-and-error experiments
│ ├── example_data/ # Examples of input data format
│ │ ├── uniprot_data.csv
│ │ └── uniref_id_UniRef100_A0A003_OR_id_UniR_2024_11_17.ts
│ │
│ ├── data_preprocessing_prototype.ipynb # Legacy code for data loading
│ ├── testing_loss_functions.ipynb # Legacy code for loss implementation
│ └── testing_training_loop.ipynb # Legacy code for training loop
│
├── src/ # Main source code
│ ├── evaluation/ # Create plots from training outputs
│ │ ├── acc_perplex.py # Create boxplots for the poster
│ │ ├── download_mlflow_results.py # Get tracked training parameters and metrics
│ │ └── line_plots.py # Create line plots for the poster
│ │
│ ├── training/ # Training loop for knowledge distillation
│ │ ├── get_logits.py # Precomputes logits from the teacher model
│ │ ├── get_reps.py # Precomputes representations from the teacher model
│ │ └── training_loop.py # Knowledge-distillation loop on the precomputed results (see the sketch below)
│ │
│ ├── utils/ # Training-related code
│ │ ├── data_utils.py # Data loader with taxonomy-oriented batching
│ │ ├── loss_functions.py # Dual-loss implementation
│ │ └── token_mask.py # Multiprocessing-enabled sequence masking
