WaymentSteeleLab/Evo_or_Thermo

Disentangling RNA evolution and thermodynamics in genomic language models — reproducibility code

This repository contains the code and processed result tables used to generate the figures for the manuscript: “Disentangling RNA evolution and thermodynamics in genomic language models.”

It is intended to (1) reproduce the plotting/figure panels from precomputed result tables, and (2) provide single-sequence scripts to compute model-derived pairwise signals (CJ/REDIAL) and thermodynamic base-pairing probabilities (BPP) for custom inputs.

About: reproducibility code and processed tables for CJ-based pairwise dependency analyses and thermodynamic baselines in genomic language models.

What is included

  • Codes_for_figures/: Jupyter notebooks used to generate Figures 1–5.
  • Processed_results/: processed CSV tables used by the figure notebooks (subfolders: 6dataset/, RFAM/).
  • Candidate_csv/: small metadata / example input tables used by notebooks.
  • Single_sequence_python_script/: command-line scripts for single-sequence runs (CJ/BPP/REDIAL) and helper analyses.

Quick start (re-generate figures from processed tables)

  1. Create a Python environment and install the packages listed in requirements_figures.txt (standard scientific Python packages).
  2. Start Jupyter in this repo and run the notebooks in Codes_for_figures/:
    • Codes_for_figures/Figure1.ipynb
    • Codes_for_figures/Figure2.ipynb
    • Codes_for_figures/Figure3.ipynb
    • Codes_for_figures/Figure4.ipynb
    • Codes_for_figures/Figure5.ipynb

By default, the notebooks write figure files under Codes_for_figures/_outputs/ (created at runtime).
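The notebooks can also be executed without opening the Jupyter interface. The following is a minimal sketch, not part of the repository, that assumes `jupyter nbconvert` is available in the environment from step 1 and that the code is run from the repository root. The helper names `build_nbconvert_command` and `run_all` are hypothetical.

```python
import subprocess
from pathlib import Path

# Notebook paths as listed in the quick start above.
NOTEBOOK_DIR = Path("Codes_for_figures")
NOTEBOOKS = [NOTEBOOK_DIR / f"Figure{i}.ipynb" for i in range(1, 6)]


def build_nbconvert_command(notebook: Path) -> list[str]:
    """Return a command that executes one notebook and saves it in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


def run_all(notebooks=NOTEBOOKS) -> None:
    """Execute each notebook in order, stopping at the first failure."""
    for notebook in notebooks:
        if not notebook.exists():
            raise FileNotFoundError(f"Notebook not found: {notebook}")
        # check=True raises CalledProcessError if a notebook cell fails.
        subprocess.run(build_nbconvert_command(notebook), check=True)
```

Calling `run_all()` from the repository root runs Figure1.ipynb through Figure5.ipynb in sequence. Note that `--inplace` overwrites each notebook with its executed version; drop that flag to keep the originals untouched.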

Single-sequence scripts (CJ / BPP / REDIAL)

This folder also contains a notebook, Single_sequence_python_script/generate_synthetic_rnas_for_fig5.ipynb, which generates the structure-matched synthetic sequences used in the Figure 5 analyses.

The scripts in Single_sequence_python_script/ take a single sequence (--seq) and dot-bracket structure (--dbn) and write matrices/plots/metrics into an output folder:

  • RNAfm_single_sequence.py: RNA-FM CJ + EternaFold BPP (requires RNA-FM checkpoint + Arnie/EternaFold).
  • Evo2_single_sequence.py: Evo2 CJ + EternaFold BPP.
  • gLM2_single_sequence.py: gLM2 CJ + EternaFold BPP.
  • Vienna_single_sequence.py: ViennaRNA BPP.
  • REDIAL_single_sequence.py: REDIAL contact maps.

These scripts are designed to be run either in your local environment (if dependencies are installed) or inside a container. See the docstrings at the top of each script for example commands.
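Before launching a run, it can help to check that the sequence and the dot-bracket string are consistent. The following is a minimal sketch, not part of the repository: `parse_dot_bracket` and `build_command` are hypothetical helpers, and the parser assumes a structure written only with `(`, `)`, and `.` (the scripts themselves may accept other symbols). Only the `--seq` and `--dbn` options named above are used; the option that sets the output folder is not listed here, so check each script's docstring for it.

```python
import sys


def parse_dot_bracket(seq: str, dbn: str) -> list[tuple[int, int]]:
    """Return sorted 0-based (i, j) base pairs; raise ValueError if malformed."""
    if len(seq) != len(dbn):
        raise ValueError(
            f"Sequence length {len(seq)} != structure length {len(dbn)}"
        )
    open_positions = []  # stack of indices of unmatched '('
    pairs = []
    for j, symbol in enumerate(dbn):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"Unmatched ')' at position {j}")
            pairs.append((open_positions.pop(), j))
        elif symbol != ".":
            raise ValueError(f"Unsupported symbol {symbol!r} at position {j}")
    if open_positions:
        raise ValueError(f"Unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


def build_command(script: str, seq: str, dbn: str) -> list[str]:
    """Build an argument list for one single-sequence script after validation."""
    parse_dot_bracket(seq, dbn)  # fail early on inconsistent input
    return [sys.executable, script, "--seq", seq, "--dbn", dbn]


# Toy hairpin used purely for illustration.
pairs = parse_dot_bracket("GGGAAACCC", "(((...)))")
print(pairs)  # [(0, 8), (1, 7), (2, 6)]
```

The list returned by `build_command` can be passed to `subprocess.run` once the required dependencies for the chosen script are installed.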

If you have a sequence you are interested in, you can also try the RNA-FM-based CJ and mirror-test demo in Colab: CJ + Mirror Test (Colab)
