Comparative Analysis of RNA Secondary Structure across Eukaryotic mRNAs

Master's Thesis by Sören Yannick Seitz, University of Vienna
Supervisor: Univ.-Prof. Dipl.-Phys. Dr. Ivo Hofacker
Co-Supervisors: Mag. Dr. Michael Thomas Wolfinger, Mag. Stefan Badelt, PhD
Department of Theoretical Chemistry, Faculty of Chemistry

This repository contains the analysis code and custom Python library for a comparative study of RNA secondary structure features across 12 eukaryotic species.

Species

Category	Species
Mammals	Human, Mouse, Cow, Bat, Platypus
Birds	Guineafowl
Fish	Zebrafish
Insects	Fruit fly (Drosophila melanogaster)
Plants	Arabidopsis, Maize, Rice
Fungi	Yeast (Saccharomyces cerevisiae)

Related Repositories

predTED — A gradient-boosted regression model for predicting RNA tree-edit distances from structural features. Used as the prefiltering step in the clustering pipeline (Thesis Sections 3.5 and 4.3).
Polars — The fast DataFrame library used throughout this project. All per-species data is stored as masterDataFrame.parquet and processed with Polars (polars-lts-cpu).

Repository Structure

├── my_module/                       # Core Python library (see below)
├── RNAclustPY/                      # Python port of RNAclust (clustering pipeline)
│
│  Data generation (write to masterDataFrame.parquet)
├── workflow.ipynb                    # Main orchestration pipeline
├── masterDataFrame.ipynb            # Central dataframe generation
├── RNALfold.ipynb                   # Local structure prediction (SLURM)
├── RNAplfold.ipynb                  # Opening energy calculation (SLURM)
├── shuffling_transcripts.ipynb      # Sequence shuffling controls
│
│  Analysis (read from masterDataFrame.parquet)
├── data_statistics3.ipynb           # Transcript & GC statistics
├── kozak_sequence.ipynb             # GC profiles around start/stop codons
├── codon_bias.ipynb                 # Codon usage analysis
├── Phylo.ipynb                      # Phylogenetic tree
├── openenergies.ipynb               # Opening energy profiles
│
│  Structural motif discovery
├── MSA with RNAclust.py.ipynb       # Clustering pipeline & Rfam validation
├── RNASearch_quick.ipynb            # RNAsearch motif search
│
├── setup.py                         # Package installation
└── LICENSE                          # MIT License

Installation

# Clone the repository
git clone https://github.com/syseitz/master-s-thesis.git
cd master-s-thesis

# Install in development mode (requires conda environment with ViennaRNA, Infernal)
pip install -e .

Dependencies: polars-lts-cpu, tqdm, numpy
Conda packages (not on PyPI): ViennaRNA, Infernal, MAFFT, BioPython

`my_module/` — Core Library

All analysis logic is implemented as reusable functions in my_module/. Notebooks only import the library and call its functions.

API Documentation — Full reference for all modules, auto-generated from docstrings.

import my_module as RNA

Module Overview

Module	Purpose	Thesis reference
`dataframes.py`	Loading and processing the master dataframe (Polars) per species	Chapter 3
`io_operations.py`	I/O for BED, FASTA, MFE, GFF3 files	Chapter 3
`isoforms.py`	Isoform selection (longest, MANE Select)	Section 3.1
`statistics.py`	Statistical analysis and matplotlib visualisations	Chapters 4–5
`shuffling.py`	Sequence shuffling orchestration (mono-, di-, trinucleotide, codon)	Section 3.4
`CodonShuffle3.py`	Codon shuffling with GC/AT preservation at 3rd codon position	Section 3.4
`openenergies.py`	RNA opening energy calculations with confidence intervals	Section 4.2
`rna_structures.py`	RNA secondary structure prediction via ViennaRNA	Section 3.4
`codon_bias.py`	Codon usage analysis (ENC, gtAI, CAI, FOP)	Section 4.1
`kozak_sequence.py`	Kozak motif extraction and GC profiles around start/stop codons	Section 4.1
`cmscan_output_parser.py`	Parsing Rfam cmscan `.tblout` results	Section 4.3
`rfam.py`	Rfam data handling and annotation	Section 4.3
`make_cm_library.py`	Building covariance model libraries from consensus structures	Section 3.5
`run_cmscan.py`	Automated cmscan execution against CM libraries	Section 3.5
`pairwise_alignments.py`	Pairwise RNA structure alignments	Section 3.5
`AlifoldConsensusFinder.py`	Consensus structure derivation from multiple sequence alignments	Section 3.5
`rnasearch.py`	RNAsearch interface for motif discovery	Section 4.3
`RNAsoup.py`	Cluster viewer for RNAclust results	Section 3.5
`rna_clust_tree.py`	RNAclust dendrogram visualisation	Section 3.5
`bed_files.py`	BED6 format processing for RNALfold output	Section 3.4
`nucleotide_count.py`	Nucleotide counting utilities	Chapter 3
`codonmapping.py`	Codon-to-position mapping generation	Section 3.3
`cluster_processing.py`	SLURM pipeline for cross-species batch processing	Chapter 3
`slurm_cluster.py`	SLURM job submission and monitoring	Chapter 3
`mane_select.py`	MANE Select transcript handling (Human)	Section 3.1
`rpfdb.py`	Ribosome profiling database access	—
`formatting.py`	Terminal output formatting	—
`logging_config.py`	Logging configuration	—
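
To make the isoform selection listed for `isoforms.py` concrete, the following is a minimal sketch of longest-isoform selection. The `Transcript` record, its fields, and the function name are hypothetical and chosen for illustration; they are not the library's actual API, and MANE Select handling is not covered.

```python
from typing import Dict, Iterable, NamedTuple


class Transcript(NamedTuple):
    """Hypothetical record; the real dataframe columns may differ."""
    gene_id: str
    transcript_id: str
    length: int


def select_longest_isoforms(transcripts: Iterable[Transcript]) -> Dict[str, Transcript]:
    """Keep one transcript per gene: the longest, ties broken by smaller ID."""
    best: Dict[str, Transcript] = {}
    for tx in transcripts:
        current = best.get(tx.gene_id)
        # Replace when tx is longer, or equally long with a smaller transcript ID.
        if current is None or (tx.length, current.transcript_id) > (
            current.length,
            tx.transcript_id,
        ):
            best[tx.gene_id] = tx
    return best


records = [
    Transcript("g1", "t1", 900),
    Transcript("g1", "t2", 1500),
    Transcript("g2", "t3", 300),
]
longest = select_longest_isoforms(records)
```

The deterministic tie-break keeps the selection reproducible when two isoforms of a gene have the same length.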

Notebooks

Data Generation Notebooks

Notebook	Purpose	Thesis reference
`workflow.ipynb`	Main orchestration pipeline — coordinates data processing across all 12 species	Chapter 3
`masterDataFrame.ipynb`	Builds the central `masterDataFrame.parquet` per species (sequences, regions, GC, MFE)	Chapter 3
`RNALfold.ipynb`	Generates SLURM jobs for local structure prediction (z-score ≤ −2); results are stored in the master dataframe	Section 3.4
`RNAplfold.ipynb`	Generates SLURM jobs for opening energy calculation (window = 210 nt, span = 100 nt); results are stored in the master dataframe	Section 3.4
`shuffling_transcripts.ipynb`	Generates shuffled control sequences (mono-, di-, trinucleotide, codon); shuffled opening energies are stored in the master dataframe	Section 3.4
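
The simplest of the shuffling controls listed above, a mononucleotide shuffle, can be sketched as follows. This is an illustration only: the function name and the example sequence are made up, and the sketch does not reproduce the di-/trinucleotide or codon shuffles, which need a k-let or codon-aware algorithm.

```python
import random
from collections import Counter
from typing import Optional


def mononucleotide_shuffle(seq: str, seed: Optional[int] = None) -> str:
    """Return a random permutation of seq; base composition is preserved exactly."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


original = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"  # made-up example sequence
control = mononucleotide_shuffle(original, seed=1)

# The control has the same length and the same count of every base.
assert Counter(control) == Counter(original)
```

A fixed seed makes a control set reproducible; higher-order dependencies such as dinucleotide frequencies are destroyed by this shuffle.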

Analysis Notebooks (read from `masterDataFrame.parquet`)

Notebook	Purpose	Thesis reference
`data_statistics3.ipynb`	Transcript length distributions, regional GC content across species	Section 4.1
`kozak_sequence.ipynb`	GC content profiles around start and stop codons	Section 4.1
`codon_bias.ipynb`	Codon usage analysis (RSCU, ENC, CAI, FOP, gtAI)	Section 4.1
`Phylo.ipynb`	OrthoFinder proteome preparation and phylogenetic tree	Section 4.1
`openenergies.ipynb`	Opening energy profiles around start/stop codons with shuffling controls	Section 4.2
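
A position-wise GC profile around the start codon, as produced by the GC-profile notebooks above, can be sketched like this. The input layout (sequence plus 0-based CDS start) and the function name are assumptions for illustration, not the interface of `my_module/kozak_sequence.py`.

```python
from typing import Iterable, List, Optional, Tuple


def positional_gc(
    transcripts: Iterable[Tuple[str, int]], flank: int
) -> List[Optional[float]]:
    """GC fraction at each position from -flank to +flank relative to the CDS start.

    transcripts: (sequence, cds_start) pairs with 0-based cds_start.
    Positions that no transcript covers are reported as None.
    """
    gc = [0] * (2 * flank + 1)
    covered = [0] * (2 * flank + 1)
    for seq, cds_start in transcripts:
        for offset in range(-flank, flank + 1):
            pos = cds_start + offset
            if 0 <= pos < len(seq):  # skip positions outside the transcript
                covered[offset + flank] += 1
                if seq[pos] in "GCgc":
                    gc[offset + flank] += 1
    return [g / n if n else None for g, n in zip(gc, covered)]


profile = positional_gc([("GGAUGC", 2), ("AUGCC", 0)], flank=2)
```

Counting coverage per position matters because short 5'UTRs do not reach far upstream positions; dividing by the total number of transcripts would bias those positions towards low GC.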

Structural Motif Discovery Notebooks

`MSA with RNAclust.py.ipynb` — Structural Motif Discovery and Rfam Validation

This is the main analysis notebook for the structural motif discovery pipeline (Thesis Chapter 4, Section 4.3).

Pipeline steps:

1. Prefiltering: extract local structures from RNALfold predictions, filter by z-score
2. Pairwise distance estimation: use predTED to estimate tree-edit distances between structures
3. Hierarchical clustering: group similar structures using fastcluster
4. Structure-aware MSA: align clusters with LocARNA/mLocarna
5. Consensus structures: derive consensus secondary structures (RNAalifold)
6. CM library construction: build covariance models from consensus structures
7. Cross-species cmscan: scan species transcriptomes against CM libraries
8. Unbiased Rfam validation: overlap whole-transcript cmscan hits with local structure clusters (cells 56–70)
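
The final validation step, overlapping cmscan hits with local structures, can be sketched as below. The column positions follow Infernal's default `--tblout` layout (other formats such as `--fmt 2` differ); the `Hit` record, function names, and coordinates are hypothetical and do not reproduce `my_module/cmscan_output_parser.py`.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Hit(NamedTuple):
    family: str      # target (covariance model) name
    transcript: str  # query (sequence) name
    start: int       # 1-based inclusive
    end: int


def parse_tblout(lines: Iterable[str]) -> List[Hit]:
    """Parse cmscan --tblout lines (default format) into Hit records."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split(None, 17)
        # Columns used: 0 target name, 2 query name, 7 seq from, 8 seq to.
        a, b = int(f[7]), int(f[8])
        hits.append(Hit(f[0], f[2], min(a, b), max(a, b)))  # minus strand: from > to
    return hits


Structure = Tuple[str, int, int]  # (transcript, start, end), 1-based inclusive


def overlapping(
    hits: Iterable[Hit], structures: List[Structure], min_overlap: int = 1
) -> Iterator[Tuple[Hit, Structure]]:
    """Yield (hit, structure) pairs sharing at least min_overlap nucleotides."""
    for hit in hits:
        for tx, s, e in structures:
            shared = min(hit.end, e) - max(hit.start, s) + 1
            if tx == hit.transcript and shared >= min_overlap:
                yield hit, (tx, s, e)


example = [
    "#target name  accession query name ...",
    "5S_rRNA RF00001 tx1 - cm 1 119 100 218 + no 1 0.52 0.0 85.3 1.2e-18 ! 5S ribosomal RNA",
]
hits = parse_tblout(example)
pairs = list(overlapping(hits, [("tx1", 200, 260), ("tx1", 300, 350), ("tx2", 100, 218)]))
```

The nested loop is quadratic; for whole transcriptomes an interval index or a sorted sweep per transcript would be the more practical choice.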

Thesis figures generated:

Rfam validation rates per species and region (Section 4.3.3)
Conserved Rfam family heatmaps across species (Section 4.3.3)
Detailed overlap tables for Human (Section 4.3.3)

`RNASearch_quick.ipynb` — RNAsearch-Based Motif Search

Motif discovery using the RNAsearch tool on Human 5'UTR structures (Thesis Section 4.3.4).

Pipeline steps:

1. Load master dataframe and select longest isoforms
2. Read RNALfold predictions for 5'UTR regions
3. Compute global structures and export in .dbn format
4. Run RNAsearch for substructure motif detection
5. Compare with mRNA-level cmscan results (unbiased approach)
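
The export step above can be illustrated with a small dot-bracket writer. The three-line record layout (header, sequence, structure) is a common convention and is assumed here; the exact layout RNAsearch expects is not stated in this README, and both function names are hypothetical.

```python
def is_balanced(structure: str) -> bool:
    """True if structure uses only '(', ')' and '.' and every bracket is matched."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing bracket without a partner
                return False
        elif ch != ".":
            return False
    return depth == 0


def to_dbn(name: str, sequence: str, structure: str) -> str:
    """Format one record as header, sequence and dot-bracket structure lines."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    if not is_balanced(structure):
        raise ValueError("structure is not a balanced dot-bracket string")
    return f">{name}\n{sequence}\n{structure}\n"


record = to_dbn("tx1", "GGGAAACCC", "(((...)))")
```

Validating before writing catches truncated or misaligned structures early, before they reach a downstream tool.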

Cross-Species Aggregation Notebooks

Notebook	Purpose	Thesis reference
`cross_species_analysis.ipynb`	Aggregates GraphClust cross-species CM-search results across both UTR regions and 12 species. Includes per-cluster paralogue-fraction columns and a convergence summary.	Sections 4.3, 3.6.6, 3.6.7
`cross_species_5UTR.ipynb`	5'UTR-specific cross-kingdom aggregation. Produces the cluster overview behind Figure 4.5.	Section 4.3
`alpha_sweep.ipynb`	Duda–Hart α-parameter sweep for RNAsoup partitioning. Generates the calibration plots that motivate the chosen α.	Section 3.5.4

`RNAclustPY/` — Python Port of RNAclust

A Python reimplementation of the RNAclust clustering pipeline, originally written in Perl. Used for hierarchical clustering of RNA secondary structures based on pairwise distances.

Script	Purpose
`RNAclust.py`	Main clustering pipeline
`fastcluster_tree.py`	Hierarchical clustering with fastcluster
`locarna_rnafold_pp.py`	LocARNA interface for structure-aware alignment
`rnaclustAlignRange.py`	Alignment of cluster subranges
`rnaclustCleanAln.py`	Alignment cleaning
`rnaclustScores2Dist.py`	Score-to-distance conversion
`rnasoup.py`	Cluster visualisation (RNAsoup format)
`rnasoup_consMFE.py`	Consensus MFE calculation

Data Flow

The central data structure is masterDataFrame.parquet (one per species, 0.7–6.9 GB). It aggregates all computed features — sequences, region boundaries, GC content, MFE values, opening energies, RNALfold local structures, codon bias metrics — into a single Polars dataframe. All downstream analysis notebooks read from this file.

GFF3 / FASTA files
       │
       ▼
  masterDataFrame.parquet  (per species)
  ┌────┴─────────────────────────────────────────────────┐
  │  Sequences, region boundaries, GC content,           │
  │  MFE, opening energies (RNAplfold), RNALfold         │
  │  local structures, codon bias metrics (ENC, CAI, …)  │
  └──┬───────────┬───────────┬───────────────────────────┘
     │           │           │
     ▼           │           ▼
  Analysis       │      Structural motif discovery
  notebooks      │           │
  (Sec. 4.1–4.2) │           ├──► predTED prefiltering (separate repo)
     │           │           │         │
     ▼           │           │         ▼
  Figures:       │           │    RNAclust clustering
  GC profiles,   │           │         │
  opening        │           │         ▼
  energies,      │           │    CM library ──► cmscan
  codon bias     │           │
                 │           └──► RNAsearch
                 │
                 ▼
            Shuffling controls
            (mono-/di-/trinucleotide, codon)

Thesis Chapter 3 — Materials & Methods → Code Mapping

The tables below map every subsection of Chapter 3 to the primary code that produced its results. For external tool binaries (ViennaRNA, Infernal, OrthoFinder, GraphClust2) the wrapper script in this repo is listed; the binary itself is an external dependency. A more granular mapping (every M&M item, all helper functions, output-artefact paths, known gaps) is provided in MATERIALS_AND_METHODS_MAPPING.md.

3.1 Data sources

§	Method	Primary code	Secondary / library
3.1.1	Reference genomes (download)	external (GENCODE, Ensembl, RAP-DB, SGD) — versions in thesis Table 3.1	—
3.1.2	Transcript-feature extraction → `masterDataFrame.parquet`	`generate_masterDataFrame.py`, `masterDataFrame.ipynb`	`my_module/bed_files.py`, `gff3_reader.py`, `my_module/dataframes.py`, `my_module/io_operations.py`

3.2 Phylogenetic context

§	Method	Primary code	Secondary
3.2.1	Proteome preparation (longest isoform per gene)	`Phylo.ipynb`	`my_module/dataframes.py`
3.2.2	Orthogroup inference & rooted species tree	external `OrthoFinder` (SLURM) + `Phylo.ipynb`	post-processing in `Phylo.ipynb`

3.3 Sequence-level analyses

§	Method	Primary code	Secondary
3.3.1	Regional and windowed GC profiles	`data_statistics3.ipynb`	`my_module/statistics.py`, `my_module/dataframes.py`
3.3.2	Codon usage bias (RSCU, ENC)	`codon_bias.ipynb`	`my_module/codon_bias.py`, `my_module/codonmapping.py`, `cai` library
3.3.3	Kozak PWM and information content	`kozak_sequence.ipynb`	`my_module/kozak_sequence.py`
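
For reference, RSCU (row 3.3.2) is the observed count of a codon divided by the mean count of its synonymous codons. A minimal sketch using the standard genetic code follows; the function name is made up, and this is not the implementation in `my_module/codon_bias.py`.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> Dict[str, float]:
    """RSCU per codon for amino acids that occur in the CDS (stop codons skipped)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if aa == "*" or total == 0:
            continue
        for c in codons:
            # observed / (total / family size)
            values[c] = counts[c] * len(codons) / total
    return values


example = rscu("GCTGCTGCCGCA")  # four alanine codons
```

An RSCU of 1.0 means a codon is used exactly as often as expected under uniform synonymous usage; values above 1.0 indicate preference.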

3.4 RNA secondary-structure prediction

§	Method	Primary code	Secondary
3.4.1	RNAfold global MFE + GC quadratic fit	`RNALfold.ipynb` (MFE/GC cells)	`my_module/rna_structures.py`, `my_module/statistics.py`
3.4.2	Length-normalised CDS GC profile	`data_statistics3.ipynb`	same as 3.3.1
3.4.3	Local accessibility (RNAplfold opening energy)	`process_openenergies.py`, `analyze_multi_species_open_energy.py`, `RNAplfold.ipynb`	`my_module/openenergies.py`
3.4.4	Sequence shuffling controls (uShuffle k=1/k=2, CodonShuffle dn231, both-ends mode)	`update_all_species_shuffling.py`, `run_shuffle.py`, `shuffling_transcripts.ipynb`	`my_module/shuffling.py`, `my_module/CodonShuffle3.py`
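
The opening energy in row 3.4.3 relates to the probability that a region is unpaired through the standard thermodynamic relation ΔG_open = −RT ln(p_unpaired). The sketch below applies that relation; the function name and the default temperature of 37 °C are assumptions, and it is not the code in `my_module/openenergies.py`.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol K)


def opening_energy(p_unpaired: float, temperature_c: float = 37.0) -> float:
    """Energy in kcal/mol needed to open a region, from its unpaired probability."""
    if not 0.0 < p_unpaired <= 1.0:
        raise ValueError("p_unpaired must be in (0, 1]")
    rt = GAS_CONSTANT * (temperature_c + 273.15)
    return -rt * math.log(p_unpaired)


# A region that is unpaired in 1% of the ensemble costs more to open
# than one that is unpaired in 50% of the ensemble.
rare = opening_energy(0.01)
common = opening_energy(0.5)
```

Because the relation is logarithmic, averaging opening energies across transcripts is not the same as converting an averaged probability; which of the two a profile reports should be stated alongside it.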

3.5 Comparative structure analysis

§	Method	Primary code	Secondary
3.5.1	RNALfold local-structure extraction (z ≤ −2)	`RNALfold_zscore_regions.py`, `RNALfold_regions.py`, `RNALfold.ipynb`	`lfolds_to_bed6.py`
3.5.2	Essentiality filter (region-shuffled robustness)	`RNALfold_zscore_regions.py` (multi-shuffled loop) + `run_filter.py`	`my_module/shuffling.py::shuffle_region`
3.5.3	predTED prefilter	external github.com/syseitz/predTED; local drivers `prefiltering.py`, `compute_distances.py`, `sparse_to_subclusters.py`	LightGBM, C++/OpenMP
3.5.4	Hierarchical clustering, RNAclust + RNAsoup, RNAz, R-Scape	`MSA with RNAclust.py.ipynb`, `balanced_subclustering.py`, `compute_dh_partitions.py`, `make_rnaclust_array_jobs.py`	`RNAclustPY/*` (full pipeline), `validate_annotation_transfer.py`, `run_rnaz_mouse_3utr.py`
3.5.5.1	Whole-transcript Rfam validation (unbiased)	`run_cmscan.py`, `run_rnasearch_rfam_overlap.py`, `MSA with RNAclust.py.ipynb` (cells in §"Unbiased Rfam validation")	`my_module/cmscan_output_parser.py`, `my_module/rfam_overlap.py`
3.5.5.2	Rfam label propagation (explored & discarded)	`scan_annotation_transfers.py`, `validate_annotation_transfer.py`	—
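
The z ≤ −2 criterion in row 3.5.1 can be illustrated with the textbook empirical z-score: the native MFE compared with the MFE distribution of shuffled controls. How the repository's scripts obtain z is not shown in this README, so the functions and numbers below are a hypothetical sketch.

```python
from statistics import mean, stdev
from typing import Iterable, List, Sequence, Tuple


def mfe_zscore(native_mfe: float, shuffled_mfes: Sequence[float]) -> float:
    """z = (native - mean(shuffled)) / sd(shuffled); negative means more stable."""
    sd = stdev(shuffled_mfes)  # sample standard deviation, needs >= 2 values
    if sd == 0:
        raise ValueError("shuffled MFEs have zero variance")
    return (native_mfe - mean(shuffled_mfes)) / sd


def keep_stable(
    candidates: Iterable[Tuple[str, float, Sequence[float]]], threshold: float = -2.0
) -> List[str]:
    """Return names of candidates whose z-score is at or below the threshold."""
    return [
        name
        for name, mfe, background in candidates
        if mfe_zscore(mfe, background) <= threshold
    ]


background = [-10.0, -12.0, -11.0, -9.0, -13.0]  # made-up shuffled MFEs, kcal/mol
selected = keep_stable([("a", -16.0, background), ("b", -12.0, background)])
```

With only a handful of shuffles the standard deviation is poorly estimated, so in practice a larger control set per structure gives a more reliable z-score.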

3.6 Pooled multi-species GraphClust

§	Method	Primary code	Secondary
3.6.1–3.6.2	Pipeline overview, configuration (80×3 rounds)	`graphclust_local.py`	external github.com/syseitz/GraphClust (NSPDK streaming patch)
3.6.3	Pooled FASTA construction (12-species union)	`extract_lfold_fastas.py`	—
3.6.4	Workstation runs (24-thread, 64 GB)	`graphclust_local.py`	—
3.6.5	Cluster-level alignment quality, R-Scape, window-level Rfam	`analyze_cluster_quality.py`, `run_optimized_aggregation.py`, `run_optimized_aggregation_5utr.py`, `family_cm_roundtrip.py`, `validate_annotation_transfer.py`	—
3.6.6	g:Profiler functional enrichment	`run_optimized_aggregation.py` (gene-list export) + web interface at biit.cs.ut.ee/gprofiler — see Known gaps	`cross_species_analysis.ipynb`
3.6.7	Pcdh / pcdh2a paralogue contamination filter	`my_module/test_optimized_aggregation.py::test_optimized_mcl` (ad-hoc analysis inside the aggregation loop)	`cross_species_analysis.ipynb`

3.7 Pipeline summary

Element	Code	Notes
Side-by-side overview figure	`Master_Thesis/figures/tikz_methods_overview.tex`	TikZ schematic

Reproduction paths for the headline results

Thesis figure / table	Steps
Table 4.1 (transcript counts, GC by region)	`generate_masterDataFrame.py` → `data_statistics3.ipynb` § "Region-level GC"
Figure 4.5 (5'UTR pooled GraphClust cluster overview)	`extract_lfold_fastas.py` → `graphclust_local.py` → `run_optimized_aggregation_5utr.py` → `analyze_cluster_quality.py` → `validate_annotation_transfer.py` → `cross_species_5UTR.ipynb`
Figure 4.7 (Mouse 3'UTR Pcdhg case study)	`RNALfold_zscore_regions.py` → predTED sparse-mode → `sparse_to_subclusters.py` → `MSA with RNAclust.py.ipynb` (Mouse 3'UTR cells) → `validate_annotation_transfer.py` → figures via `regenerate_figures.py`

Known gaps

g:Profiler (§3.6.6) is not wrapped in Python in this repo. The gene lists exported by run_optimized_aggregation.py were submitted manually through the g:Profiler web interface. To reproduce, either repeat the manual submission or add a wrapper around the gprofiler-official PyPI package.
openenergies.ipynb is a thin wrapper (3 cells). The actual computation lives in process_openenergies.py and average_openenergies_shuffled.py. The notebook is kept for sanity-checking, not as a reproduction driver.
codon_mapping.ipynb (per-codon detail) is not committed. The committed alternative is codon_bias.ipynb for the headline RSCU/ENC results.
External tool binaries (OrthoFinder, GraphClust2, Infernal, ViennaRNA) are not part of this repo; versions are pinned in thesis §3.

Reproducibility

All analyses were run on the SLURM cluster of the Institute for Theoretical Chemistry (TBI), University of Vienna, with conda environment agat. The my_module.slurm_cluster module automates job submission. Due to the large data volume (~100 GB per species), raw data and intermediate results are not included in this repository.

To reproduce the analysis:

1. Obtain reference genomes and annotations (see Thesis Table 3.1)
2. Generate master dataframes using the pipeline in my_module/dataframes.py
3. Run structural predictions (RNALfold, RNAplfold) via SLURM
4. Execute the notebooks in the order described above

License

MIT License — see LICENSE for details.
