
syseitz/master-s-thesis


Comparative Analysis of RNA Secondary Structure across Eukaryotic mRNAs

Master's Thesis by Sören Yannick Seitz, University of Vienna
Supervisor: Univ.-Prof. Dipl.-Phys. Dr. Ivo Hofacker
Co-Supervisors: Mag. Dr. Michael Thomas Wolfinger, Mag. Stefan Badelt, PhD
Department of Theoretical Chemistry, Faculty of Chemistry

This repository contains the analysis code and custom Python library for a comparative study of RNA secondary structure features across 12 eukaryotic species.

Species

Category | Species
Mammals | Human, Mouse, Cow, Bat, Platypus
Birds | Guineafowl
Fish | Zebrafish
Insects | Fruit fly (Drosophila melanogaster)
Plants | Arabidopsis, Maize, Rice
Fungi | Yeast (Saccharomyces cerevisiae)

Related Repositories

  • predTED — A gradient-boosted regression model for predicting RNA tree-edit distances from structural features. Used as the prefiltering step in the clustering pipeline (Thesis Sections 3.5 and 4.3).
  • Polars — The fast DataFrame library used throughout this project. All per-species data is stored as masterDataFrame.parquet and processed with Polars (polars-lts-cpu).

Repository Structure

├── my_module/                       # Core Python library (see below)
├── RNAclustPY/                      # Python port of RNAclust (clustering pipeline)
│
│  Data generation (write to masterDataFrame.parquet)
├── workflow.ipynb                    # Main orchestration pipeline
├── masterDataFrame.ipynb            # Central dataframe generation
├── RNALfold.ipynb                   # Local structure prediction (SLURM)
├── RNAplfold.ipynb                  # Opening energy calculation (SLURM)
├── shuffling_transcripts.ipynb      # Sequence shuffling controls
│
│  Analysis (read from masterDataFrame.parquet)
├── data_statistics3.ipynb           # Transcript & GC statistics
├── kozak_sequence.ipynb             # GC profiles around start/stop codons
├── codon_bias.ipynb                 # Codon usage analysis
├── Phylo.ipynb                      # Phylogenetic tree
├── openenergies.ipynb               # Opening energy profiles
│
│  Structural motif discovery
├── MSA with RNAclust.py.ipynb       # Clustering pipeline & Rfam validation
├── RNASearch_quick.ipynb            # RNAsearch motif search
│
├── setup.py                         # Package installation
└── LICENSE                          # MIT License

Installation

# Clone the repository
git clone https://github.com/syseitz/master-s-thesis.git
cd master-s-thesis

# Install in development mode (requires conda environment with ViennaRNA, Infernal)
pip install -e .

Dependencies: polars-lts-cpu, tqdm, numpy
Conda packages (install via conda): ViennaRNA, Infernal, MAFFT, BioPython

my_module/ — Core Library

All analysis logic is implemented as reusable functions in my_module/. Notebooks only import the library and call its functions.

API Documentation — Full reference for all modules, auto-generated from docstrings.

import my_module as RNA

Module Overview

Module | Purpose | Thesis reference
dataframes.py | Loading and processing the master dataframe (Polars) per species | Chapter 3
io_operations.py | I/O for BED, FASTA, MFE, GFF3 files | Chapter 3
isoforms.py | Isoform selection (longest, MANE Select) | Section 3.1
statistics.py | Statistical analysis and matplotlib visualisations | Chapters 4–5
shuffling.py | Sequence shuffling orchestration (mono-, di-, trinucleotide, codon) | Section 3.4
CodonShuffle3.py | Codon shuffling with GC/AT preservation at 3rd codon position | Section 3.4
openenergies.py | RNA opening energy calculations with confidence intervals | Section 4.2
rna_structures.py | RNA secondary structure prediction via ViennaRNA | Section 3.4
codon_bias.py | Codon usage analysis (ENC, gtAI, CAI, FOP) | Section 4.1
kozak_sequence.py | Kozak motif extraction and GC profiles around start/stop codons | Section 4.1
cmscan_output_parser.py | Parsing Rfam cmscan .tblout results | Section 4.3
rfam.py | Rfam data handling and annotation | Section 4.3
make_cm_library.py | Building covariance model libraries from consensus structures | Section 3.5
run_cmscan.py | Automated cmscan execution against CM libraries | Section 3.5
pairwise_alignments.py | Pairwise RNA structure alignments | Section 3.5
AlifoldConsensusFinder.py | Consensus structure derivation from multiple sequence alignments | Section 3.5
rnasearch.py | RNAsearch interface for motif discovery | Section 4.3
RNAsoup.py | Cluster viewer for RNAclust results | Section 3.5
rna_clust_tree.py | RNAclust dendrogram visualisation | Section 3.5
bed_files.py | BED6 format processing for RNALfold output | Section 3.4
nucleotide_count.py | Nucleotide counting utilities | Chapter 3
codonmapping.py | Codon-to-position mapping generation | Section 3.3
cluster_processing.py | SLURM pipeline for cross-species batch processing | Chapter 3
slurm_cluster.py | SLURM job submission and monitoring | Chapter 3
mane_select.py | MANE Select transcript handling (Human) | Section 3.1
rpfdb.py | Ribosome profiling database access |
formatting.py | Terminal output formatting |
logging_config.py | Logging configuration |

Notebooks

Data Generation Notebooks

Notebook | Purpose | Thesis reference
workflow.ipynb | Main orchestration pipeline; coordinates data processing across all 12 species | Chapter 3
masterDataFrame.ipynb | Builds the central masterDataFrame.parquet per species (sequences, regions, GC, MFE) | Chapter 3
RNALfold.ipynb | Generates SLURM jobs for local structure prediction (z-score ≤ −2); results are stored in the master dataframe | Section 3.4
RNAplfold.ipynb | Generates SLURM jobs for opening energy calculation (window = 210 nt, span = 100 nt); results are stored in the master dataframe | Section 3.4
shuffling_transcripts.ipynb | Generates shuffled control sequences (mono-, di-, trinucleotide, codon); shuffled opening energies are stored in the master dataframe | Section 3.4
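As an illustration of the simplest shuffling control listed above, the following is a minimal sketch of a mononucleotide shuffle. It is not the code in my_module/shuffling.py; the function name and the example sequence are hypothetical. A plain permutation preserves base composition only, so di- and trinucleotide or codon controls need dedicated algorithms.

```python
import random
from collections import Counter
from typing import Optional


def shuffle_mononucleotide(seq: str, seed: Optional[int] = None) -> str:
    """Return a random permutation of seq (hypothetical helper).

    Only the mononucleotide composition is preserved; dinucleotide
    frequencies and codon structure are destroyed.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Made-up example transcript fragment
original = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
shuffled = shuffle_mononucleotide(original, seed=42)

# Composition is unchanged, order is randomised
print(Counter(original) == Counter(shuffled))
```

Passing an explicit seed makes each control reproducible, which helps when the same shuffled set has to be regenerated on a cluster.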

Analysis Notebooks (read from masterDataFrame.parquet)

Notebook | Purpose | Thesis reference
data_statistics3.ipynb | Transcript length distributions, regional GC content across species | Section 4.1
kozak_sequence.ipynb | GC content profiles around start and stop codons | Section 4.1
codon_bias.ipynb | Codon usage analysis (RSCU, ENC, CAI, FOP, gtAI) | Section 4.1
Phylo.ipynb | OrthoFinder proteome preparation and phylogenetic tree | Section 4.1
openenergies.ipynb | Opening energy profiles around start/stop codons with shuffling controls | Section 4.2
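For readers unfamiliar with RSCU (relative synonymous codon usage), one of the metrics named for codon_bias.ipynb, here is a minimal sketch of the standard definition: the observed count of a codon divided by the mean count over its synonymous family. This is not the implementation in my_module/codon_bias.py; the function name and the example CDS are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict

# Standard genetic code in the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds: str) -> Dict[str, float]:
    """RSCU per codon for one coding sequence (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids absent from the CDS
        for codon in family:
            # observed count divided by the family mean (total / family size)
            values[codon] = counts[codon] * len(family) / total
    return values


# Made-up CDS: ATG GCT GCT GCC GCA AAG AAA TAA
values = rscu("ATGGCTGCTGCCGCAAAGAAATAA")
print(values["GCT"], values["GCG"], values["AAA"])
```

Single-codon families (Met, Trp) always score 1.0 in this sketch; some analyses drop them because they carry no information about codon preference.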

Structural Motif Discovery Notebooks

MSA with RNAclust.py.ipynb — Structural Motif Discovery and Rfam Validation

This is the main analysis notebook for the structural motif discovery pipeline (Thesis Chapter 4, Section 4.3).

Pipeline steps:

  1. Prefiltering — Extract local structures from RNALfold predictions, filter by z-score
  2. Pairwise distance estimation — Use predTED to estimate tree-edit distances between structures
  3. Hierarchical clustering — Group similar structures using fastcluster
  4. Structure-aware MSA — Align clusters with LocARNA/mLocarna
  5. Consensus structures — Derive consensus secondary structures (RNAalifold)
  6. CM library construction — Build covariance models from consensus structures
  7. Cross-species cmscan — Scan species transcriptomes against CM libraries
  8. Unbiased Rfam validation — Overlap whole-transcript cmscan hits with local structure clusters (cells 56–70)
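The z-score filter in step 1 can be sketched as below. This is a hedged illustration, not the repository's parser: the regular expression assumes hit lines of the form "structure (energy) start z= value", which should be checked against the actual RNALfold -z output files, and the function and field names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Assumed line layout: "<dot-bracket> (<energy>) <start> z= <z-score>"
HIT = re.compile(
    r"^(?P<structure>[.()]+)\s+\(\s*(?P<energy>-?\d+\.\d+)\)"
    r"\s+(?P<start>\d+)\s+z=\s*(?P<z>-?\d+\.\d+)"
)


def filter_local_structures(lines: Iterable[str], max_z: float = -2.0) -> List[Dict]:
    """Keep local structures whose z-score is at most max_z."""
    hits = []
    for line in lines:
        match = HIT.match(line.strip())
        if match is None:
            continue  # headers, sequence lines and summary lines are skipped
        z = float(match["z"])
        if z <= max_z:
            start = int(match["start"])
            structure = match["structure"]
            hits.append({
                "structure": structure,
                "energy": float(match["energy"]),
                "start": start,
                "end": start + len(structure) - 1,  # 1-based, inclusive
                "z": z,
            })
    return hits


# Made-up example lines in the assumed layout
example = [
    ">transcript_1",
    ".((((....)))). ( -3.40)  105 z= -2.143",
    "..((((((.....)))))). ( -5.10)   40 z= -1.210",
    "(((((...((....))...))))) ( -9.80)   12 z= -3.502",
    "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG",
]
hits = filter_local_structures(example)
for hit in hits:
    print(hit["start"], hit["end"], hit["z"])
```

The default threshold of −2 mirrors the z-score cut-off stated for RNALfold.ipynb; the resulting start/end pairs are what a BED-style export would need.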

Thesis figures generated:

  • Rfam validation rates per species and region (Section 4.3.3)
  • Conserved Rfam family heatmaps across species (Section 4.3.3)
  • Detailed overlap tables for Human (Section 4.3.3)

RNASearch_quick.ipynb — RNAsearch-Based Motif Search

Motif discovery using the RNAsearch tool on Human 5'UTR structures (Thesis Section 4.3.4).

Pipeline steps:

  1. Load master dataframe and select longest isoforms
  2. Read RNALfold predictions for 5'UTR regions
  3. Compute global structures and export in .dbn format
  4. Run RNAsearch for substructure motif detection
  5. Compare with mRNA-level cmscan results (unbiased approach)

Cross-Species Aggregation Notebooks

Notebook | Purpose | Thesis reference
cross_species_analysis.ipynb | Aggregates GraphClust cross-species CM-search results across both UTR regions and 12 species. Per-cluster paralogue-fraction columns and convergence summary. | Sections 4.3, 3.6.6, 3.6.7
cross_species_5UTR.ipynb | 5'UTR-specific cross-kingdom aggregation. Produces the cluster overview behind Figure 4.5. | Section 4.3
alpha_sweep.ipynb | Duda–Hart α-parameter sweep for RNAsoup partitioning. Generates the calibration plots that motivate the chosen α. | Section 3.5.4

RNAclustPY/ — Python Port of RNAclust

A Python reimplementation of the RNAclust clustering pipeline, originally written in Perl. Used for hierarchical clustering of RNA secondary structures based on pairwise distances.

Script | Purpose
RNAclust.py | Main clustering pipeline
fastcluster_tree.py | Hierarchical clustering with fastcluster
locarna_rnafold_pp.py | LocARNA interface for structure-aware alignment
rnaclustAlignRange.py | Alignment of cluster subranges
rnaclustCleanAln.py | Alignment cleaning
rnaclustScores2Dist.py | Score-to-distance conversion
rnasoup.py | Cluster visualisation (RNAsoup format)
rnasoup_consMFE.py | Consensus MFE calculation

Data Flow

The central data structure is masterDataFrame.parquet (one per species, 0.7–6.9 GB). It aggregates all computed features — sequences, region boundaries, GC content, MFE values, opening energies, RNALfold local structures, codon bias metrics — into a single Polars dataframe. All downstream analysis notebooks read from this file.

GFF3 / FASTA files
       │
       ▼
  masterDataFrame.parquet  (per species)
  ┌────┴─────────────────────────────────────────────────┐
  │  Sequences, region boundaries, GC content,           │
  │  MFE, opening energies (RNAplfold), RNALfold         │
  │  local structures, codon bias metrics (ENC, CAI, …)  │
  └──┬───────────┬───────────┬───────────────────────────┘
     │           │           │
     ▼           │           ▼
  Analysis       │      Structural motif discovery
  notebooks      │           │
  (Sec. 4.1–4.2)│           ├──► predTED prefiltering (separate repo)
     │           │           │         │
     ▼           │           │         ▼
  Figures:       │           │    RNAclust clustering
  GC profiles,   │           │         │
  opening        │           │         ▼
  energies,      │           │    CM library ──► cmscan
  codon bias     │           │
                 │           └──► RNAsearch
                 │
                 ▼
            Shuffling controls
            (mono-/di-/trinucleotide, codon)

Thesis Chapter 3 — Materials & Methods → Code Mapping

The tables below map every subsection of Chapter 3 to the primary code that produced its results. For external tool binaries (ViennaRNA, Infernal, OrthoFinder, GraphClust2), the wrapper script in this repo is listed; the binary itself is an external dependency. A more granular mapping (every M&M item, all helper functions, output-artefact paths, known gaps) is provided in MATERIALS_AND_METHODS_MAPPING.md.

3.1 Data sources

§ | Method | Primary code | Secondary / library
3.1.1 | Reference genomes (download) | external (GENCODE, Ensembl, RAP-DB, SGD) | versions in thesis Table 3.1
3.1.2 | Transcript-feature extraction → masterDataFrame.parquet | generate_masterDataFrame.py, masterDataFrame.ipynb | my_module/bed_files.py, gff3_reader.py, my_module/dataframes.py, my_module/io_operations.py

3.2 Phylogenetic context

§ | Method | Primary code | Secondary
3.2.1 | Proteome preparation (longest isoform per gene) | Phylo.ipynb | my_module/dataframes.py
3.2.2 | Orthogroup inference & rooted species tree | external OrthoFinder (SLURM) + Phylo.ipynb | post-processing in Phylo.ipynb

3.3 Sequence-level analyses

§ | Method | Primary code | Secondary
3.3.1 | Regional and windowed GC profiles | data_statistics3.ipynb | my_module/statistics.py, my_module/dataframes.py
3.3.2 | Codon usage bias (RSCU, ENC) | codon_bias.ipynb | my_module/codon_bias.py, my_module/codonmapping.py, cai library
3.3.3 | Kozak PWM and information content | kozak_sequence.ipynb | my_module/kozak_sequence.py
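The Kozak PWM and information-content step (3.3.3) can be illustrated with a minimal sketch. It is not the code in my_module/kozak_sequence.py: the function names and the aligned example windows are hypothetical, and no small-sample correction is applied. Information content per position is computed as 2 + Σ p·log2(p) bits over the four bases.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

ALPHABET = "ACGU"


def position_frequencies(sites: Sequence[str]) -> List[Dict[str, float]]:
    """Per-position base frequencies for equal-length aligned windows."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all windows must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        total = sum(counts[base] for base in ALPHABET)
        matrix.append({base: counts[base] / total for base in ALPHABET})
    return matrix


def information_content(matrix: List[Dict[str, float]]) -> List[float]:
    """Bits per position: 2 minus the Shannon entropy of the column."""
    return [
        2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
        for column in matrix
    ]


# Made-up windows around a start codon (AUG at indices 6 to 8)
sites = ["GCCACCAUGG", "ACCGCCAUGG", "CAAACCAUGG", "UUUGCCAUGA"]
matrix = position_frequencies(sites)
bits = information_content(matrix)
print([round(b, 2) for b in bits])
```

Fully conserved positions such as the AUG reach the maximum of 2 bits, while a column with all four bases equally frequent scores 0 bits.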

3.4 RNA secondary-structure prediction

§ | Method | Primary code | Secondary
3.4.1 | RNAfold global MFE + GC quadratic fit | RNALfold.ipynb (MFE/GC cells) | my_module/rna_structures.py, my_module/statistics.py
3.4.2 | Length-normalised CDS GC profile | data_statistics3.ipynb | same as 3.3.1
3.4.3 | Local accessibility (RNAplfold opening energy) | process_openenergies.py, analyze_multi_species_open_energy.py, RNAplfold.ipynb | my_module/openenergies.py
3.4.4 | Sequence shuffling controls (uShuffle k=1/k=2, CodonShuffle dn231, both-ends mode) | update_all_species_shuffling.py, run_shuffle.py, shuffling_transcripts.ipynb | my_module/shuffling.py, my_module/CodonShuffle3.py
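One plausible reading of the length-normalised CDS GC profile (3.4.2) is sketched below: each CDS is cut into a fixed number of nearly equal bins so that transcripts of different length can be averaged position by position. The binning scheme, the function names and the example sequence are assumptions for illustration, not the code in data_statistics3.ipynb.

```python
from typing import List, Sequence


def gc_fraction(seq: str) -> float:
    """Fraction of G and C in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def length_normalised_gc(cds: str, n_bins: int = 10) -> List[float]:
    """GC fraction in n_bins nearly equal segments along the CDS."""
    if len(cds) < n_bins:
        raise ValueError("sequence is shorter than the number of bins")
    # Bin edges on a relative 0..1 axis, mapped back to nucleotide indices
    edges = [round(i * len(cds) / n_bins) for i in range(n_bins + 1)]
    return [gc_fraction(cds[a:b]) for a, b in zip(edges, edges[1:])]


def mean_profile(cds_list: Sequence[str], n_bins: int = 10) -> List[float]:
    """Average the per-transcript profiles bin by bin."""
    profiles = [length_normalised_gc(cds, n_bins) for cds in cds_list]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


# Made-up 24-nt sequence, six bins of four nucleotides each
cds = "GGGGCCCCAAAAUUUUGCGCAUAU"
profile = length_normalised_gc(cds, n_bins=6)
print(profile)
```

Because every transcript contributes exactly one value per bin, long transcripts do not dominate the averaged profile.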

3.5 Comparative structure analysis

§ | Method | Primary code | Secondary
3.5.1 | RNALfold local-structure extraction (z ≤ −2) | RNALfold_zscore_regions.py, RNALfold_regions.py, RNALfold.ipynb | lfolds_to_bed6.py
3.5.2 | Essentiality filter (region-shuffled robustness) | RNALfold_zscore_regions.py (multi-shuffled loop) + run_filter.py | my_module/shuffling.py::shuffle_region
3.5.3 | predTED prefilter | external github.com/syseitz/predTED; local drivers prefiltering.py, compute_distances.py, sparse_to_subclusters.py | LightGBM, C++/OpenMP
3.5.4 | Hierarchical clustering, RNAclust + RNAsoup, RNAz, R-Scape | MSA with RNAclust.py.ipynb, balanced_subclustering.py, compute_dh_partitions.py, make_rnaclust_array_jobs.py | RNAclustPY/* (full pipeline), validate_annotation_transfer.py, run_rnaz_mouse_3utr.py
3.5.5.1 | Whole-transcript Rfam validation (unbiased) | run_cmscan.py, run_rnasearch_rfam_overlap.py, MSA with RNAclust.py.ipynb (cells in §"Unbiased Rfam validation") | my_module/cmscan_output_parser.py, my_module/rfam_overlap.py
3.5.5.2 | Rfam label propagation (explored & discarded) | scan_annotation_transfers.py, validate_annotation_transfer.py |

3.6 Pooled multi-species GraphClust

§ | Method | Primary code | Secondary
3.6.1–3.6.2 | Pipeline overview, configuration (80×3 rounds) | graphclust_local.py | external github.com/syseitz/GraphClust (NSPDK streaming patch)
3.6.3 | Pooled FASTA construction (12-species union) | extract_lfold_fastas.py |
3.6.4 | Workstation runs (24-thread, 64 GB) | graphclust_local.py |
3.6.5 | Cluster-level alignment quality, R-Scape, window-level Rfam | analyze_cluster_quality.py, run_optimized_aggregation.py, run_optimized_aggregation_5utr.py, family_cm_roundtrip.py, validate_annotation_transfer.py |
3.6.6 | g:Profiler functional enrichment | run_optimized_aggregation.py (gene-list export) + web interface at biit.cs.ut.ee/gprofiler (see Known gaps) | cross_species_analysis.ipynb
3.6.7 | Pcdh / pcdh2a paralogue contamination filter | my_module/test_optimized_aggregation.py::test_optimized_mcl (ad-hoc analysis inside the aggregation loop) | cross_species_analysis.ipynb

3.7 Pipeline summary

Element | Code | Notes
Side-by-side overview figure | Master_Thesis/figures/tikz_methods_overview.tex | TikZ schematic

Reproduction paths for the headline results

Thesis figure / table | Steps
Table 4.1 (transcript counts, GC by region) | generate_masterDataFrame.py → data_statistics3.ipynb § "Region-level GC"
Figure 4.5 (5'UTR pooled GraphClust cluster overview) | extract_lfold_fastas.py → graphclust_local.py → run_optimized_aggregation_5utr.py → analyze_cluster_quality.py → validate_annotation_transfer.py → cross_species_5UTR.ipynb
Figure 4.7 (Mouse 3'UTR Pcdhg case study) | RNALfold_zscore_regions.py → predTED sparse-mode → sparse_to_subclusters.py → MSA with RNAclust.py.ipynb (Mouse 3'UTR cells) → validate_annotation_transfer.py → figures via regenerate_figures.py

Known gaps

  • g:Profiler (§3.6.6) is not wrapped in Python in this repo. The gene lists exported by run_optimized_aggregation.py were submitted manually through the g:Profiler web interface. To reproduce: either repeat the manual submission, or add a gprofiler-official PyPI wrapper.
  • openenergies.ipynb is a thin wrapper (3 cells). The actual computation lives in process_openenergies.py and average_openenergies_shuffled.py. The notebook is kept for sanity-checking, not as a reproduction driver.
  • codon_mapping.ipynb (per-codon detail) is not committed. The committed alternative is codon_bias.ipynb for the headline RSCU/ENC results.
  • External tool binaries (OrthoFinder, GraphClust2, Infernal, ViennaRNA) are not part of this repo; versions are pinned in thesis §3.

Reproducibility

All analyses were run on the SLURM cluster of the Institute for Theoretical Chemistry (TBI), University of Vienna, with conda environment agat. The my_module.slurm_cluster module automates job submission. Due to the large data volume (~100 GB per species), raw data and intermediate results are not included in this repository.

To reproduce the analysis:

  1. Obtain reference genomes and annotations (see Thesis Table 3.1)
  2. Generate master dataframes using the pipeline in my_module/dataframes.py
  3. Run structural predictions (RNALfold, RNAplfold) via SLURM
  4. Execute the notebooks in the order described above

License

MIT License — see LICENSE for details.
