Master's Thesis: Sören Yannick Seitz, University of Vienna

Supervisor: Univ.-Prof. Dipl.-Phys. Dr. Ivo Hofacker

Co-Supervisors: Mag. Dr. Michael Thomas Wolfinger, Mag. Stefan Badelt, PhD

Department of Theoretical Chemistry, Faculty of Chemistry

This repository contains the analysis code and custom Python library for a comparative study of RNA secondary structure features across 12 eukaryotic species.
| Category | Species |
|---|---|
| Mammals | Human, Mouse, Cow, Bat, Platypus |
| Birds | Guineafowl |
| Fish | Zebrafish |
| Insects | Fruit fly (Drosophila melanogaster) |
| Plants | Arabidopsis, Maize, Rice |
| Fungi | Yeast (Saccharomyces cerevisiae) |
- predTED — A gradient-boosted regression model for predicting RNA tree-edit distances from structural features. Used as the prefiltering step in the clustering pipeline (Thesis Sections 3.5 and 4.3).
- Polars — The fast DataFrame library used throughout this project. All per-species data is stored as
  `masterDataFrame.parquet` and processed with Polars (`polars-lts-cpu`).
├── my_module/ # Core Python library (see below)
├── RNAclustPY/ # Python port of RNAclust (clustering pipeline)
│
│ Data generation (write to masterDataFrame.parquet)
├── workflow.ipynb # Main orchestration pipeline
├── masterDataFrame.ipynb # Central dataframe generation
├── RNALfold.ipynb # Local structure prediction (SLURM)
├── RNAplfold.ipynb # Opening energy calculation (SLURM)
├── shuffling_transcripts.ipynb # Sequence shuffling controls
│
│ Analysis (read from masterDataFrame.parquet)
├── data_statistics3.ipynb # Transcript & GC statistics
├── kozak_sequence.ipynb # GC profiles around start/stop codons
├── codon_bias.ipynb # Codon usage analysis
├── Phylo.ipynb # Phylogenetic tree
├── openenergies.ipynb # Opening energy profiles
│
│ Structural motif discovery
├── MSA with RNAclust.py.ipynb # Clustering pipeline & Rfam validation
├── RNASearch_quick.ipynb # RNAsearch motif search
│
├── setup.py # Package installation
└── LICENSE # MIT License
# Clone the repository
git clone https://github.com/syseitz/master-s-thesis.git
cd master-s-thesis
# Install in development mode (requires conda environment with ViennaRNA, Infernal)
pip install -e .

Dependencies: polars-lts-cpu, tqdm, numpy

Conda packages (not on PyPI): ViennaRNA, Infernal, MAFFT, BioPython
All analysis logic is implemented as reusable functions in my_module/. Notebooks only import the library and call its functions.
API Documentation — Full reference for all modules, auto-generated from docstrings.
import my_module as RNA

| Module | Purpose | Thesis reference |
|---|---|---|
| `dataframes.py` | Loading and processing the master dataframe (Polars) per species | Chapter 3 |
| `io_operations.py` | I/O for BED, FASTA, MFE, GFF3 files | Chapter 3 |
| `isoforms.py` | Isoform selection (longest, MANE Select) | Section 3.1 |
| `statistics.py` | Statistical analysis and matplotlib visualisations | Chapters 4–5 |
| `shuffling.py` | Sequence shuffling orchestration (mono-, di-, trinucleotide, codon) | Section 3.4 |
| `CodonShuffle3.py` | Codon shuffling with GC/AT preservation at 3rd codon position | Section 3.4 |
| `openenergies.py` | RNA opening energy calculations with confidence intervals | Section 4.2 |
| `rna_structures.py` | RNA secondary structure prediction via ViennaRNA | Section 3.4 |
| `codon_bias.py` | Codon usage analysis (ENC, gtAI, CAI, FOP) | Section 4.1 |
| `kozak_sequence.py` | Kozak motif extraction and GC profiles around start/stop codons | Section 4.1 |
| `cmscan_output_parser.py` | Parsing Rfam cmscan `.tblout` results | Section 4.3 |
| `rfam.py` | Rfam data handling and annotation | Section 4.3 |
| `make_cm_library.py` | Building covariance model libraries from consensus structures | Section 3.5 |
| `run_cmscan.py` | Automated cmscan execution against CM libraries | Section 3.5 |
| `pairwise_alignments.py` | Pairwise RNA structure alignments | Section 3.5 |
| `AlifoldConsensusFinder.py` | Consensus structure derivation from multiple sequence alignments | Section 3.5 |
| `rnasearch.py` | RNAsearch interface for motif discovery | Section 4.3 |
| `RNAsoup.py` | Cluster viewer for RNAclust results | Section 3.5 |
| `rna_clust_tree.py` | RNAclust dendrogram visualisation | Section 3.5 |
| `bed_files.py` | BED6 format processing for RNALfold output | Section 3.4 |
| `nucleotide_count.py` | Nucleotide counting utilities | Chapter 3 |
| `codonmapping.py` | Codon-to-position mapping generation | Section 3.3 |
| `cluster_processing.py` | SLURM pipeline for cross-species batch processing | Chapter 3 |
| `slurm_cluster.py` | SLURM job submission and monitoring | Chapter 3 |
| `mane_select.py` | MANE Select transcript handling (Human) | Section 3.1 |
| `rpfdb.py` | Ribosome profiling database access | - |
| `formatting.py` | Terminal output formatting | - |
| `logging_config.py` | Logging configuration | - |

| Notebook | Purpose | Thesis reference |
|---|---|---|
| `workflow.ipynb` | Main orchestration pipeline; coordinates data processing across all 12 species | Chapter 3 |
| `masterDataFrame.ipynb` | Builds the central `masterDataFrame.parquet` per species (sequences, regions, GC, MFE) | Chapter 3 |
| `RNALfold.ipynb` | Generates SLURM jobs for local structure prediction (z-score ≤ −2); results are stored in the master dataframe | Section 3.4 |
| `RNAplfold.ipynb` | Generates SLURM jobs for opening energy calculation (window = 210 nt, span = 100 nt); results are stored in the master dataframe | Section 3.4 |
| `shuffling_transcripts.ipynb` | Generates shuffled control sequences (mono-, di-, trinucleotide, codon); shuffled opening energies are stored in the master dataframe | Section 3.4 |
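
The RNAplfold settings listed for `RNAplfold.ipynb` (window = 210 nt, span = 100 nt) correspond to the tool's `-W` and `-L` options. The following is a minimal sketch of how such a command line could be assembled in Python; it is not the repository's SLURM wrapper. The helper name `build_rnaplfold_command` and the `-u 30` unpaired-region length are assumptions made for illustration, since this README does not state the value used.

```python
import shlex


def build_rnaplfold_command(window=210, span=100, ulength=30):
    """Assemble an RNAplfold argument list (illustrative sketch only).

    window  -> -W, averaging window size in nt (210 in this README)
    span    -> -L, maximum base-pair span in nt (100 in this README)
    ulength -> -u, longest unpaired stretch to report (assumed value)
    """
    return [
        "RNAplfold",
        "-W", str(window),
        "-L", str(span),
        "-u", str(ulength),
        "-O",  # write opening energies instead of unpaired probabilities
    ]


cmd = build_rnaplfold_command()
print(shlex.join(cmd))
```

RNAplfold reads the sequence from standard input, so a list like this would typically be passed to `subprocess.run` together with the FASTA text as input.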

| Notebook | Purpose | Thesis reference |
|---|---|---|
| `data_statistics3.ipynb` | Transcript length distributions, regional GC content across species | Section 4.1 |
| `kozak_sequence.ipynb` | GC content profiles around start and stop codons | Section 4.1 |
| `codon_bias.ipynb` | Codon usage analysis (RSCU, ENC, CAI, FOP, gtAI) | Section 4.1 |
| `Phylo.ipynb` | OrthoFinder proteome preparation and phylogenetic tree | Section 4.1 |
| `openenergies.ipynb` | Opening energy profiles around start/stop codons with shuffling controls | Section 4.2 |
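
Of the codon usage metrics listed for `codon_bias.ipynb`, RSCU (relative synonymous codon usage) is the simplest to state: the observed count of a codon divided by the mean count over all codons encoding the same amino acid. The sketch below illustrates that definition on a toy sequence. It is not the code in `my_module/codon_bias.py`; the function name `rscu` and the two-family codon table are assumptions chosen to keep the example short.

```python
from collections import Counter

# Only two synonymous families are listed here to keep the sketch short;
# a real analysis would cover the full genetic code.
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
}


def rscu(cds, families=SYNONYMOUS):
    """Return RSCU per codon: observed count / mean count within its family."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for family in families.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue  # amino acid absent, RSCU is undefined
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


example = rscu("AAAAAAAAGTTTTTC")  # codons: AAA AAA AAG TTT TTC
print(example)
```

An RSCU of 1 means a codon is used as often as expected under uniform synonymous usage; values above 1 indicate preference.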
`MSA with RNAclust.py.ipynb` is the main analysis notebook for the structural motif discovery pipeline (Thesis Chapter 4, Section 4.3).
Pipeline steps:
- Prefiltering — Extract local structures from RNALfold predictions, filter by z-score
- Pairwise distance estimation — Use predTED to estimate tree-edit distances between structures
- Hierarchical clustering — Group similar structures using fastcluster
- Structure-aware MSA — Align clusters with LocARNA/mLocarna
- Consensus structures — Derive consensus secondary structures (RNAalifold)
- CM library construction — Build covariance models from consensus structures
- Cross-species cmscan — Scan species transcriptomes against CM libraries
- Unbiased Rfam validation — Overlap whole-transcript cmscan hits with local structure clusters (cells 56–70)
Thesis figures generated:
- Rfam validation rates per species and region (Section 4.3.3)
- Conserved Rfam family heatmaps across species (Section 4.3.3)
- Detailed overlap tables for Human (Section 4.3.3)
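
The prefiltering step of this pipeline keeps only RNALfold local structures with a z-score at or below the cutoff (−2 elsewhere in this README). The following sketch shows what such a filter could look like. It assumes each hit line has the layout `<dot-bracket> (<energy>) <start> z= <z-score>`; that layout, the function name `filter_local_structures`, and the example lines are illustrative assumptions rather than the repository's parser.

```python
import re

# Assumed line layout: "<dot-bracket> (<energy>) <start> z= <z-score>"
HIT = re.compile(
    r"^([.()]+)\s+\(\s*(-?\d+\.\d+)\)\s+(\d+)\s+z=\s*(-?\d+\.\d+)"
)


def filter_local_structures(lines, z_max=-2.0):
    """Parse hit lines and keep those with z-score <= z_max."""
    hits = []
    for line in lines:
        match = HIT.match(line.strip())
        if match is None:
            continue  # sequence lines, headers, global MFE summary
        structure, energy, start, z = match.groups()
        if float(z) <= z_max:
            hits.append({
                "structure": structure,
                "energy": float(energy),
                "start": int(start),
                "z": float(z),
            })
    return hits


example_output = [
    ".((((....)))). ( -5.30)   17 z= -3.120",
    "((...)) ( -0.80)   40 z= -0.512",
    "GGGAAACCC",
]
kept = filter_local_structures(example_output)
print(kept)
```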
`RNASearch_quick.ipynb` performs motif discovery with the RNAsearch tool on Human 5'UTR structures (Thesis Section 4.3.4).
Pipeline steps:
- Load master dataframe and select longest isoforms
- Read RNALfold predictions for 5'UTR regions
- Compute global structures and export in `.dbn` format
- Run RNAsearch for substructure motif detection
- Compare with mRNA-level cmscan results (unbiased approach)

| Notebook | Purpose | Thesis reference |
|---|---|---|
| `cross_species_analysis.ipynb` | Aggregates GraphClust cross-species CM-search results across both UTR regions and 12 species. Per-cluster paralogue-fraction columns and convergence summary. | Sections 4.3, 3.6.6, 3.6.7 |
| `cross_species_5UTR.ipynb` | 5'UTR-specific cross-kingdom aggregation. Produces the cluster overview behind Figure 4.5. | Section 4.3 |
| `alpha_sweep.ipynb` | Duda–Hart α-parameter sweep for RNAsoup partitioning. Generates the calibration plots that motivate the chosen α. | Section 3.5.4 |

A Python reimplementation of the RNAclust clustering pipeline, originally written in Perl. Used for hierarchical clustering of RNA secondary structures based on pairwise distances.

| Script | Purpose |
|---|---|
| `RNAclust.py` | Main clustering pipeline |
| `fastcluster_tree.py` | Hierarchical clustering with fastcluster |
| `locarna_rnafold_pp.py` | LocARNA interface for structure-aware alignment |
| `rnaclustAlignRange.py` | Alignment of cluster subranges |
| `rnaclustCleanAln.py` | Alignment cleaning |
| `rnaclustScores2Dist.py` | Score-to-distance conversion |
| `rnasoup.py` | Cluster visualisation (RNAsoup format) |
| `rnasoup_consMFE.py` | Consensus MFE calculation |

The central data structure is masterDataFrame.parquet (one per species, 0.7–6.9 GB). It aggregates all computed features — sequences, region boundaries, GC content, MFE values, opening energies, RNALfold local structures, codon bias metrics — into a single Polars dataframe. All downstream analysis notebooks read from this file.
GFF3 / FASTA files
│
▼
masterDataFrame.parquet (per species)
┌────┴─────────────────────────────────────────────────┐
│ Sequences, region boundaries, GC content, │
│ MFE, opening energies (RNAplfold), RNALfold │
│ local structures, codon bias metrics (ENC, CAI, …) │
└──┬───────────┬───────────┬───────────────────────────┘
│ │ │
▼ │ ▼
Analysis │ Structural motif discovery
notebooks │ │
(Sec. 4.1–4.2)│ ├──► predTED prefiltering (separate repo)
│ │ │ │
▼ │ │ ▼
Figures: │ │ RNAclust clustering
GC profiles, │ │ │
opening │ │ ▼
energies, │ │ CM library ──► cmscan
codon bias │ │
│ └──► RNAsearch
│
▼
Shuffling controls
(mono-/di-/trinucleotide, codon)
The table below maps every subsection of Chapter 3 to the primary code that produced its results. For tool-binaries (ViennaRNA, Infernal, OrthoFinder, GraphClust2) the wrapper script in this repo is listed; the binary itself is an external dependency. A more granular mapping (every M&M item, all helper functions, output-artefact paths, known gaps) is provided in MATERIALS_AND_METHODS_MAPPING.md.

| § | Method | Primary code | Secondary / library |
|---|---|---|---|
| 3.1.1 | Reference genomes (download) | external (GENCODE, Ensembl, RAP-DB, SGD); versions in thesis Table 3.1 | - |
| 3.1.2 | Transcript-feature extraction → `masterDataFrame.parquet` | `generate_masterDataFrame.py`, `masterDataFrame.ipynb` | `my_module/bed_files.py`, `gff3_reader.py`, `my_module/dataframes.py`, `my_module/io_operations.py` |

| § | Method | Primary code | Secondary |
|---|---|---|---|
| 3.2.1 | Proteome preparation (longest isoform per gene) | `Phylo.ipynb` | `my_module/dataframes.py` |
| 3.2.2 | Orthogroup inference & rooted species tree | external OrthoFinder (SLURM) + `Phylo.ipynb` | post-processing in `Phylo.ipynb` |

| § | Method | Primary code | Secondary |
|---|---|---|---|
| 3.3.1 | Regional and windowed GC profiles | `data_statistics3.ipynb` | `my_module/statistics.py`, `my_module/dataframes.py` |
| 3.3.2 | Codon usage bias (RSCU, ENC) | `codon_bias.ipynb` | `my_module/codon_bias.py`, `my_module/codonmapping.py`, `cai` library |
| 3.3.3 | Kozak PWM and information content | `kozak_sequence.ipynb` | `my_module/kozak_sequence.py` |
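
For the Kozak analysis in §3.3.3, per-position information content is commonly computed as the maximum entropy (2 bits for four nucleotides) minus the Shannon entropy of the observed base frequencies. The sketch below implements that textbook definition without a small-sample correction or a non-uniform background; whether the thesis applies either is not stated here, so treat this as an assumption. The function name `information_content` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def information_content(sequences, alphabet="ACGT"):
    """Per-position information content (bits) of equal-length sequences."""
    max_bits = math.log2(len(alphabet))
    profile = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts[base] for base in alphabet)
        if total == 0:
            profile.append(0.0)  # only ambiguous symbols at this position
            continue
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / total
            if p > 0:
                entropy -= p * math.log2(p)
        profile.append(max_bits - entropy)
    return profile


# Toy windows: a fully conserved ATG followed by a uniformly random position.
windows = ["ATGG", "ATGC", "ATGA", "ATGT"]
bits = information_content(windows)
print(bits)
```

Fully conserved positions reach 2 bits, while positions with uniform base usage carry 0 bits.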

| § | Method | Primary code | Secondary |
|---|---|---|---|
| 3.4.1 | RNAfold global MFE + GC quadratic fit | `RNALfold.ipynb` (MFE/GC cells) | `my_module/rna_structures.py`, `my_module/statistics.py` |
| 3.4.2 | Length-normalised CDS GC profile | `data_statistics3.ipynb` | same as 3.3.1 |
| 3.4.3 | Local accessibility (RNAplfold opening energy) | `process_openenergies.py`, `analyze_multi_species_open_energy.py`, `RNAplfold.ipynb` | `my_module/openenergies.py` |
| 3.4.4 | Sequence shuffling controls (uShuffle k=1/k=2, CodonShuffle dn231, both-ends mode) | `update_all_species_shuffling.py`, `run_shuffle.py`, `shuffling_transcripts.ipynb` | `my_module/shuffling.py`, `my_module/CodonShuffle3.py` |
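
The simplest of the shuffling controls in §3.4.4 is the mononucleotide (k=1) shuffle, which permutes a sequence while preserving its base composition. The following sketch uses Python's standard library rather than uShuffle, and the function names are hypothetical. Shuffles that preserve dinucleotide or codon composition need dedicated algorithms and are not shown.

```python
import random


def mononucleotide_shuffle(seq, rng):
    """Return a random permutation of seq with identical base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def shuffled_controls(seq, n=10, seed=0):
    """Generate n reproducible shuffled controls for one sequence."""
    rng = random.Random(seed)  # fixed seed so controls can be regenerated
    return [mononucleotide_shuffle(seq, rng) for _ in range(n)]


original = "AUGGCCAUUGUAAUGGGCCGC"
controls = shuffled_controls(original, n=5, seed=42)
for control in controls:
    print(control)
```

Because composition is preserved, any difference in opening energy between the native sequence and such controls reflects nucleotide order rather than base content.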

| § | Method | Primary code | Secondary |
|---|---|---|---|
| 3.5.1 | RNALfold local-structure extraction (z ≤ −2) | `RNALfold_zscore_regions.py`, `RNALfold_regions.py`, `RNALfold.ipynb` | `lfolds_to_bed6.py` |
| 3.5.2 | Essentiality filter (region-shuffled robustness) | `RNALfold_zscore_regions.py` (multi-shuffled loop) + `run_filter.py` | `my_module/shuffling.py::shuffle_region` |
| 3.5.3 | predTED prefilter | external github.com/syseitz/predTED; local drivers `prefiltering.py`, `compute_distances.py`, `sparse_to_subclusters.py` | LightGBM, C++/OpenMP |
| 3.5.4 | Hierarchical clustering, RNAclust + RNAsoup, RNAz, R-Scape | `MSA with RNAclust.py.ipynb`, `balanced_subclustering.py`, `compute_dh_partitions.py`, `make_rnaclust_array_jobs.py` | `RNAclustPY/*` (full pipeline), `validate_annotation_transfer.py`, `run_rnaz_mouse_3utr.py` |
| 3.5.5.1 | Whole-transcript Rfam validation (unbiased) | `run_cmscan.py`, `run_rnasearch_rfam_overlap.py`, `MSA with RNAclust.py.ipynb` (cells in §"Unbiased Rfam validation") | `my_module/cmscan_output_parser.py`, `my_module/rfam_overlap.py` |
| 3.5.5.2 | Rfam label propagation (explored & discarded) | `scan_annotation_transfers.py`, `validate_annotation_transfer.py` | - |
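
The unbiased Rfam validation in §3.5.5.1 comes down to asking whether a whole-transcript cmscan hit overlaps a local structure window on the same transcript. The sketch below shows one way to express that test with 1-based inclusive coordinates. It is not the logic in `my_module/rfam_overlap.py`; the function name `overlap_fraction` and the 0.5 threshold are assumptions made for illustration.

```python
def overlap_fraction(window, hit):
    """Fraction of a structure window covered by a cmscan hit.

    Both arguments are (start, end) pairs, 1-based and inclusive.
    Hit coordinates are sorted first because minus-strand hits are
    reported with start > end.
    """
    w_start, w_end = window
    h_start, h_end = sorted(hit)
    shared = min(w_end, h_end) - max(w_start, h_start) + 1
    return max(0, shared) / (w_end - w_start + 1)


MIN_OVERLAP = 0.5  # hypothetical threshold, not taken from the thesis

window = (10, 20)
hits = [(15, 30), (40, 60), (25, 5)]
validated = [h for h in hits if overlap_fraction(window, h) >= MIN_OVERLAP]
print(validated)
```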

| § | Method | Primary code | Secondary |
|---|---|---|---|
| 3.6.1–3.6.2 | Pipeline overview, configuration (80×3 rounds) | `graphclust_local.py` | external github.com/syseitz/GraphClust (NSPDK streaming patch) |
| 3.6.3 | Pooled FASTA construction (12-species union) | `extract_lfold_fastas.py` | - |
| 3.6.4 | Workstation runs (24-thread, 64 GB) | `graphclust_local.py` | - |
| 3.6.5 | Cluster-level alignment quality, R-Scape, window-level Rfam | `analyze_cluster_quality.py`, `run_optimized_aggregation.py`, `run_optimized_aggregation_5utr.py`, `family_cm_roundtrip.py`, `validate_annotation_transfer.py` | - |
| 3.6.6 | g:Profiler functional enrichment | `run_optimized_aggregation.py` (gene-list export) + web interface at biit.cs.ut.ee/gprofiler (see Known gaps) | `cross_species_analysis.ipynb` |
| 3.6.7 | Pcdh / pcdh2a paralogue contamination filter | `my_module/test_optimized_aggregation.py::test_optimized_mcl` (ad-hoc analysis inside the aggregation loop) | `cross_species_analysis.ipynb` |

| Element | Code | Notes |
|---|---|---|
| Side-by-side overview figure | `Master_Thesis/figures/tikz_methods_overview.tex` | TikZ schematic |

| Thesis figure / table | Steps |
|---|---|
| Table 4.1 (transcript counts, GC by region) | generate_masterDataFrame.py → data_statistics3.ipynb § "Region-level GC" |
| Figure 4.5 (5'UTR pooled GraphClust cluster overview) | extract_lfold_fastas.py → graphclust_local.py → run_optimized_aggregation_5utr.py → analyze_cluster_quality.py → validate_annotation_transfer.py → cross_species_5UTR.ipynb |
| Figure 4.7 (Mouse 3'UTR Pcdhg case study) | RNALfold_zscore_regions.py → predTED sparse-mode → sparse_to_subclusters.py → MSA with RNAclust.py.ipynb (Mouse 3'UTR cells) → validate_annotation_transfer.py → figures via regenerate_figures.py |

- g:Profiler (§3.6.6) is not wrapped in Python in this repo. The gene lists exported by `run_optimized_aggregation.py` were submitted manually through the g:Profiler web interface. To reproduce: either repeat the manual submission, or add a `gprofiler-official` PyPI wrapper.
- `openenergies.ipynb` is a thin wrapper (3 cells). The actual computation lives in `process_openenergies.py` and `average_openenergies_shuffled.py`. The notebook is kept for sanity-checking, not as a reproduction driver.
- `codon_mapping.ipynb` (per-codon detail) is not committed. The committed alternative is `codon_bias.ipynb` for the headline RSCU/ENC results.
- External tool binaries (`OrthoFinder`, `GraphClust2`, `Infernal`, `ViennaRNA`) are not part of this repo; versions are pinned in thesis §3.

All analyses were run on the SLURM cluster of the Institute for Theoretical Chemistry (TBI), University of Vienna, with conda environment agat. The my_module.slurm_cluster module automates job submission. Due to the large data volume (~100 GB per species), raw data and intermediate results are not included in this repository.
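
To give a sense of what automated submission involves, the sketch below generates a SLURM array-job script as plain text. It is not the API of `my_module.slurm_cluster`; the function name `make_array_script`, the resource values, and the `chunk_*.fa` file names are hypothetical.

```python
def make_array_script(job_name, command, n_tasks,
                      cpus=1, mem_gb=4, time_limit="01:00:00"):
    """Return the text of an sbatch array script (illustrative sketch)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --array=1-{n_tasks}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={time_limit}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Each array task processes one hypothetical FASTA chunk.
command = (
    "RNALfold -z < chunk_${SLURM_ARRAY_TASK_ID}.fa "
    "> chunk_${SLURM_ARRAY_TASK_ID}.lfold"
)
script = make_array_script("rnalfold", command, n_tasks=12)
print(script)
```

The resulting text can be written to a file and passed to `sbatch`; splitting each species into chunks keeps individual tasks short enough to schedule easily.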
To reproduce the analysis:
- Obtain reference genomes and annotations (see Thesis Table 3.1)
- Generate master dataframes using the pipeline in `my_module/dataframes.py`
- Run structural predictions (RNALfold, RNAplfold) via SLURM
- Execute the notebooks in the order described above
MIT License — see LICENSE for details.