This repository contains the analysis code used to process BID-seq transcriptome data for the PUS writer-dependent pseudouridylation study. The main mapping and candidate-site calling workflow is implemented in BID-transcript.script/polyA_mapping.smk.
The workflow is designed for paired Mock/Treat BID libraries. It preprocesses FASTQ files, maps reads to transcript references, corrects local gap signals, filters candidate T sites, and applies replicate-aware Fisher testing to identify positive sites.
BID-transcript.script/: Snakemake workflows and helper programs for BID transcript mapping and site calling.BID-transcript.script/polyA_mapping.smk: main poly(A) RNA BID-seq workflow.BID-transcript.script/Pseudo_callsites.R: Fisher test and positive-site calling script.BID-transcript.script/config.example.yaml: example configuration file for running the workflow.paper/: manuscript-related notes.
The helper programs adjustGap, realignGap, samFilter, and cpup are from the y9c/pseudoU-BIDseq project and are used here as part of the BID-seq mapping and gap-processing workflow.
For each replicate-level FASTQ pair, polyA_mapping.smk runs:
- adapter and length filtering with
trim_galore; - duplicate removal with
fastp --dedup; - fixed UMI/adapter-region clipping with
fastp; - primary mapping to the first configured reference with
bowtie2; - optional secondary mapping of unmapped reads with
STARwhen two reference types are configured; - forward-strand BAM filtering, gap realignment, sorting,
calmd, and mismatch filtering; samtools mpileupand conversion to per-site count tables;- combined gap adjustment across all replicates for each reference type;
- T-site filtering with local sequence context extraction;
- Mock-vs-Treat Fisher testing with
Pseudo_callsites.R.
The reference types are read from pipeline.ref in YAML order. The first reference is used for the Bowtie2 stage. If a second reference is provided, the workflow treats it as a STAR stage and requires paths.index_star.
The commands called by the workflow must be available in the runtime environment:
- Snakemake
trim_galorefastpbowtie2STARsamtoolsbedtoolscsvtk- Python
- R
The helper binaries/scripts in BID-transcript.script are also required: samFilter, realignGap, cpup, adjustGap, combine_mpileup_tsv.py, and Pseudo_callsites.R.
Start from the example file:
cp BID-transcript.script/config.example.yaml config.yamlEdit config.yaml for your dataset:
paths.script_dir: absolute path toBID-transcript.script.paths.pseudo_callsites_r: absolute path toBID-transcript.script/Pseudo_callsites.R.paths.index_bowtie2: Bowtie2 index basename for the first reference type.paths.index_star: STAR genome directory; required only when two reference types are configured.env.python: Python executable used forcombine_mpileup_tsv.py.env.r: R executable used to runPseudo_callsites.R.pipeline.ref: one or two named reference FASTA paths. Reference FASTA files must have.faiindexes for the T-site context step.pipeline.t_site_windows: number of bases on each side of the candidate T site to extract as context.samples: biological sample definitions. Each sample must contain matchedmockandtreatreplicates. Replicate names must match between groups, and everyfile_indexmust be unique.
Relative FASTQ paths in samples.*.*.*.read1/read2 are resolved relative to the Snakemake working directory passed by --directory.
The analysis run command follows the reference script at /home/HL/project/shuangli/PUS/PUScaller/Mapping_pipeline/BID-mod/polyARNA/run.sh. A portable version of that launcher is:
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
snakemake \
-s "/path/to/BID-transcript.script/polyA_mapping.smk" \
--configfile "$SCRIPT_DIR/config.yaml" \
--directory "$SCRIPT_DIR" \
--cores 200 \
--rerun-incomplete \
"$@"In practice, create a run directory that contains config.yaml and raw_data/, then run:
snakemake \
-s /path/to/BID-transcript.script/polyA_mapping.smk \
--configfile config.yaml \
--directory . \
--cores 200 \
--rerun-incompleteAdditional Snakemake options can be appended as needed, for example -n for a dry run or --printshellcmds for command logging.
The workflow writes outputs under the run directory:
fix.fastq/: trimmed and deduplicated FASTQ files.umi_extract.fastq/: reads after fixed UMI/adapter-region clipping.BAM/: mapped, realigned, filtered BAM files and mapping reports.STAT/mpileup/<ref_type>/: replicate-level mpileup TSV files andCOMBINED-<ref_type>.tsv.STAT/adjustGap/<ref_type>/: gap-adjusted tables and filtered T-site inputs.STAT/fisher/<ref_type>/:<sample>_sites.Fisher.tsvand<sample>_sites.positive.tsv.logs/: per-replicate read-count and mapping summary tables.
The positive-site tables in STAT/fisher/<ref_type>/ are the main candidate pseudouridylation site calls for downstream analysis.