PUS Writer Dependent sites Analysis

This repository contains the analysis code used to process BID-seq transcriptome data for the PUS writer-dependent pseudouridylation study. The main mapping and candidate-site calling workflow is implemented in BID-transcript.script/polyA_mapping.smk.

The workflow is designed for paired Mock/Treat BID libraries. It preprocesses FASTQ files, maps reads to transcript references, corrects local gap signals, filters candidate T sites, and applies replicate-aware Fisher testing to identify positive sites.

Repository layout

BID-transcript.script/: Snakemake workflows and helper programs for BID transcript mapping and site calling.
BID-transcript.script/polyA_mapping.smk: main poly(A) RNA BID-seq workflow.
BID-transcript.script/Pseudo_callsites.R: Fisher test and positive-site calling script.
BID-transcript.script/config.example.yaml: example configuration file for running the workflow.
paper/: manuscript-related notes.

Upstream code

The helper programs adjustGap, realignGap, samFilter, and cpup are from the y9c/pseudoU-BIDseq project and are used here as part of the BID-seq mapping and gap-processing workflow.

Pipeline overview

For each replicate-level FASTQ pair, polyA_mapping.smk runs:

adapter and length filtering with trim_galore;
duplicate removal with fastp --dedup;
fixed UMI/adapter-region clipping with fastp;
primary mapping to the first configured reference with bowtie2;
optional secondary mapping of unmapped reads with STAR when two reference types are configured;
forward-strand BAM filtering, gap realignment, sorting, calmd, and mismatch filtering;
samtools mpileup and conversion to per-site count tables;
combined gap adjustment across all replicates for each reference type;
T-site filtering with local sequence context extraction;
Mock-vs-Treat Fisher testing with Pseudo_callsites.R.

The reference types are read from pipeline.ref in YAML order. The first reference is used for the Bowtie2 stage. If a second reference is provided, the workflow treats it as a STAR stage and requires paths.index_star.

Requirements

The commands called by the workflow must be available in the runtime environment:

Snakemake
trim_galore
fastp
bowtie2
STAR
samtools
bedtools
csvtk
Python
R

The helper binaries/scripts in BID-transcript.script are also required: samFilter, realignGap, cpup, adjustGap, combine_mpileup_tsv.py, and Pseudo_callsites.R.

Configure a run

Start from the example file:

cp BID-transcript.script/config.example.yaml config.yaml

Edit config.yaml for your dataset:

paths.script_dir: absolute path to BID-transcript.script.
paths.pseudo_callsites_r: absolute path to BID-transcript.script/Pseudo_callsites.R.
paths.index_bowtie2: Bowtie2 index basename for the first reference type.
paths.index_star: STAR genome directory; required only when two reference types are configured.
env.python: Python executable used for combine_mpileup_tsv.py.
env.r: R executable used to run Pseudo_callsites.R.
pipeline.ref: one or two named reference FASTA paths. Reference FASTA files must have .fai indexes for the T-site context step.
pipeline.t_site_windows: number of bases on each side of the candidate T site to extract as context.
samples: biological sample definitions. Each sample must contain matched mock and treat replicates. Replicate names must match between groups, and every file_index must be unique.

Relative FASTQ paths in samples.*.*.*.read1/read2 are resolved relative to the Snakemake working directory passed by --directory.

Run the workflow

The analysis run command follows the reference script at /home/HL/project/shuangli/PUS/PUScaller/Mapping_pipeline/BID-mod/polyARNA/run.sh. A portable version of that launcher is:

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

snakemake \
  -s "/path/to/BID-transcript.script/polyA_mapping.smk" \
  --configfile "$SCRIPT_DIR/config.yaml" \
  --directory "$SCRIPT_DIR" \
  --cores 200 \
  --rerun-incomplete \
  "$@"

In practice, create a run directory that contains config.yaml and raw_data/, then run:

snakemake \
  -s /path/to/BID-transcript.script/polyA_mapping.smk \
  --configfile config.yaml \
  --directory . \
  --cores 200 \
  --rerun-incomplete

Additional Snakemake options can be appended as needed, for example -n for a dry run or --printshellcmds for command logging.

Main outputs

The workflow writes outputs under the run directory:

fix.fastq/: trimmed and deduplicated FASTQ files.
umi_extract.fastq/: reads after fixed UMI/adapter-region clipping.
BAM/: mapped, realigned, filtered BAM files and mapping reports.
STAT/mpileup/<ref_type>/: replicate-level mpileup TSV files and COMBINED-<ref_type>.tsv.
STAT/adjustGap/<ref_type>/: gap-adjusted tables and filtered T-site inputs.
STAT/fisher/<ref_type>/: <sample>_sites.Fisher.tsv and <sample>_sites.positive.tsv.
logs/: per-replicate read-count and mapping summary tables.

The positive-site tables in STAT/fisher/<ref_type>/ are the main candidate pseudouridylation site calls for downstream analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
BID-transcript.script		BID-transcript.script
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PUS Writer Dependent sites Analysis

Repository layout

Upstream code

Pipeline overview

Requirements

Configure a run

Run the workflow

Main outputs

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PUS Writer Dependent sites Analysis

Repository layout

Upstream code

Pipeline overview

Requirements

Configure a run

Run the workflow

Main outputs

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages