Pipeline to (i) build strand-specific TNT-seq peak-summits from replicate bigWigs, (ii) compute summit to anchor distances for 5′SS / 3′SS / start / stop codon, and (iii) optionally compare to steady-state m6A peaks (Schwartz et al., 2014). This workflow is designed to reproduce the distance to anchor style plots used in Louloupi & Ntini et al. (Cell Reports, 2018; 10.1016/j.celrep.2018.05.077).
Inputs
- TNT-seq strand-specific bigWigs (IP and Input; 3 replicates;
plusandminus) - hg19 chromosome sizes (
hg19.chrom.sizes) - GENCODE v19 GTF (
gencode.v19.annotation.gtf.gz) for anchors - Optional:
mmc3.xlsx(Schwartz 2014) for steady-state m6A comparison
Outputs
- TNT-seq peak summits:
results/peaks/summits.all.bed6 - Anchor distance tables:
results/dist/*.tsv - Optional m6A distance tables:
results/dist_m6aseq/*.tsv - Figures:
results/figs/*.png
bedtoolsbigtools(forbigWigAverageOverBed)GNU parallel- Python 3 with:
pandas numpy scipy matplotlib(optional: seaborn)
mamba create -n tntseq -c conda-forge -c bioconda \
python=3.11 pandas numpy scipy matplotlib seaborn \
bedtools parallel bigtools
mamba activate tntseqRecommended folder structure:
TNT_Assignment/
├── raw_data/ # bigWig inputs
├── annotation/ # chrom.sizes, windows, GTF, split beds
├── anchors/ # anchor BEDs (generated)
├── scripts/ # helper scripts + pipelines
├── results/ # outputs
├── scripts/ # helper scripts + pipelines
├── 00_make_windows.sh # Preprocessing helper to create windows for the genome
├── 01_prep_anchors.sh # Preprocessing helper to create anchors based on annotation file
└── 02_run_tntseq_analysis.sh # main analysis runner
Place strand-specific bigWigs under raw_data/ and set paths in 02_run_tntseq_analysis.sh.
Example (from GSE83561):
- Inputs:
GSM3143797-799 - IP:
GSM3143800-802Each replicate has: *.plus.bw*.minus.bw
So the header will be
IP1_PLUS="raw_data/GSM3143800_m6AIP1.plus.bw"
IP1_MINUS="raw_data/GSM3143800_m6AIP1.minus.bw"
IN1_PLUS="raw_data/GSM3143797_m6AInput1.plus.bw"
IN1_MINUS="raw_data/GSM3143797_m6AInput1.minus.bw"
IP2_PLUS="raw_data/GSM3143801_m6AIP2.plus.bw"
IP2_MINUS="raw_data/GSM3143801_m6AIP2.minus.bw"
IN2_PLUS="raw_data/GSM3143798_m6AInput2.plus.bw"
IN2_MINUS="raw_data/GSM3143798_m6AInput2.minus.bw"
IP3_PLUS="raw_data/GSM3143802_m6aIP3.plus.bw"
IP3_MINUS="raw_data/GSM3143802_m6aIP3.minus.bw"
IN3_PLUS="raw_data/GSM3143799_m6aInput3.plus.bw"
IN3_MINUS="raw_data/GSM3143799_m6aInput3.minus.bw"
The paper used hg19; this pipeline assumes hg19 to avoid liftover.
mkdir -p annotation
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes \
-O annotation/hg19.chrom.sizeswget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
-O annotation/gencode.v19.annotation.gtf.gzIf you want TNT-seq vs m6A-seq comparison, provide the Excel file and set M6A=/path/to/xlsx in 02_run_tntseq_analysis.sh.
Creates:
annotation/genome_20bp.bedannotation/genome_20bp_named.bedannotation/split_beds/chr*.bed(canonical chroms only random, Un etc.. will be ignoredx)
./00_make_windows.sh annotation/hg19.chrom.sizes annotation 20
# args: chrom.sizes outdir window_sizeCreates (BED6, 1bp anchors, strand-aware):
anchors/anchors_5SS.bedanchors/anchors_3SS.bedanchors/anchors_startcodon.bedanchors/anchors_stopcodon.bed
./01_prep_anchors.sh annotation/gencode.v19.annotation.gtf.gz anchors scripts
# args: gtf.gz out_anchor_dir scripts_dirThis produces TNT-seq summits, distance tables, and plots; optionally includes m6A-seq comparison if M6A is set and the file exists.
./02_run_tntseq_analysis.sh
# configure paths at the top of the script (bigWigs, OUTDIR, ANCHORS, M6A)- Coverage matrices (20 bp bins) For each replicate and strand:
- computes IP and Input coverage per 20 bp bin using
bigWigAverageOverBed - writes
results/matrx/repX_{plus,minus}_matrix.txt(chr, start, end, IP_sum, Input_sum)
- Bin scoring (per replicate)
Runs
score_bins.pyto compute per-bin enrichment/statistics and writes:
results/scored/repX.{plus,minus}.bed
- Consensus bins across replicates
This step enforces two filters:
- Statistical significance in all replicates (FDR / q-value)
- Each replicate is scored
score_bins.pyand only bins with q < 0.05 are kept. - An all-3 intersection is then taken so the bin must be significant (q < 0.05) in rep1 AND rep2 AND rep3 (coordinate match, strand-specific)
- Each replicate is scored
- Enrichment support across replicates (Fold enrichment)
- For bins passing the all-3 q-filter, it's required FE ≥ 4 in at least 2 out of 3 replicates (2/3 rule).
Outputs:
- For bins passing the all-3 q-filter, it's required FE ≥ 4 in at least 2 out of 3 replicates (2/3 rule).
- Statistical significance in all replicates (FDR / q-value)
results/final_bins/sigbins.plus.bed6results/final_bins/sigbins.minus.bed6
- Peak merging and summits
-
merges adjacent significant bins into peaks (strand preserved as BED6)
-
maps bins → peak IDs
-
calls per-peak summit (via
make_summits.py) -
writes:
results/peaks/summits.plus.bed6results/peaks/summits.minus.bed6results/peaks/summits.all.bed6
- Distances to anchors
-
sorts summits + anchors
-
computes nearest anchor distances using
bedtools closest -sorted -s -t first -d -
writes:
results/dist/summit_to_5SS.tsvresults/dist/summit_to_3SS.tsvresults/dist/summit_to_start.tsvresults/dist/summit_to_stop.tsv
- Optional: steady-state m6A-seq comparison
If
M6Aexists:
-
converts Excel peaks to BED6 summits (
xlsx_to_summits_bed.py, sheet"Human Peaks") -
computes m6A summit distances to the same anchors
-
writes:
results/m6aseq/peaks/summits.all.bed6results/dist_m6aseq/*.tsv
- Plotting Generates two-line plots (TNT-seq vs m6A-seq):
- raw counts:
results/figs/fig_TNT_vs_m6A.png - normalized density:
results/figs/fig_TNT_vs_m6A_norm.png(value/MAX per window, so max value is 1)
-
results/peaks/summits.all.bed6Strand-aware TNT-seq summits (BED6) -
results/dist/*.tsvSummit-to-anchor closest results (bedtoolsclosestoutput)
results/m6aseq/peaks/summits.all.bed6results/dist_m6aseq/*.tsv
My plots are close to Fig. 1C–F but not identical. This is probably for two reasons:
- The paper computed coverage from BAM files with
bedtools coverageBed(bam files), while we used bigWig signal. Small differences in how coverage is summarized can change IP/Input values and therefore the bins that pass the statistical filters. - The authors likely used a slightly different set of anchors (especially for start/stop codons and transcript isoforms). If the anchor definitions differ, the “distance-to-nearest-anchor” frequencies will also shift. This is also suggested by the fact that the public steady-state m6A dataset shows frequency differences too.
Even so, the main pattern matches the paper: TNT-seq shows strong enrichment near the 5′ and 3′ splice junctions, and it also shows enrichment near the start and stop codons. However, the start/stop profiles are generally broader, weaker, and noisier than the splice-junction signal. The normalized plot makes this easier to see: the overall trends remain the same even if the exact peak heights differ, confirming that TNT-seq and m6A-seq follow similar overall profiles.
- Louloupi, A. & Ntini, E. et al. Cell Reports (2018). 10.1016/j.celrep.2018.05.077
- Schwartz, S. et al. Cell Reports (2014). 10.1016/j.celrep.2014.05.048


