PATTY: a computational method for correcting open chromatin bias in bulk and single-cell CUT&Tag data
Precise profiling of epigenomes is essential for better understanding chromatin biology and gene regulation. Cleavage Under Targets & Tagmentation (CUT&Tag) is an easy and low-cost epigenomic profiling technique that can be performed on a low number of cells and at the single-cell level. With its growing adoption, CUT&Tag datasets spanning diverse biological systems are rapidly accumulating in the field. CUT&Tag assays use the hyperactive transposase Tn5 for DNA tagmentation. Tn5’s preference toward accessible chromatin alters CUT&Tag sequence read distributions in the genome and introduces open chromatin bias that can confound downstream analysis, an issue more substantial in sparse single-cell data. We show that open chromatin bias extensively exists in published CUT&Tag datasets, including those generated with recently optimized high-salt protocols. To address this challenge, we present PATTY (Propensity Analyzer for Tn5 Transposase Yielded bias), a comprehensive computational method that corrects open chromatin bias in CUT&Tag data by leveraging accompanying ATAC-seq. By integrating transcriptomic and epigenomic data using machine learning and integrative modeling, we demonstrate that PATTY enables accurate and robust detection of occupancy sites for both active and repressive histone modifications, including H3K27ac, H3K27me3, and H3K9me3, with experimental validation. We further develop a single-cell CUT&Tag analysis framework built on PATTY and show improved cell clustering when using bias-corrected single-cell CUT&Tag data compared to using uncorrected data. Beyond CUT&Tag, PATTY sets a foundation for further development of bias correction methods for improving data analysis for all Tn5-based high-throughput assays.
Our manuscript is now available on bioRxiv.
PATTY is a computational tool designed to correct open chromatin bias in CUT&Tag data at both bulk and single-cell levels. The current version of PATTY supports open chromatin bias correction for H3K27me3, H3K27ac, and H3K9me3. It leverages a pre-trained logistic regression model, built using CUT&Tag data in the K562 cell line, to correct bias for specific histone modifications.
- Bulk mode: PATTY applies the correction model to genome-wide 200bp tiling bins and generates a bias-corrected score for each candidate bin.
- Single-cell mode: PATTY performs bias correction at the individual cell level, producing a 200bp-bin by cell matrix of corrected signals. It then supports downstream cell clustering analysis using the bias-corrected data to improve biological interpretability and resolution.
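As an illustration of the 200bp binning described above, the following minimal sketch assigns fragments to fixed-width bins and counts them. Assigning each fragment by its midpoint is an assumption made here for simplicity; PATTY's internal counting rule may differ, and the function names are hypothetical.

```python
from collections import Counter

BIN_SIZE = 200  # bin width stated in the text


def bin_start(position, bin_size=BIN_SIZE):
    """Return the start coordinate of the fixed-width bin containing position."""
    return (position // bin_size) * bin_size


def count_fragments_per_bin(fragments, bin_size=BIN_SIZE):
    """Count fragments per (chrom, bin_start).

    Assumption: each fragment is assigned to the bin containing its midpoint.
    """
    counts = Counter()
    for chrom, start, end in fragments:
        midpoint = (start + end) // 2
        counts[(chrom, bin_start(midpoint, bin_size))] += 1
    return counts


# Toy fragments (chrom, start, end), not real data
fragments = [
    ("chr1", 10500, 10646),
    ("chr1", 10550, 10700),
    ("chr2", 20840, 20986),
]
bin_counts = count_fragments_per_bin(fragments)
for (chrom, start), n in sorted(bin_counts.items()):
    print(chrom, start, start + BIN_SIZE, n)
```

In single-cell mode the same idea would extend to a per-cell count by keying on the cell barcode as well.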
- Changelog
v1.0 PATTY for the bioRxiv manuscript and initial submission.
v1.1 Improved the installation steps; prepared for the paper revision.
- Package requirements
PATTY requires Linux or macOS.
PATTY requires Python 3.6+ and Rscript v3+ to run.
PATTY requires the Python packages scipy, numpy, pandas, and joblib to be pre-installed. PATTY sc mode additionally requires the R package ArchR to be pre-installed.
- Genome-wide mappable region annotation
The genome-wide annotation files for the hg38 and mm10 genomes can be downloaded here and supplied as input when running PATTY.
# for root user
$ cd PATTY
$ sudo python setup.py install
# if you are not the root user, you can install PATTY at a specific location where you have write permission
$ python setup.py install --prefix /home/PATTY # here you can replace "/home/PATTY" with any location
$ export PATH=/home/PATTY/bin:$PATH # set up PATH for the software
$ export PYTHONPATH=/home/PATTY/lib/python3.6/site-packages:$PYTHONPATH # set up PYTHONPATH for module import
# To check the PATTY package, just type:
$ PATTY --help # if you see the help manual, you have successfully installed PATTY
NOTE:
- To install PATTY on macOS, users need to download and install Command Line Tools beforehand.
- Bedtools (Quinlan et al., Bioinformatics, 2010) and UCSC tools (Kuhn et al., Brief Bioinform. 2013) will be installed automatically if they are not already installed.
To run PATTY with the default settings, set the following parameters:
- -m MODE, --mode=MODE Mode of PATTY, choose from bulk or sc (single-cell)
- -c CUTTAG, --cuttag=CUTTAG Input fragments file in (paired/single end) BED format for CUT&Tag data, with .bed extension (or .bed.gz for a compressed file). For sc mode, the 4th (name) column of the BED file gives the name/barcode of the corresponding individual cell
- -a ATAC, --atac=ATAC Input fragments file in BED format for ATAC-seq data, with .bed extension (or .bed.gz for a compressed file). The ATAC-seq fragments are used as bulk data in both sc and bulk modes (only the 3 columns chrom, start, end are required)
- -f FACTOR, --factor=FACTOR Factor type of the CUT&Tag data. Currently PATTY supports H3K27me3 (default), H3K27ac, and H3K9me3
- -g GENOME, --genome=GENOME Genome version of the input data, choose from hg38 (default) and mm10
- -o OUTNAME, --outname=OUTNAME Name of output results
Example of running PATTY with default parameters (test data are available for download):
# bulk mode
$ PATTY -m bulk -c ${path}/testdata_bulk_H3K27me3_reads.bed.gz -a ${path}/testdata_bulk_ATAC_reads.bed.gz -f H3K27me3 -o testbulk
# sc mode
$ PATTY -m sc -c ${path}/testdata_sc_H3K27me3_reads.bed.gz -a ${path}/testdata_sc_ATAC_reads.bed.gz -f H3K27me3 -o testsc
PATTY takes aligned fragment files in BED format as input (.bed, or .bed.gz for gzip-compressed files). Users may apply any preferred pre-processing pipeline to generate these files. We recommend retaining only high-quality reads with MAPQ > 30 to ensure accurate bias correction. Note that PATTY expects the original fragment BED files (e.g., converted directly from aligned BAM files, or the fragments.tsv file output by 10x Cell Ranger for sc data). Please do not apply any custom extension or shifting to the fragments.
The expected BED format varies depending on data type:
chr1 10500 10646 . . +
chr2 20840 20986 . . -
The 4th and 5th columns are optional placeholders.
chr1 10500 10646
chr2 20840 21000
chr1 10500 10646 CellA
chr2 20840 21000 CellB
The 4th column must contain the cell barcode or cell name (like AATAACTACGCC-1).
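The format rules above can be checked before running PATTY with a small validator such as the sketch below. The function name `parse_fragment_line` and its `mode` argument are hypothetical helpers, not part of PATTY, and splitting on any whitespace (rather than strictly on tabs) is an assumption.

```python
def parse_fragment_line(line, mode="bulk"):
    """Parse one BED fragment line into (chrom, start, end, barcode).

    Hypothetical helper for illustration. In "bulk" mode only the first
    three columns are used; in "sc" mode a 4th column holding the cell
    barcode or cell name is required.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected at least 3 columns: chrom, start, end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end <= start:
        raise ValueError("invalid interval: %d-%d" % (start, end))
    barcode = None
    if mode == "sc":
        if len(fields) < 4:
            raise ValueError("sc mode requires a 4th column with the cell barcode")
        barcode = fields[3]
    return chrom, start, end, barcode


# Example lines taken from the formats shown above
print(parse_fragment_line("chr1\t10500\t10646\t.\t.\t+"))
print(parse_fragment_line("chr1\t10500\t10646\tCellA", mode="sc"))
```

Running such a check on the first few thousand lines of a fragments file is usually enough to catch a missing barcode column early.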
NAME_PATTYscore.bw
A 200bp-resolution genome-wide track in bigWig format containing the PATTY scores for each candidate bin.
- Scores range from 0 to 1. Higher scores indicate higher confidence of true histone mark occupancy, while lower scores reflect likely false-positive or background signals due to open chromatin bias.
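The 0-to-1 range is what one would expect from a logistic regression model, which passes a weighted sum of features through the logistic function. The sketch below only illustrates that mapping: the two features (a CUT&Tag signal and an ATAC-seq signal), the weights, and the intercept are made-up placeholders, not PATTY's trained parameters, and PATTY's actual feature set may differ.

```python
import math


def logistic_score(features, weights, intercept):
    """Map a feature vector to a score in (0, 1) with the logistic function."""
    linear = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Made-up weights for illustration only (assumption): the CUT&Tag signal
# raises the score and the ATAC-seq signal lowers it.
weights = [1.5, -2.0]
intercept = 0.0

# Hypothetical per-bin feature vectors: [cuttag_signal, atac_signal]
high_cuttag_low_atac = logistic_score([2.0, 0.5], weights, intercept)
low_cuttag_high_atac = logistic_score([0.5, 2.0], weights, intercept)

print(round(high_cuttag_low_atac, 3), round(low_cuttag_high_atac, 3))
```

Under these placeholder weights, a bin with strong CUT&Tag signal and little accessibility scores near 1, while a bin whose signal coincides with high accessibility scores near 0.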
NAME_binXcell.txt.gz
A bin-by-cell PATTY score matrix generated from single-cell CUT&Tag analysis.
- Rows: 200bp bins
- Columns: individual cells
- Values: PATTY scores as in bulk mode, computed for each individual cell
We provide test data for users to try PATTY. The sc/bulk outputs can also be generated by running the command lines in Section 2 with the test data as input. Click the file names to download.
- testing data for bulk mode:
- testing data for sc mode:
You can also set the following parameters for more accurate bias estimation and correction:
- --binMinReads=BINMINREADS
[optional] Bins covered by fewer than 5 (default) reads will be discarded from the analysis. For sc mode, bins with a total of fewer than 5 (default) reads across all high-quality cells will be discarded. Set to 0 to turn off this filter.
- --binList=BINLIST
[optional] BED file of candidate bins/peaks for the analysis. When provided, bulk mode performs the correction only in these bins, and sc mode performs the correction and clustering only in these bins. This parameter is designed for customized high-read bins (bulk mode) or highly variable bins (sc mode). The input peaks/bins will be split into 200bp bins before use.
- --cellnames=CELLNAMES
[optional] Single-column plain text file listing the names of the individual cells to use, one cell name per line. This parameter is only used in sc mode.
- --readCutoff=READCUTOFF
[sc optional] Read-count cutoff for high-quality cells. Cells with fewer than 10000 (default) reads will be discarded from the analysis. Users can lower this parameter for samples with low sequencing depth to include more cells, but a lower cutoff may decrease the accuracy of clustering results because low-quality cells are retained.
- --clusterMethod=CLUSTERMETHOD
[sc optional] Method used for single-cell clustering analysis. The default is K-means (PCA dimensionality reduction + K-means clustering). The optional choices (Seurat and scran) require the related packages to be installed (described in section x).
- --clusterNum=CLUSTERNUM
[sc optional] Number of clusters for K-means clustering; only used with the PCAkm method (set by --clusterMethod). The default is 7.
- --UMAP
[sc optional] Turn on this parameter to generate a UMAP plot of the clustering results.
- --overwrite
[optional] Force overwrite; setting this parameter will remove the existing result! By default, PATTY terminates if a folder with the same name as -o exists in the working directory. Set this parameter to force PATTY to run.
- --keeptmp
[optional] Whether or not to keep the intermediate results (tmpResults/).
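The interplay of --readCutoff and --binMinReads in sc mode, as described above, can be sketched as a two-step filter: first keep cells with enough reads, then keep bins with enough reads across the kept cells. This is a simplified illustration rather than PATTY's implementation; the function name is hypothetical, fragments are assigned to bins by midpoint (an assumption), and the toy example uses much smaller cutoffs than the defaults.

```python
from collections import Counter


def filter_cells_and_bins(fragments, read_cutoff=10000, bin_min_reads=5, bin_size=200):
    """Return (kept_cells, kept_bins) for fragments given as (chrom, start, end, barcode).

    Cells with fewer than read_cutoff reads are dropped; bins whose total
    reads across the kept cells fall below bin_min_reads are dropped.
    """
    reads_per_cell = Counter(barcode for _, _, _, barcode in fragments)
    kept_cells = {cell for cell, n in reads_per_cell.items() if n >= read_cutoff}

    reads_per_bin = Counter()
    for chrom, start, end, barcode in fragments:
        if barcode in kept_cells:
            midpoint = (start + end) // 2  # assumption: assign by midpoint
            reads_per_bin[(chrom, (midpoint // bin_size) * bin_size)] += 1
    kept_bins = {b for b, n in reads_per_bin.items() if n >= bin_min_reads}
    return kept_cells, kept_bins


# Toy data: CellA has 4 fragments, CellB has 1
toy_fragments = [
    ("chr1", 10500, 10646, "CellA"),
    ("chr1", 10410, 10560, "CellA"),
    ("chr1", 10450, 10590, "CellA"),
    ("chr2", 20840, 20986, "CellA"),
    ("chr1", 10500, 10600, "CellB"),
]
kept_cells, kept_bins = filter_cells_and_bins(toy_fragments, read_cutoff=2, bin_min_reads=3)
print(sorted(kept_cells), sorted(kept_bins))
```

Because the bin filter is applied after the cell filter, lowering the read cutoff can also change which bins survive.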
Users can reproduce the bias correction results from the manuscript (Figure 4A, G, H, H3K27me3 CUT&Tag rep1) by running PATTY with the following command line:
$ PATTY -m bulk -c ${path}/H3K27me3_CUTTag_rep1.bed.gz -a ${path}/ATAC.bed.gz -f H3K27me3 -o bulkH3K27me3
Download input CUT&Tag and ATAC data, and example output here.