LRP2 is a scalable, end-to-end long-read proteogenomics pipeline built in Nextflow. It identifies and validates protein isoforms by integrating PacBio long-read RNA-seq with mass spectrometry. Starting from full-length non-chimeric (FLNC) reads and/or raw MS files, LRP2 performs transcript discovery, ORF prediction, differential analysis, and mass spectrometry-based protein identification.
The LRP2 Nextflow pipeline consists of five subworkflows:
| Subworkflow | Description |
|---|---|
| 1. PacBio Isocall | Align FLNC reads and collapse to isoforms with PacBio Isocall |
| 2. Transcriptome | Classify transcripts with SQANTI3, filter artifacts, assign deterministic hash-based isoform IDs |
| 3. Predicted proteome | Predict ORFs with CPAT, classify proteins with SQANTI protein |
| 4. Multi-sample analysis | Differential expression and usage (edgeR, DRIMSeq), differential splicing (LR LeafCutter, preliminary implementation) |
| 5. Proteomics | Build custom reference database, convert raw MS files, search with FragPipe or MetaMorpheus, map peptides to isoforms |
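
The "deterministic hash-based isoform IDs" assigned in subworkflow 2 are IDs derived only from a transcript's structure, so the same isoform receives the same ID in every run. The sketch below illustrates the general idea in Python. It assumes an ID built from chromosome, strand, and exon coordinates; the function name, the `iso_` prefix, and the 12-character digest length are hypothetical and are not LRP2's actual scheme.

```python
import hashlib

def isoform_hash_id(chrom, strand, exons, prefix="iso_", length=12):
    """Return a deterministic ID from a transcript's structure (hypothetical scheme)."""
    # Sort exons so the ID does not depend on the order they are supplied in.
    coords = ",".join(f"{start}-{end}" for start, end in sorted(exons))
    key = f"{chrom}:{strand}:{coords}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return prefix + digest[:length]

# Same structure in a different exon order gives the same ID;
# the opposite strand gives a different one.
a = isoform_hash_id("chr1", "+", [(100, 200), (300, 400)])
b = isoform_hash_id("chr1", "+", [(300, 400), (100, 200)])
c = isoform_hash_id("chr1", "-", [(100, 200), (300, 400)])
print(a == b, a == c)
```

Because the ID depends only on the structure, isoforms can be compared across datasets without a shared numbering step.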
Note: The Quick Start uses SLURM on a UVA Rivanna-style HPC setup, which is what the pipeline has been most extensively tested with. LSF is also supported via the `lsf` profile. For other schedulers, see Support and customization.

Specifics vary by cluster: account flags (`-A`), partition names (`-p`), and module names (`nextflow`, `apptainer`) may differ on your system. Check your HPC documentation or ask your administrator.
- Nextflow ≥ 24.04.2 (install guide)
- Singularity/Apptainer or Docker for containerized dependencies
LRP2 is primarily designed for High-Performance Computing (HPC) environments. Although the test dataset can run on a local machine (minimum of 4 CPUs and 32GB RAM recommended), real-world datasets require substantial computational resources to be processed efficiently.
LRP2 uses containers to manage software dependencies. Containers allow for the packaging of libraries, code, and configurations such that each tool in the pipeline can be reliably run in any computing environment without compatibility issues.
To run LRP2, you must have one of the following installed:
- Singularity/Apptainer (required for HPC): Most HPC systems have Singularity or Apptainer pre-installed as a module. You can check this by running `module avail singularity` or `module avail apptainer`. If neither is available, contact your HPC administrator or see the Apptainer installation guide.
- Docker (may be used for local systems):
  - Installation guides: Docker Desktop (Mac/Windows) or Docker Engine (Linux)
Note: Docker requires root/admin privileges, and is typically not permitted on shared HPC systems. We therefore strongly recommend the use of Singularity/Apptainer.
The pipeline will automatically pull and cache container images on first run. Singularity images are cached in work/singularity/ by default.

    git clone https://github.com/sheynkman-lab/LRP2.git
    cd LRP2

Start a persistent terminal session so the pipeline keeps running if you lose your connection:

    screen -S lrp2

Tip: To detach from screen, press `Ctrl+A` then `D`. To reattach later: `screen -r lrp2`.

Certain HPC systems (e.g. UVA Rivanna) only support `screen`. On systems that support it, you can use the tmux terminal multiplexer instead by running `tmux new -s lrp2`.

Request an interactive job with enough resources for the test dataset:

    salloc -c 4 --mem=64G -p your_slurm_partition -A your_allocation --time=4:00:00

Note: Adjust for your HPC system. Replace `your_slurm_partition` with your SLURM partition and `your_allocation` with your SLURM allocation group. UVA Rivanna users can substitute `ijob` for `salloc`. The `-c` (CPUs), `--mem` (memory), and `--time` values above are sufficient for the test dataset, but should be increased for larger datasets.

Load the required modules:

    module load nextflow apptainer

From the LRP2 directory:

    nextflow run . \
      -profile test_rna,singularity,slurm \
      --outdir test_rna_results \
      --hpc_partition your_partition

Note: Replace `your_partition` with your cluster's partition (SLURM) or queue (LSF) name. If your cluster requires additional scheduler options such as account strings or QOS flags, pass them via `--hpc_cluster_options` (e.g., `--hpc_cluster_options '-A your_allocation'` for SLURM, `--hpc_cluster_options '-P your_project'` for LSF). LSF users: also swap `slurm` for `lsf` in the profile. See HPC Scheduler Options.

Note: To run locally on your current node instead of submitting to SLURM, drop the `slurm` profile: `-profile test_rna,singularity`.

Step 1: Get a FragPipe academic license token.
FragPipe requires an academic license for MSFragger, IonQuant, and diaTracer. Before using FragPipe in LRP2 for the first time, review the academic license agreement. To accept the license and request a one-time token, run the following curl command in your terminal, substituting your information for YOUR_FIRST_NAME, YOUR_LAST_NAME, YOUR_EMAIL, and YOUR_INSTITUTION:

    curl --location --request POST \
      'https://msfragger-upgrader.nesvilab.org/upgrader/upgrade_download.php' \
      --form 'transfer="academic"' \
      --form 'agreement2="true"' \
      --form 'agreement3="true"' \
      --form "first_name=YOUR_FIRST_NAME" \
      --form "last_name=YOUR_LAST_NAME" \
      --form "email=YOUR_EMAIL" \
      --form "organization=YOUR_INSTITUTION" \
      --form "download=4.4.1\$zip" \
      --form 'is_fragpipe="true"' \
      > /dev/null 2>&1

Note: Tokens expire quickly. You will need a new token for each run.

Note: Non-academic users: see Support and customization.
Step 2: Check your email for a 6-digit token.
Step 3: Run the test.
From the LRP2 directory:

    nextflow run . \
      -profile test_dda,singularity,slurm \
      --outdir test_results_dda \
      --fragpipe_token "YOUR_TOKEN" \
      --hpc_partition your_partition

Note: The `test_dda` profile automatically sets `--protein_search fragpipe` and `--fragpipe_license_accept true`.

Note: To run locally instead of submitting to SLURM, drop the `slurm` profile: `-profile test_dda,singularity`.

RNA samples must be provided as PacBio full-length non-chimeric (FLNC) reads, as produced by PacBio Isoseq refine, in either BAM or FASTQ format. Input files are assumed to be post-processed: deconcatenation, demultiplexing, and primer removal must already have been done. Do NOT provide raw subreads or CCS reads directly from the sequencer.
Protein samples may be either DDA or DIA, and can be provided in .mzML format or vendor-specific raw formats.
Prepare a comma-delimited samplesheet (.csv) describing your input data:
Note: Sample naming conventions in the samplesheet:
- RNA samples: Each RNA sample must have a unique `sample_name`. These are used by Isocall to label count matrix columns.
- Protein samples: All raw files from the same biological sample (e.g., multiple fractions or injection replicates) must share the same `sample_name` so they are combined and searched together in FragPipe.
- Matched RNA + protein samples: Use the same `sample_name` for the RNA and protein entries. The predicted proteome from that sample will be included in the proteomics search database.
- Unmatched protein samples: If a protein sample has no matched RNA sample, assign it a `sample_name` that does not match any RNA sample. In this case, only the GENCODE reference proteome will be used as the proteomics search database.
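
The naming rules above amount to grouping protein files by `sample_name` and checking whether an RNA row shares that name. A minimal Python sketch of that logic follows. The function name and the example paths are hypothetical, and this is not the pipeline's own code.

```python
from collections import defaultdict

def group_protein_samples(rows):
    """Group protein files by sample_name and flag whether a matching RNA sample exists."""
    rna_names = {row["sample_name"] for row in rows if row["sample_type"] == "RNA"}
    files = defaultdict(list)
    for row in rows:
        if row["sample_type"] == "protein":
            files[row["sample_name"]].append(row["sample_path"])
    return {
        name: {"files": paths, "matched_rna": name in rna_names}
        for name, paths in files.items()
    }

# Hypothetical rows: two fractions matched to an RNA sample, plus one unmatched MS file.
rows = [
    {"sample_name": "ctrl_1", "sample_path": "/data/ctrl_1.flnc.bam", "sample_type": "RNA"},
    {"sample_name": "ctrl_1", "sample_path": "/data/ctrl_1_frac1.raw", "sample_type": "protein"},
    {"sample_name": "ctrl_1", "sample_path": "/data/ctrl_1_frac2.raw", "sample_type": "protein"},
    {"sample_name": "ms_only", "sample_path": "/data/ms_only.mzML", "sample_type": "protein"},
]
groups = group_protein_samples(rows)
for name, info in groups.items():
    print(name, len(info["files"]), info["matched_rna"])
```

Here the two `ctrl_1` fractions form one group that is matched to RNA, while `ms_only` is unmatched and would be searched against the GENCODE reference proteome only.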
Samplesheet columns:
- `sample_name`: Each RNA sample must have a distinct value. Do not include any spaces in this value.
- `sample_path`: Absolute path to the input file.
  - RNA samples should be PacBio FLNC `.bam` or `.fastq` files
  - Protein samples should be `.raw` or `.mzML` files
- `condition`: Sample group (e.g., "control", "treatment"). Used for differential analysis, which performs pairwise comparisons between groups. Two or more groups are supported. If you do not want differential analysis, assign the same condition to all samples. Do not include any spaces in this value.
- `sample_type`: Must be either `RNA` or `protein`.
- `mass_spec_type`: Must be either `DDA` or `DIA`. Required for protein samples. For RNA samples, specify `none` for this column.
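
A quick pre-flight check can catch samplesheet mistakes before a long run. The Python sketch below encodes the column rules listed above. The validator is hypothetical and not part of LRP2, the example paths are made up, and the header order simply follows the column list, which is an assumption.

```python
import csv
import io

REQUIRED = ["sample_name", "sample_path", "condition", "sample_type", "mass_spec_type"]

def validate_samplesheet(text):
    """Return a list of problems found in samplesheet CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    errors, rna_seen = [], set()
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        name, kind, ms = row["sample_name"], row["sample_type"], row["mass_spec_type"]
        if " " in name or " " in row["condition"]:
            errors.append(f"line {n}: sample_name and condition must not contain spaces")
        if not row["sample_path"].startswith("/"):  # POSIX-style absolute path
            errors.append(f"line {n}: sample_path must be absolute")
        if kind == "RNA":
            if name in rna_seen:
                errors.append(f"line {n}: duplicate RNA sample_name '{name}'")
            rna_seen.add(name)
            if ms != "none":
                errors.append(f"line {n}: RNA rows need mass_spec_type 'none'")
        elif kind == "protein":
            if ms not in ("DDA", "DIA"):
                errors.append(f"line {n}: protein rows need mass_spec_type DDA or DIA")
        else:
            errors.append(f"line {n}: sample_type must be RNA or protein")
    return errors

# Hypothetical samplesheet: two RNA samples and two DDA fractions matched to ctrl_1.
EXAMPLE = "\n".join([
    "sample_name,sample_path,condition,sample_type,mass_spec_type",
    "ctrl_1,/data/rna/ctrl_1.flnc.bam,control,RNA,none",
    "treat_1,/data/rna/treat_1.flnc.bam,treatment,RNA,none",
    "ctrl_1,/data/ms/ctrl_1_frac1.raw,control,protein,DDA",
    "ctrl_1,/data/ms/ctrl_1_frac2.raw,control,protein,DDA",
]) + "\n"

print(validate_samplesheet(EXAMPLE))
```

The check only covers the rules stated in this README; the pipeline's own schema validation may be stricter.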
For most datasets, we recommend running the pipeline from a driver shell script submitted to SLURM/LSF. This is more robust than running interactively (as shown in the Quick Start for test data), since large datasets may exceed interactive job time limits or resource quotas.
Create a run_lrp2.sh script with your SLURM/LSF run parameters:

    #!/bin/bash
    #SBATCH --job-name=lrp2_driver
    #SBATCH --partition=your_partition
    #SBATCH --account=your_allocation
    #SBATCH --time=72:00:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=40G

    module load nextflow apptainer

    nextflow run /path/to/LRP2 \
      --input samplesheet.csv \
      --outdir results \
      --dataset_name my_dataset \
      --genome GRCh38.p14.v49 \
      --protein_search fragpipe \
      --fragpipe_token "YOUR_TOKEN" \
      --hpc_partition your_partition \
      -profile singularity,slurm

Submit with:

    sbatch run_lrp2.sh

Note: Customize the above template for your HPC. This includes the `#SBATCH` directives (partition, account), the module names (`nextflow`, `apptainer`), and the `--hpc_partition` and `--hpc_cluster_options` pipeline parameters. For LSF, replace `#SBATCH` directives with `#BSUB` equivalents and use `-profile singularity,lsf`. For other schedulers, see Support and customization.

Resource allocation works on two levels:
- The driver job (`#SBATCH` directives in the shell script): modest resources are sufficient, because Nextflow itself only orchestrates submissions and doesn't run the heavy work.
- Individual pipeline tasks (CPUs, memory, time per process): handled automatically by LRP2's internal configuration. You do not need to specify these on the command line. To customize, edit `conf/base.config`.

Include `--fragpipe_token` only if running the proteomics subworkflow (see Run the RNA + DDA proteomics test dataset for obtaining a token). Differential analysis runs automatically when two or more conditions are present in the samplesheet.
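
To preview which comparisons a samplesheet implies, the conditions can be enumerated pairwise. This Python sketch assumes that conditions are taken from RNA rows and that pairs are formed in alphabetical order. Both are assumptions, and the pipeline's actual ordering of each contrast may differ.

```python
from itertools import combinations

def planned_comparisons(rows):
    """List pairwise condition comparisons implied by a samplesheet (illustrative)."""
    conditions = sorted({r["condition"] for r in rows if r["sample_type"] == "RNA"})
    if len(conditions) < 2:
        return []  # a single condition means no differential analysis
    return list(combinations(conditions, 2))

# Hypothetical RNA rows covering three conditions.
rows = [
    {"sample_name": "c1", "condition": "control", "sample_type": "RNA"},
    {"sample_name": "t1", "condition": "treatment", "sample_type": "RNA"},
    {"sample_name": "k1", "condition": "knockout", "sample_type": "RNA"},
]
pairs = planned_comparisons(rows)
print(pairs)
```

Three conditions give three pairwise comparisons, and a samplesheet with a single condition gives none.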
Nextflow profiles control how the pipeline executes. Multiple profiles can be combined using commas (e.g., -profile test_rna,singularity,slurm).
| Profile | Description | Best For |
|---|---|---|
| `singularity` | Use Singularity/Apptainer containers | HPC systems (most common) |
| `docker` | Use Docker containers | Local machines with Docker installed |
| `conda` | Use Conda environments | Systems without container support (slower) |

Note: We have extensively tested the pipeline with `singularity` locally and on HPC systems, and recommend it. You may use `docker` on local machines. We recommend `conda` only as a last resort, because it does not offer the same reproducibility as containers.

| Profile | Description | When to Use |
|---|---|---|
| `slurm` | Submit jobs to SLURM scheduler on an HPC | HPC environment |
| `lsf` | Submit jobs to LSF scheduler on an HPC | HPC environment |

Note: When using `slurm` or `lsf`, Nextflow submits individual pipeline tasks as separate jobs. Without a scheduler profile, all tasks run on the node where you launch Nextflow (requires sufficient resources). If you intend to run locally, you may need to lower resource requirements in `conf/base.config`.

| Profile | Description | Dataset |
|---|---|---|
| `test_rna` | RNA-only test dataset | Runs RNA subworkflows (S1 - S4) |
| `test_dda` | RNA + DDA proteomics test | Runs all subworkflows (S1 - S5) with FragPipe DDA search |

HPC with SLURM (recommended for production):

    -profile singularity,slurm

Quick RNA test on HPC:

    -profile test_rna,singularity,slurm

Local machine with Docker:

    -profile docker

The pipeline supports human and mouse data using GENCODE reference genomes across multiple versions:
- Human: `GRCh38.p14.v49`, `GRCh38.p14.v48`, ..., `GRCh38.p14.v44`, `GRCh38.p13.v43`, ..., `GRCh37.p13.v19`
- Mouse: `GRCm39.vM38`, `GRCm39.vM37`, `GRCm39.vM36`, `GRCm39.vM35`, `GRCm39.vM34`

The pipeline automatically downloads the appropriate FASTA and GTF files based on your --genome selection. Species is auto-detected from --genome and determines which CPAT model (human or mouse) is used for ORF prediction. See conf/gencode_references.config for the full list of supported versions.
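
Species auto-detection can be pictured as a check on the assembly prefix of the `--genome` value. The Python sketch below is only an illustration and assumes that the prefix alone distinguishes human from mouse. The function name is hypothetical, and the real logic lives in the pipeline configuration.

```python
def species_from_genome(genome):
    """Guess species from a GENCODE-style --genome value (illustrative only)."""
    assembly = genome.split(".")[0]  # e.g. "GRCh38" from "GRCh38.p14.v49"
    if assembly.startswith("GRCh"):
        return "human"
    if assembly.startswith("GRCm"):
        return "mouse"
    raise ValueError(f"unrecognized genome: {genome}")

for value in ["GRCh38.p14.v49", "GRCh37.p13.v19", "GRCm39.vM38"]:
    print(value, species_from_genome(value))
```

An unrecognized value raises an error rather than silently falling back to one species, which seems the safer behavior for choosing a CPAT model.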
Support for RefSeq / igenomes and custom references is under active development.
For a complete list of parameters:

    nextflow run /path/to/LRP2 --help

Warning: Please provide pipeline parameters via the CLI as shown or using the Nextflow `-params-file` option. Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration other than parameters.

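
As an alternative to long command lines, parameters can be kept in a file passed with `-params-file`, which Nextflow accepts in JSON or YAML form. The Python sketch below writes such a file using parameter names from this README. The values and the `params.json` filename are placeholders.

```python
import json

# Keys are pipeline parameter names without the leading "--"; values are placeholders.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "dataset_name": "my_dataset",
    "genome": "GRCh38.p14.v49",
}

with open("params.json", "w") as handle:
    json.dump(params, handle, indent=2)

print(json.dumps(params, indent=2))
```

The file would then be used as `nextflow run /path/to/LRP2 -params-file params.json -profile singularity,slurm`. Profiles stay on the command line because they are Nextflow options rather than pipeline parameters.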
| Parameter | Description | Default |
|---|---|---|
| `--input` | Path to samplesheet CSV (required) | - |
| `--outdir` | Path to output directory (required) | - |
| `--dataset_name` | Run identifier used for output prefixes | `merged` |
| `--genome` | Reference genome version | `GRCh38.p14.v49` |

Use --hpc_partition for SLURM clusters and --hpc_queue for LSF clusters. SLURM defaults to Rivanna conventions — customize for your cluster.
| Parameter | Description | Default |
|---|---|---|
| `--hpc_partition` | SLURM partition | `standard` |
| `--hpc_queue` | LSF queue name | - |
| `--hpc_cluster_options` | Additional scheduler-specific options (e.g., `'-A my_alloc'` for SLURM, `'-P my_project'` for LSF) | - |

| Parameter | Description | Default |
|---|---|---|
| `--min_read_support` | Minimum read support for transcripts | `3` |
| `--isocall_config` | Path to custom Isocall configuration TOML file | `bin/isocall_config.toml` |

| Parameter | Description | Default |
|---|---|---|
| `--protein_coding_filter` | Keep only protein-coding genes | `true` |
| `--internal_priming_filter` | Remove internal priming artifacts | `true` |
| `--template_switching_filter` | Remove template switching artifacts | `true` |
| `--transcript_class_keep` | Structural categories to retain (FSM, ISM, NIC, NNC, ALL) | `FSM,ISM,NIC,NNC` |

| Parameter | Description | Default |
|---|---|---|
| `--min_orf` | Minimum ORF length in nucleotides | `75` |
| `--cpat_coding_threshold` | Coding probability threshold | `0.364` (human), `0.44` (mouse) |
| `--protein_class_keep` | Protein categories to retain | `FPM,NPC,NPE` |

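
The two numeric ORF settings act as simple cutoffs on each candidate ORF. The Python sketch below shows how the defaults in the table would combine. It assumes inclusive comparisons (`>=`), which is a guess, and it is not the pipeline's filtering code.

```python
# Default CPAT coding-probability cutoffs, taken from the parameter table above.
DEFAULT_CUTOFF = {"human": 0.364, "mouse": 0.44}

def passes_orf_filters(orf_length_nt, coding_prob, species="human", min_orf=75, threshold=None):
    """Return True if an ORF meets the length and coding-probability cutoffs (illustrative)."""
    cutoff = DEFAULT_CUTOFF[species] if threshold is None else threshold
    return orf_length_nt >= min_orf and coding_prob >= cutoff

print(passes_orf_filters(300, 0.90))                   # long, confident ORF
print(passes_orf_filters(60, 0.90))                    # shorter than min_orf
print(passes_orf_filters(300, 0.40, species="mouse"))  # below the mouse cutoff
```

A 75-nucleotide minimum corresponds to 25 codons, so the default length filter removes only very short ORFs.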
| Parameter | Description | Default |
|---|---|---|
| `--min_samples_per_intron` | Minimum samples per intron for leafcutter | `2` |
| `--min_samples_per_group` | Minimum samples per group for leafcutter | `1` |
| `--min_usage_ratio` | Minimum junction usage ratio for filtering | `0.01` |

| Parameter | Description | Default |
|---|---|---|
| `--protein_search` | Search engine: `fragpipe` (required) | - |
| `--fragpipe_token` | Single-use academic license token for FragPipe (required if `--protein_search fragpipe`). See Run the RNA + DDA proteomics test dataset for how to obtain one. | - |
| `--fragpipe_workflow` | Path to a custom FragPipe workflow file specifying search parameters (modifications, enzymes, etc.) | default is selected by `mass_spec_type` |

Each subworkflow outputs to numbered module directories. The final module in each subworkflow typically contains the key results, while earlier modules contain intermediate files.
<outdir>/
├── S1_PACBIO_ISOCALL/
│ ├── M1_ISOCALL_ALIGN/
│ ├── M2_ISOCALL_PROFILE/
│ ├── M3_ISOCALL_PREP/
│ ├── M4_ISOCALL_MERGE/
│ └── M5_ISOCALL_CALL/ # GTF of transcript structures and count matrix
├── S2_TRANSCRIPTOME/
│ ├── M1_SQANTI_QC/
│ ├── M2_GENERATE_HASHIDS/
│ └── M3_FILTER_TRANSCRIPTOME/ # GTF, BED12, DNA FASTA, count matrix of the refined transcriptome (technical artifacts removed)
├── S3_PREDICTED_PROTEOME/
│ ├── M1_CPAT_ORF/
│ ├── M2_FILTER_CPAT/ # GTF with exon and CDS type columns for single best ORF identified per transcript
│ ├── M3_SQANTI_PROTEIN/
│ └── M4_PROTEIN_CLASSIFICATION/ # GTF, BED12, protein FASTA, count matrix collapsed to distinct ORFs
├── S4_MULTISAMPLE_ANALYSIS/ # (optional)
│ ├── M1_LEAFCUTTER_LONGREAD/ # Differential splicing results
│ └── M2_DIFFERENTIAL_EXPRESSION/ # Differential expression/usage
│ ├── differential_gene_expression/
│ │ ├── *_DGE_edgeR_results.txt
│ │ ├── *_DGE_edgeR_raw_CPM_matrix.txt
│ │ ├── *_DGE_edgeR_normalized_CPM_matrix.txt
│ │ └── *_DGE_MD_plot.pdf
│ ├── differential_transcript_expression/
│ │ ├── *_DTE_edgeR_results.txt
│ │ └── *_DTE_MD_plot.pdf
│ ├── differential_ORF_expression/
│ │ ├── *_DE_ORF_edgeR_results.txt
│ │ └── *_DE_ORF_MD_plot.pdf
│ ├── differential_transcript_usage/
│ │ └── *_DTU_transcript_DRIMSeq_summary.txt
│ └── differential_ORF_usage/
│ └── *_DU_ORF_DRIMSeq_summary.txt
├── S5_PROTEOMICS/ # (optional)
│ ├── M1_BUILD_PROTEOME_REFERENCE/
│ ├── M2_MSCONVERT_MZML/
│ ├── M3_FRAGPIPE/
│ └── M4_NOVEL_PEPTIDES/ # BED12 of peptides mapped to genome, summary table of novel and annotated peptides mapped to isoforms
└── pipeline_info/ # Execution reports and logs
├── execution_report.html
├── execution_timeline.html
└── lrp2_software_versions.yml
For detailed information about output files, please refer to the output documentation.
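
Since the final module of each subworkflow typically holds the key results, a small helper can locate those directories in an output folder. The Python sketch below relies only on the `S<n>_*` and `M<n>_*` naming shown in the tree above. The helper is hypothetical, and some subworkflows also keep useful outputs in earlier modules (for example `S3_PREDICTED_PROTEOME/M2_FILTER_CPAT`).

```python
import re
import tempfile
from pathlib import Path

def final_modules(outdir):
    """Map each S<n>_* subworkflow directory to its highest-numbered M<n>_* module."""
    result = {}
    for sub in sorted(Path(outdir).glob("S[0-9]*_*")):
        if not sub.is_dir():
            continue
        modules = [m for m in sub.iterdir() if m.is_dir() and re.match(r"M\d+_", m.name)]
        if modules:
            last = max(modules, key=lambda m: int(re.match(r"M(\d+)_", m.name).group(1)))
            result[sub.name] = last.name
    return result

# Demonstrate on a throwaway directory tree that mimics part of the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "S2_TRANSCRIPTOME/M1_SQANTI_QC",
        "S2_TRANSCRIPTOME/M2_GENERATE_HASHIDS",
        "S2_TRANSCRIPTOME/M3_FILTER_TRANSCRIPTOME",
        "S5_PROTEOMICS/M3_FRAGPIPE",
        "S5_PROTEOMICS/M4_NOVEL_PEPTIDES",
        "pipeline_info",
    ]:
        (Path(tmp) / rel).mkdir(parents=True)
    found = final_modules(tmp)

print(found)
```

Pointing `final_modules` at a real `--outdir` would list one directory per subworkflow that ran.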
The LRP2 pipeline was developed through a collaboration by the Sheynkman Lab and Knowles Lab:
- Megan D. Schertzer, Sheynkman Lab - Lead developer
- Julia T. Lewandowski, Knowles Lab - Lead developer
We thank the following people for their extensive assistance in the development of this pipeline:
- Emily F. Watts, Sheynkman Lab - Contributions to LRP and conception of multi-sample analysis subworkflow.
- Madison M. Mehlferber, Sheynkman Lab - Contributor to the original LRP pipeline. Continued pipeline testing and feedback.
- Will Rosenow, Sheynkman Lab - Pipeline testing and feedback.
- Scott I. Adamson, Knowles Lab - Development of leafcutter-py.
- Jocelyne Bruand, Pacific Biosciences - Development of Isocall.
- Elizabeth Tseng, Pacific Biosciences - Development of Isocall.
- Egor Dolzhenko, Pacific Biosciences - Lead Developer of Isocall.
We especially thank the PIs that contributed to this project:
- David A. Knowles, Development of LR LeafCutter and project support / funding
- Gloria Sheynkman, Development/conceptualization of LRP and project support / funding
LRP2 supports a range of customization:
- HPC environment: SLURM and LSF schedulers with configurable partition/queue and cluster options; Singularity or Docker containers, or Conda environments
- Input flexibility: RNA-only, protein-only, or paired RNA + protein samples; DDA or DIA mass spec data
- Reference genomes: GENCODE human and mouse across multiple versions
- Proteomics: alternative search engines (FragPipe or MetaMorpheus) and customizable FragPipe workflows
We welcome input from the community — please reach out if you have a use case not covered by the defaults.
- Issues and bug reports: GitHub Issues
- Direct contact: Megan Schertzer, cwp5au@virginia.edu
This pipeline is released under the MIT License.
If you use LRP2 in your work, please cite:
Schertzer MD, Lewandowski JT, et al. LRP2: A proteogenomics pipeline for long-read informed protein isoform analysis and discovery. Manuscript in preparation.
LRP2 builds on the original LRP framework:
Miller, R. M., Jordan, B. T., Mehlferber, M. M., et al. 2022. "Enhanced protein isoform characterization through long-read proteogenomics." Genome Biology 23(1): 69. doi: 10.1186/s13059-022-02624-y
Please also cite the tools used by the pipeline:
- Isocall (PacBio): github.com/PacificBiosciences/isocall
- SQANTI3
  Pardo-Palacios, F. J., Arzalluz-Luque, A., Kondratova, L., et al. 2024. "SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms." Nature Methods 21(5): 793–797. doi: 10.1038/s41592-024-02229-2
- CPAT
  Wang, L., et al. 2013. "CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model." Nucleic Acids Research 41(6): e74. doi: 10.1093/nar/gkt006
- SQANTI Protein (originally introduced in LRP)
  Miller, R. M., Jordan, B. T., Mehlferber, M. M., et al. 2022. "Enhanced protein isoform characterization through long-read proteogenomics." Genome Biology 23(1): 69. doi: 10.1186/s13059-022-02624-y
- edgeR
  Robinson, M. D., McCarthy, D. J., and Smyth, G. K. 2010. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139–140. doi: 10.1093/bioinformatics/btp616
- DRIMSeq
  Nowicka, M., and Robinson, M. D. 2016. "DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics." F1000Research 5: 1356. doi: 10.12688/f1000research.8900.2
- LeafCutter (adapted as Long-read LeafCutter in LRP2; under active development)
  Li, Y. I., Knowles, D. A., Humphrey, J., et al. 2017. "Annotation-free quantification of RNA splicing using LeafCutter." Nature Genetics 50(1): 151–158. doi: 10.1038/s41588-017-0004-9
- FragPipe / MSFragger
  Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D., and Nesvizhskii, A. I. 2017. "MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics." Nature Methods 14(5): 513–520. doi: 10.1038/nmeth.4256
- IonQuant (label-free quantification in FragPipe)
  Yu, F., Haynes, S. E., and Nesvizhskii, A. I. 2021. "IonQuant enables accurate and sensitive label-free quantification with FDR-controlled match-between-runs." Molecular & Cellular Proteomics 20: 100077. doi: 10.1016/j.mcpro.2021.100077
- MSFragger-DIA (DIA proteomics in FragPipe)
  Yu, F., Teo, G. C., Kong, A. T., et al. 2023. "Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform." Nature Communications 14(1): 4154. doi: 10.1038/s41467-023-39869-5
- GENCODE
  Frankish, A., et al. 2023. "GENCODE: reference annotation for the human and mouse genomes in 2023." Nucleic Acids Research 51(D1): D942–D949. doi: 10.1093/nar/gkac1071

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., and Nahnsen, S. 2020. "The nf-core framework for community-curated bioinformatics pipelines." Nature Biotechnology 38(3): 276–278. doi: 10.1038/s41587-020-0439-x


