diff --git a/README.md b/README.md
index 1636d6c..29e12ce 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# ProteoGenomics Analysis Toolkit
+# pgatk -- ProteoGenomics Analysis Toolkit
 
 ![Python application](https://github.com/bigbio/pgatk/workflows/Python%20application/badge.svg)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pgatk/README.html)
@@ -6,21 +6,173 @@
 [![PyPI version](https://badge.fury.io/py/pgatk.svg)](https://badge.fury.io/py/pgatk)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/pgatk)
 
-**pgatk** is a Python library for proteogenomics data analysis. It provides bioinformatics tools to download, translate, and generate protein sequence databases from reference and mutation genome databases.
+**pgatk** is a Python toolkit for building proteogenomics protein sequence databases. It downloads, translates, and combines variant and non-canonical sequences from multiple genomic sources into search-ready FASTA databases compatible with all major proteomics search engines.
 
-## Quick Install
+## Key Features
+
+- **Multi-source variant integration** -- Translate variants from ENSEMBL, VCF files, COSMIC, cBioPortal, ClinVar, and gnomAD into protein sequences
+- **Non-canonical ORF discovery** -- Three-frame and six-frame translation of lncRNAs, pseudogenes, antisense transcripts, and alternative reading frames
+- **Any species** -- Supports all organisms available in ENSEMBL (human, mouse, rice, wheat, etc.)
+- **Search engine compatible** -- Output FASTA files work with MaxQuant, SearchGUI, MSFragger, Comet, DIA-NN, and Proteome Discoverer
+- **Decoy generation** -- Multiple target-decoy strategies (DecoyPYrat, protein-reverse, protein-shuffle)
+- **Peptide-to-genome mapping** -- Map identified peptides back to genomic coordinates (GFF3) for genome browser visualization
+- **ClinVar without VEP** -- ClinVar pipeline uses BedTools interval overlap, no VEP annotation required
+
+## Installation
+
+### pip (recommended)
 
 ```bash
 pip install pgatk
 ```
 
+### Bioconda
+
+```bash
+conda install -c bioconda pgatk
+```
+
+### From source
+
+```bash
+git clone https://github.com/bigbio/pgatk.git
+cd pgatk
+pip install .
+```
+
+## Quick Start
+
+Build a human variant protein database in four commands:
+
+```bash
+# 1. Download ENSEMBL data for human
+pgatk ensembl-downloader -t 9606 -o ensembl_human
+
+# 2. Extract transcript sequences (requires gffread)
+gffread -F -w ensembl_human/transcripts.fa \
+    -g ensembl_human/genome.fa \
+    ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz
+
+# 3. Translate variants to protein sequences
+pgatk vcf-to-proteindb \
+    --vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \
+    --input_fasta ensembl_human/transcripts.fa \
+    --gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \
+    --output_proteindb variant_proteins.fa
+
+# 4. Generate target-decoy database
+pgatk generate-decoy \
+    --input variant_proteins.fa \
+    --output target_decoy.fa \
+    --method decoypyrat
+```
+
+## Commands
+
+### Data Downloaders
+
+| Command | Description |
+|---------|-------------|
+| `ensembl-downloader` | Download ENSEMBL reference data (GTF, FASTA, VCF) for any species by taxonomy ID |
+| `ncbi-downloader` | Download NCBI RefSeq annotations and ClinVar VCF |
+| `cosmic-downloader` | Download COSMIC somatic mutation data (requires account) |
+| `cbioportal-downloader` | Download cBioPortal cancer genomics studies |
+
+### Variant-to-Protein Translation
+
+| Command | Description |
+|---------|-------------|
+| `vcf-to-proteindb` | Translate VCF variants (ENSEMBL, gnomAD, patient WES/WGS) to protein sequences |
+| `clinvar-to-proteindb` | Translate ClinVar clinical variants (no VEP required) |
+| `cosmic-to-proteindb` | Translate COSMIC somatic mutations, with optional tissue-type splitting |
+| `cbioportal-to-proteindb` | Translate cBioPortal study mutations to protein sequences |
+
+### Sequence Translation
+
+| Command | Description |
+|---------|-------------|
+| `dnaseq-to-proteindb` | Translate DNA sequences with biotype filtering, multi-frame ORFs, and expression thresholds |
+| `threeframe-translation` | Three-frame translation of transcript sequences |
+
+### Database Processing
+
+| Command | Description |
+|---------|-------------|
+| `generate-decoy` | Generate decoy sequences (methods: `decoypyrat`, `protein-reverse`, `protein-shuffle`, `pgdbdeep`) |
+| `ensembl-check` | Validate protein database -- filter short sequences, handle stop codons |
+
+### Post-Processing
+
+| Command | Description |
+|---------|-------------|
+| `digest-mutant-protein` | In silico digest of variant proteins, filter against canonical proteome to extract unique peptides |
+| `map-peptide2genome` | Map identified peptides to genomic coordinates (GFF3 output) |
+| `spectrumai` | Inspect MS2 spectra of peptide identifications |
+| `blast_get_position` | BLAST peptides against a reference database |
+
+## Supported Variant Sources
+
+| Source | Command | Description |
+|--------|---------|-------------|
+| ENSEMBL | `vcf-to-proteindb` | Population variants (SNPs, indels) for any ENSEMBL species |
+| gnomAD | `vcf-to-proteindb` | Ancestry-stratified population variants (AF_afr, AF_eas, AF_nfe, etc.) |
+| ClinVar | `clinvar-to-proteindb` | Clinically annotated pathogenic/benign variants |
+| COSMIC | `cosmic-to-proteindb` | Somatic cancer mutations, per tissue type or cell line |
+| cBioPortal | `cbioportal-to-proteindb` | Cancer study mutations from TCGA, METABRIC, etc. |
+| Custom VCF | `vcf-to-proteindb` | Patient WGS/WES variants from any variant caller (GATK, Strelka, MuTect2) |
+
+## Use Cases
+
+Detailed end-to-end workflows are available in [docs/use-cases.md](docs/use-cases.md):
+
+1. **Cell-type specific non-canonical peptide discovery** -- Reproduce the analysis from Umer et al. 2022
+2. **Human variant protein database** -- Standard ENSEMBL-based variant proteogenomics
+3. **Population-specific databases** -- gnomAD ancestry-stratified variant databases
+4. **ClinVar clinical variants** -- Clinical variant detection at the protein level
+5. **Cancer proteogenomics** -- COSMIC, cBioPortal, and patient-specific tumor databases
+6. **Novel ORF and micropeptide discovery** -- lncRNA, pseudogene, and alternative ORF translation
+7. **Genome annotation refinement** -- Six-frame translation and peptide-to-genome mapping
+8. **Metaproteomics** -- Six-frame translation of metagenome assemblies
+9. **Long-read transcriptomics** -- Isoform-resolved protein databases from PacBio/ONT data
+10. **Plant and non-model organisms** -- Proteogenomics for any ENSEMBL species
+
+## Project Structure
+
+```
+pgatk/
+├── commands/         # CLI command definitions (Click)
+├── ensembl/          # ENSEMBL data download and VCF translation
+├── cgenomes/         # COSMIC and cBioPortal handling
+├── clinvar/          # ClinVar variant translation
+├── proteogenomics/   # Spectral validation tools
+├── proteomics/       # Protein database utilities (decoy generation)
+├── db/               # Peptide digestion and genome mapping
+├── config/           # YAML configuration files
+└── toolbox/          # Shared utilities
+```
+
 ## Full Documentation
 
 [https://pgatk.quantms.org](https://pgatk.quantms.org)
 
-## Cite as
+## Cite
+
+If you use pgatk in your research, please cite:
+
+> Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
+> **Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.**
+> *Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470--1472.
+> [https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
+
+## Contributing
+
+```bash
+git clone https://github.com/bigbio/pgatk.git
+cd pgatk
+pip install -e ".[dev]"
+pytest
+```
+
+## License
+
 
-Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
-Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.
-*Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470-1472.
-[https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
+Apache License 2.0
diff --git a/docs/plans/2026-03-03-protein-accession-design.md b/docs/plans/2026-03-03-protein-accession-design.md
new file mode 100644
index 0000000..cfe0b8d
--- /dev/null
+++ b/docs/plans/2026-03-03-protein-accession-design.md
@@ -0,0 +1,96 @@
+# Protein Accession and FASTA Header Design
+
+Issue: https://github.com/bigbio/pgatk/issues/18
+Branch: `feature/protein-accession-design`
+Date: 2026-03-03
+
+## Problem
+
+Current pgatk FASTA headers are inconsistent across variant sources (VCF, COSMIC, ClinVar) and incompatible with major search engines like SearchGUI, which cannot parse ENSEMBL-style IDs.
+
+## Design
+
+### Two protein categories, two prefix strategies
+
+| Category | Prefix | Accession | Description |
+|----------|--------|-----------|-------------|
+| Canonical (reference) | Keep original (`sp\|`, `tr\|`, `ensp\|`) | Original accession | Untouched from source database |
+| Variant (mutated) | `pgvar\|` | `{TRANSCRIPT_ID}-{INDEX}` | pgatk-generated variant protein |
+
+### Variant header format
+
+```
+>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {key=value metadata}
+```
+
+**Fields:**
+
+- `pgvar` -- database tag identifying pgatk-generated variant proteins.
+- `{TRANSCRIPT_ID}-{INDEX}` -- accession composed of parent transcript ID and a dash-separated 1-based index (per transcript, per run). Mirrors UniProt isoform convention (`P12345-2`).
+- `{GENE_SYMBOL}` -- gene name, first token after the second pipe.
+- Metadata key=value pairs in the description field:
+
+| Key | Description | Example |
+|-----|-------------|---------|
+| `VariantSource` | Origin database | `COSMIC`, `ClinVar`, `gnomAD`, `dbSNP` |
+| `GenomicCoord` | `chr:pos:ref:alt` | `12:25245347:C:G` |
+| `AAChange` | HGVS protein notation | `p.G13R` |
+| `MutationType` | SO term or short label | `missense_variant` |
+| `dbSNP` | rsID if available | `rs121913529` |
+| `ORF` | Reading frame number (only when multi-ORF) | `1`, `2`, `3` |
+
+### Examples
+
+```fasta
+# Canonical proteins -- untouched from source databases
+>sp|P01112|RASH_HUMAN GTPase HRas OS=Homo sapiens
+>ensp|ENSP00000309845|BRCA1
+
+# Variant proteins -- unified pgvar| prefix regardless of source
+>pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R MutationType=missense_variant GenomicCoord=12:25245347:C:G
+>pgvar|ENST00000311189-2|HRAS VariantSource=COSMIC AAChange=p.Q61L MutationType=missense_variant GenomicCoord=12:25245350:A:T
+>pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar AAChange=p.R1699Q MutationType=missense_variant GenomicCoord=17:43094464:G:A dbSNP=rs41293455
+
+# Multiple ORFs -- each ORF gets its own index, ORF number in metadata
+>pgvar|ENST00000311189-3|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=1
+>pgvar|ENST00000311189-4|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=2
+>pgvar|ENST00000311189-5|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=3
+```
+
+### Indexing logic
+
+The `-{index}` is per-transcript, per-file generation run:
+
+- First variant on `ENST00000311189` gets `-1`
+- Second variant on same transcript gets `-2`
+- Multi-ORF outputs each consume an index (3 ORFs = 3 indices)
+- First variant on a different transcript resets to `-1`
+
+### Search engine compatibility
+
+| Engine | Compatible | Notes |
+|--------|-----------|-------|
+| SearchGUI / PeptideShaker | Yes | Matches UniProt-like `db\|acc\|name` pattern |
+| MaxQuant | Yes | Default UniProt parse rule works |
+| MSFragger / FragPipe | Yes | Reads full header, splits on first whitespace |
+| Comet | Yes | Parses `>db\|acc\|` natively |
+| DIA-NN | Yes | Follows UniProt-style parsing |
+| Proteome Discoverer | Yes | Supports pipe-delimited headers |
+
+### Files to modify
+
+| File | Change |
+|------|--------|
+| `pgatk/ensembl/ensembl.py` | Refactor `vcf_to_proteindb()` header construction (lines 661-664) |
+| `pgatk/clinvar/clinvar_service.py` | Refactor header construction (lines 554-560) |
+| `pgatk/cgenomes/cgenomes_proteindb.py` | Refactor COSMIC header (line 317), cBioPortal header |
+| `pgatk/toolbox/vcf_utils.py` | Update `write_output()` to handle new format cleanly |
+| `pgatk/config/` | Add constants for `PGVAR_PREFIX`, metadata keys |
+
+### Design decisions
+
+1. **Dash separator** (`-`) between transcript and index, consistent with UniProt isoform convention.
+2. **No ORF suffix in accession** -- ORF number is metadata (`ORF=N`), each ORF gets its own index.
+3. **Canonical proteins are pass-through** -- pgatk does not reformat existing database headers.
+4. **Unified format across all sources** -- COSMIC, ClinVar, VCF variants all use `pgvar|` regardless of origin.
+5. **Key=value metadata** in description field for structured downstream parsing.
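
Reviewer sketch (not part of the patch): the header format and indexing rules in the design doc could be exercised with something like the following. This is a minimal illustration under stated assumptions, not pgatk's implementation. `PgvarHeaderBuilder` and `parse_pgvar_header` are hypothetical names; only `PGVAR_PREFIX`, the `pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL}` layout, and the metadata keys come from the design doc.

```python
from collections import defaultdict
from typing import Dict, Optional

# Constant name proposed in the design doc; the value is the documented tag.
PGVAR_PREFIX = "pgvar"


class PgvarHeaderBuilder:
    """Hypothetical helper: one instance per output file (one generation run)."""

    def __init__(self) -> None:
        # Last index handed out for each transcript; a new transcript starts at 0.
        self._last_index: Dict[str, int] = defaultdict(int)

    def build(self, transcript_id: str, gene_symbol: str,
              **metadata: Optional[str]) -> str:
        """Return a FASTA header line; each call consumes one index."""
        self._last_index[transcript_id] += 1
        accession = f"{transcript_id}-{self._last_index[transcript_id]}"
        # Skip keys with no value (e.g. dbSNP when no rsID is available).
        pairs = " ".join(f"{k}={v}" for k, v in metadata.items() if v is not None)
        header = f">{PGVAR_PREFIX}|{accession}|{gene_symbol}"
        return f"{header} {pairs}" if pairs else header


def parse_pgvar_header(header: str) -> dict:
    """Split a pgvar header back into accession parts and metadata."""
    tag, accession, rest = header.lstrip(">").split("|", 2)
    if tag != PGVAR_PREFIX:
        raise ValueError(f"not a pgvar header: {header!r}")
    gene_symbol, *pairs = rest.split()
    # rsplit keeps any dashes that belong to the transcript ID itself.
    transcript_id, index = accession.rsplit("-", 1)
    return {
        "transcript_id": transcript_id,
        "index": int(index),
        "gene_symbol": gene_symbol,
        "metadata": dict(pair.split("=", 1) for pair in pairs),
    }


if __name__ == "__main__":
    builder = PgvarHeaderBuilder()
    # Multi-ORF output: each ORF consumes its own index, ORF number is metadata.
    for orf in (1, 2, 3):
        print(builder.build("ENST00000311189", "HRAS",
                            VariantSource="COSMIC", AAChange="p.G13R", ORF=str(orf)))
    # A different transcript starts again at -1.
    print(builder.build("ENST00000357654", "BRCA1", VariantSource="ClinVar"))
```

Keeping the counter on a per-file builder object, rather than in module state, is one way to get the "per-transcript, per-file generation run" reset behaviour the doc describes. The sketch also assumes metadata values contain no whitespace, which holds for every example in the design doc.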