From f3dbdc576a647f0d011cff5a437ec5bd84ae5ee5 Mon Sep 17 00:00:00 2001
From: Yasset Perez-Riverol
Date: Wed, 4 Mar 2026 07:04:18 +0000
Subject: [PATCH 1/2] Add protein accession and FASTA header design doc

Design for issue #18: unified pgvar|transcript-index|gene format for
variant proteins, compatible with all major search engines.

Co-Authored-By: Claude Opus 4.6
---
 .../2026-03-03-protein-accession-design.md | 96 +++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 docs/plans/2026-03-03-protein-accession-design.md

diff --git a/docs/plans/2026-03-03-protein-accession-design.md b/docs/plans/2026-03-03-protein-accession-design.md
new file mode 100644
index 0000000..cfe0b8d
--- /dev/null
+++ b/docs/plans/2026-03-03-protein-accession-design.md
@@ -0,0 +1,96 @@
+# Protein Accession and FASTA Header Design
+
+Issue: https://github.com/bigbio/pgatk/issues/18
+Branch: `feature/protein-accession-design`
+Date: 2026-03-03
+
+## Problem
+
+Current pgatk FASTA headers are inconsistent across variant sources (VCF, COSMIC, ClinVar) and incompatible with major search engines like SearchGUI, which cannot parse ENSEMBL-style IDs.
+
+## Design
+
+### Two protein categories, two prefix strategies
+
+| Category | Prefix | Accession | Description |
+|----------|--------|-----------|-------------|
+| Canonical (reference) | Keep original (`sp\|`, `tr\|`, `ensp\|`) | Original accession | Untouched from source database |
+| Variant (mutated) | `pgvar\|` | `{TRANSCRIPT_ID}-{INDEX}` | pgatk-generated variant protein |
+
+### Variant header format
+
+```
+>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {key=value metadata}
+```
+
+**Fields:**
+
+- `pgvar` -- database tag identifying pgatk-generated variant proteins.
+- `{TRANSCRIPT_ID}-{INDEX}` -- accession composed of parent transcript ID and a dash-separated 1-based index (per transcript, per run). Mirrors UniProt isoform convention (`P12345-2`).
+- `{GENE_SYMBOL}` -- gene name, first token after the second pipe.
+- Metadata key=value pairs in the description field:
+
+| Key | Description | Example |
+|-----|-------------|---------|
+| `VariantSource` | Origin database | `COSMIC`, `ClinVar`, `gnomAD`, `dbSNP` |
+| `GenomicCoord` | `chr:pos:ref:alt` | `12:25245347:C:G` |
+| `AAChange` | HGVS protein notation | `p.G13R` |
+| `MutationType` | SO term or short label | `missense_variant` |
+| `dbSNP` | rsID if available | `rs121913529` |
+| `ORF` | Reading frame number (only when multi-ORF) | `1`, `2`, `3` |
+
+### Examples
+
+```fasta
+# Canonical proteins -- untouched from source databases
+>sp|P01112|RASH_HUMAN GTPase HRas OS=Homo sapiens
+>ensp|ENSP00000309845|BRCA1
+
+# Variant proteins -- unified pgvar| prefix regardless of source
+>pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R MutationType=missense_variant GenomicCoord=12:25245347:C:G
+>pgvar|ENST00000311189-2|HRAS VariantSource=COSMIC AAChange=p.Q61L MutationType=missense_variant GenomicCoord=12:25245350:A:T
+>pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar AAChange=p.R1699Q MutationType=missense_variant GenomicCoord=17:43094464:G:A dbSNP=rs41293455
+
+# Multiple ORFs -- each ORF gets its own index, ORF number in metadata
+>pgvar|ENST00000311189-3|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=1
+>pgvar|ENST00000311189-4|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=2
+>pgvar|ENST00000311189-5|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=3
+```
+
+### Indexing logic
+
+The `-{index}` is per-transcript, per-file generation run:
+
+- First variant on `ENST00000311189` gets `-1`
+- Second variant on same transcript gets `-2`
+- Multi-ORF outputs each consume an index (3 ORFs = 3 indices)
+- First variant on a different transcript resets to `-1`
+
+### Search engine compatibility
+
+| Engine | Compatible | Notes |
+|--------|-----------|-------|
+| SearchGUI / PeptideShaker | Yes | Matches UniProt-like `db\|acc\|name` pattern |
+| MaxQuant | Yes | Default UniProt parse rule works |
+| MSFragger / FragPipe | Yes | Reads full header, splits on first whitespace |
+| Comet | Yes | Parses `>db\|acc\|` natively |
+| DIA-NN | Yes | Follows UniProt-style parsing |
+| Proteome Discoverer | Yes | Supports pipe-delimited headers |
+
+### Files to modify
+
+| File | Change |
+|------|--------|
+| `pgatk/ensembl/ensembl.py` | Refactor `vcf_to_proteindb()` header construction (lines 661-664) |
+| `pgatk/clinvar/clinvar_service.py` | Refactor header construction (lines 554-560) |
+| `pgatk/cgenomes/cgenomes_proteindb.py` | Refactor COSMIC header (line 317), cBioPortal header |
+| `pgatk/toolbox/vcf_utils.py` | Update `write_output()` to handle new format cleanly |
+| `pgatk/config/` | Add constants for `PGVAR_PREFIX`, metadata keys |
+
+### Design decisions
+
+1. **Dash separator** (`-`) between transcript and index, consistent with UniProt isoform convention.
+2. **No ORF suffix in accession** -- ORF number is metadata (`ORF=N`), each ORF gets its own index.
+3. **Canonical proteins are pass-through** -- pgatk does not reformat existing database headers.
+4. **Unified format across all sources** -- COSMIC, ClinVar, VCF variants all use `pgvar|` regardless of origin.
+5. **Key=value metadata** in description field for structured downstream parsing.

From 0c8e4e0e727a0154669a7455657d11ac29531831 Mon Sep 17 00:00:00 2001
From: Yasset Perez-Riverol
Date: Wed, 4 Mar 2026 07:55:56 +0000
Subject: [PATCH 2/2] Improve README with comprehensive feature docs and usage guide

Expands the minimal README with key features, installation methods,
quick start example, full command reference table, supported variant
sources, use case index, and project structure overview.

Co-Authored-By: Claude Opus 4.6
---
 README.md | 168 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 160 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 1636d6c..29e12ce 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# ProteoGenomics Analysis Toolkit
+# pgatk -- ProteoGenomics Analysis Toolkit
 
 ![Python application](https://github.com/bigbio/pgatk/workflows/Python%20application/badge.svg)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pgatk/README.html)
@@ -6,21 +6,173 @@
 [![PyPI version](https://badge.fury.io/py/pgatk.svg)](https://badge.fury.io/py/pgatk)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/pgatk)
 
-**pgatk** is a Python library for proteogenomics data analysis. It provides bioinformatics tools to download, translate, and generate protein sequence databases from reference and mutation genome databases.
+**pgatk** is a Python toolkit for building proteogenomics protein sequence databases. It downloads, translates, and combines variant and non-canonical sequences from multiple genomic sources into search-ready FASTA databases compatible with all major proteomics search engines.
 
-## Quick Install
+## Key Features
+
+- **Multi-source variant integration** -- Translate variants from ENSEMBL, VCF files, COSMIC, cBioPortal, ClinVar, and gnomAD into protein sequences
+- **Non-canonical ORF discovery** -- Three-frame and six-frame translation of lncRNAs, pseudogenes, antisense transcripts, and alternative reading frames
+- **Any species** -- Supports all organisms available in ENSEMBL (human, mouse, rice, wheat, etc.)
+- **Search engine compatible** -- Output FASTA files work with MaxQuant, SearchGUI, MSFragger, Comet, DIA-NN, and Proteome Discoverer
+- **Decoy generation** -- Multiple target-decoy strategies (DecoyPYrat, protein-reverse, protein-shuffle)
+- **Peptide-to-genome mapping** -- Map identified peptides back to genomic coordinates (GFF3) for genome browser visualization
+- **ClinVar without VEP** -- ClinVar pipeline uses BedTools interval overlap, no VEP annotation required
+
+## Installation
+
+### pip (recommended)
 
 ```bash
 pip install pgatk
 ```
 
+### Bioconda
+
+```bash
+conda install -c bioconda pgatk
+```
+
+### From source
+
+```bash
+git clone https://github.com/bigbio/pgatk.git
+cd pgatk
+pip install .
+```
+
+## Quick Start
+
+Build a human variant protein database in four commands:
+
+```bash
+# 1. Download ENSEMBL data for human
+pgatk ensembl-downloader -t 9606 -o ensembl_human
+
+# 2. Extract transcript sequences (requires gffread)
+gffread -F -w ensembl_human/transcripts.fa \
+    -g ensembl_human/genome.fa \
+    ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz
+
+# 3. Translate variants to protein sequences
+pgatk vcf-to-proteindb \
+    --vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \
+    --input_fasta ensembl_human/transcripts.fa \
+    --gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \
+    --output_proteindb variant_proteins.fa
+
+# 4. Generate target-decoy database
+pgatk generate-decoy \
+    --input variant_proteins.fa \
+    --output target_decoy.fa \
+    --method decoypyrat
+```
+
+## Commands
+
+### Data Downloaders
+
+| Command | Description |
+|---------|-------------|
+| `ensembl-downloader` | Download ENSEMBL reference data (GTF, FASTA, VCF) for any species by taxonomy ID |
+| `ncbi-downloader` | Download NCBI RefSeq annotations and ClinVar VCF |
+| `cosmic-downloader` | Download COSMIC somatic mutation data (requires account) |
+| `cbioportal-downloader` | Download cBioPortal cancer genomics studies |
+
+### Variant-to-Protein Translation
+
+| Command | Description |
+|---------|-------------|
+| `vcf-to-proteindb` | Translate VCF variants (ENSEMBL, gnomAD, patient WES/WGS) to protein sequences |
+| `clinvar-to-proteindb` | Translate ClinVar clinical variants (no VEP required) |
+| `cosmic-to-proteindb` | Translate COSMIC somatic mutations, with optional tissue-type splitting |
+| `cbioportal-to-proteindb` | Translate cBioPortal study mutations to protein sequences |
+
+### Sequence Translation
+
+| Command | Description |
+|---------|-------------|
+| `dnaseq-to-proteindb` | Translate DNA sequences with biotype filtering, multi-frame ORFs, and expression thresholds |
+| `threeframe-translation` | Three-frame translation of transcript sequences |
+
+### Database Processing
+
+| Command | Description |
+|---------|-------------|
+| `generate-decoy` | Generate decoy sequences (methods: `decoypyrat`, `protein-reverse`, `protein-shuffle`, `pgdbdeep`) |
+| `ensembl-check` | Validate protein database -- filter short sequences, handle stop codons |
+
+### Post-Processing
+
+| Command | Description |
+|---------|-------------|
+| `digest-mutant-protein` | In silico digest of variant proteins, filter against canonical proteome to extract unique peptides |
+| `map-peptide2genome` | Map identified peptides to genomic coordinates (GFF3 output) |
+| `spectrumai` | Inspect MS2 spectra of peptide identifications |
+| `blast_get_position` | BLAST peptides against a reference database |
+
+## Supported Variant Sources
+
+| Source | Command | Description |
+|--------|---------|-------------|
+| ENSEMBL | `vcf-to-proteindb` | Population variants (SNPs, indels) for any ENSEMBL species |
+| gnomAD | `vcf-to-proteindb` | Ancestry-stratified population variants (AF_afr, AF_eas, AF_nfe, etc.) |
+| ClinVar | `clinvar-to-proteindb` | Clinically annotated pathogenic/benign variants |
+| COSMIC | `cosmic-to-proteindb` | Somatic cancer mutations, per tissue type or cell line |
+| cBioPortal | `cbioportal-to-proteindb` | Cancer study mutations from TCGA, METABRIC, etc. |
+| Custom VCF | `vcf-to-proteindb` | Patient WGS/WES variants from any variant caller (GATK, Strelka, MuTect2) |
+
+## Use Cases
+
+Detailed end-to-end workflows are available in [docs/use-cases.md](docs/use-cases.md):
+
+1. **Cell-type specific non-canonical peptide discovery** -- Reproduce the analysis from Umer et al. 2022
+2. **Human variant protein database** -- Standard ENSEMBL-based variant proteogenomics
+3. **Population-specific databases** -- gnomAD ancestry-stratified variant databases
+4. **ClinVar clinical variants** -- Clinical variant detection at the protein level
+5. **Cancer proteogenomics** -- COSMIC, cBioPortal, and patient-specific tumor databases
+6. **Novel ORF and micropeptide discovery** -- lncRNA, pseudogene, and alternative ORF translation
+7. **Genome annotation refinement** -- Six-frame translation and peptide-to-genome mapping
+8. **Metaproteomics** -- Six-frame translation of metagenome assemblies
+9. **Long-read transcriptomics** -- Isoform-resolved protein databases from PacBio/ONT data
+10. **Plant and non-model organisms** -- Proteogenomics for any ENSEMBL species
+
+## Project Structure
+
+```
+pgatk/
+├── commands/          # CLI command definitions (Click)
+├── ensembl/           # ENSEMBL data download and VCF translation
+├── cgenomes/          # COSMIC and cBioPortal handling
+├── clinvar/           # ClinVar variant translation
+├── proteogenomics/    # Spectral validation tools
+├── proteomics/        # Protein database utilities (decoy generation)
+├── db/                # Peptide digestion and genome mapping
+├── config/            # YAML configuration files
+└── toolbox/           # Shared utilities
+```
+
 ## Full Documentation
 
 [https://pgatk.quantms.org](https://pgatk.quantms.org)
 
-## Cite as
+## Cite
+
+If you use pgatk in your research, please cite:
+
+> Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
+> **Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.**
+> *Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470--1472.
+> [https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
+
+## Contributing
+
+```bash
+git clone https://github.com/bigbio/pgatk.git
+cd pgatk
+pip install -e ".[dev]"
+pytest
+```
+
+## License
 
-Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
-Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.
-*Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470-1472.
-[https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
+Apache License 2.0
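
Note on PATCH 1/2: the variant header format and the per-transcript indexing rules in the design doc can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not code from pgatk. `VariantIndexer`, `build_header`, and `parse_header` are hypothetical names introduced here; only `PGVAR_PREFIX` echoes a constant the design doc proposes adding under `pgatk/config/`, and its value is assumed to be `pgvar`.

```python
import re
from collections import defaultdict

# Assumed value; the design doc only names the constant PGVAR_PREFIX.
PGVAR_PREFIX = "pgvar"

# >pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {key=value metadata}
HEADER_RE = re.compile(
    r"^>?pgvar\|(?P<transcript>[^|\s]+)-(?P<index>\d+)\|(?P<gene>[^|\s]+)"
    r"(?:\s+(?P<meta>.*))?$"
)


class VariantIndexer:
    """Hypothetical helper: 1-based index per transcript within one run."""

    def __init__(self):
        self._counts = defaultdict(int)

    def next_accession(self, transcript_id):
        # Every emitted protein (including each ORF) consumes one index.
        self._counts[transcript_id] += 1
        return f"{transcript_id}-{self._counts[transcript_id]}"


def build_header(indexer, transcript_id, gene, **metadata):
    """Build a pgvar header; metadata is written as key=value pairs."""
    accession = indexer.next_accession(transcript_id)
    header = f">{PGVAR_PREFIX}|{accession}|{gene}"
    pairs = " ".join(f"{k}={v}" for k, v in metadata.items() if v is not None)
    return f"{header} {pairs}" if pairs else header


def parse_header(header):
    """Split a pgvar header back into its fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a pgvar header: {header!r}")
    tokens = (match.group("meta") or "").split()
    # Tokens without '=' are ignored rather than treated as errors.
    meta = dict(tok.split("=", 1) for tok in tokens if "=" in tok)
    return {
        "transcript": match.group("transcript"),
        "index": int(match.group("index")),
        "gene": match.group("gene"),
        "metadata": meta,
    }


indexer = VariantIndexer()
first = build_header(indexer, "ENST00000311189", "HRAS",
                     VariantSource="COSMIC", AAChange="p.G13R")
second = build_header(indexer, "ENST00000311189", "HRAS",
                      VariantSource="COSMIC", AAChange="p.Q61L")
other = build_header(indexer, "ENST00000357654", "BRCA1")
parsed = parse_header(first)
```

Keeping the counter in one object per output file matches the "per-transcript, per-file generation run" wording: a fresh `VariantIndexer` restarts every transcript at `-1`. The regex anchors the index on the last dash before the second pipe, so a transcript ID that itself contains a dash would still parse.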