Merged
168 changes: 160 additions & 8 deletions README.md
@@ -1,26 +1,178 @@
# ProteoGenomics Analysis Toolkit
# pgatk -- ProteoGenomics Analysis Toolkit

![Python application](https://github.com/bigbio/pgatk/workflows/Python%20application/badge.svg)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pgatk/README.html)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/f6d030fd7d69413987f7265a01193324)](https://www.codacy.com/gh/bigbio/pgatk/dashboard?utm_source=github.com&utm_medium=referral&utm_content=bigbio/pgatk&utm_campaign=Badge_Grade)
[![PyPI version](https://badge.fury.io/py/pgatk.svg)](https://badge.fury.io/py/pgatk)
![PyPI - Downloads](https://img.shields.io/pypi/dm/pgatk)

**pgatk** is a Python library for proteogenomics data analysis. It provides bioinformatics tools to download, translate, and generate protein sequence databases from reference and mutation genome databases.
**pgatk** is a Python toolkit for building proteogenomics protein sequence databases. It downloads, translates, and combines variant and non-canonical sequences from multiple genomic sources into search-ready FASTA databases compatible with all major proteomics search engines.

## Quick Install
## Key Features

- **Multi-source variant integration** -- Translate variants from ENSEMBL, VCF files, COSMIC, cBioPortal, ClinVar, and gnomAD into protein sequences
- **Non-canonical ORF discovery** -- Three-frame and six-frame translation of lncRNAs, pseudogenes, antisense transcripts, and alternative reading frames
- **Any species** -- Supports all organisms available in ENSEMBL (human, mouse, rice, wheat, etc.)
- **Search engine compatible** -- Output FASTA files work with MaxQuant, SearchGUI, MSFragger, Comet, DIA-NN, and Proteome Discoverer
- **Decoy generation** -- Multiple target-decoy strategies (DecoyPYrat, protein-reverse, protein-shuffle)
- **Peptide-to-genome mapping** -- Map identified peptides back to genomic coordinates (GFF3) for genome browser visualization
- **ClinVar without VEP** -- ClinVar pipeline uses BedTools interval overlap, no VEP annotation required

## Installation

### pip (recommended)

```bash
pip install pgatk
```

### Bioconda

```bash
conda install -c bioconda pgatk
```

### From source

```bash
git clone https://github.com/bigbio/pgatk.git
cd pgatk
pip install .
```

## Quick Start

Build a human variant protein database in four commands:

```bash
# 1. Download ENSEMBL data for human
pgatk ensembl-downloader -t 9606 -o ensembl_human

# 2. Extract transcript sequences (requires gffread)
gffread -F -w ensembl_human/transcripts.fa \
-g ensembl_human/genome.fa \
ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz
Comment on lines +52 to +54

Action required

2. Genome.fa not produced 🐞 Bug ⛯ Reliability

README Quick Start tells users to run gffread with `-g ensembl_human/genome.fa`, but
`ensembl-downloader` saves the genome as a versioned `*.dna_sm.toplevel.fa.gz` file and the
codebase never creates a `genome.fa` convenience file. The Quick Start fails unless the user
manually decompresses and renames the file, a step the README does not document.
Agent Prompt
## Issue description
README Quick Start uses `ensembl_human/genome.fa`, but `ensembl-downloader` saves the genome as a versioned `*.dna_sm.toplevel.fa.gz`. Users following the README will not find `genome.fa`.

## Issue Context
Downloader code constructs the genome filename as `{Species}.{Assembly}.dna_sm.toplevel.fa.gz` and downloads it directly.

## Fix Focus Areas
- README.md[51-55]
- pgatk/ensembl/data_downloader.py[474-483] (reference for actual filename pattern)



# 3. Translate variants to protein sequences
pgatk vcf-to-proteindb \
--vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \
--input_fasta ensembl_human/transcripts.fa \
--gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \
--output_proteindb variant_proteins.fa

# 4. Generate target-decoy database
pgatk generate-decoy \
--input variant_proteins.fa \
--output target_decoy.fa \
Comment on lines +64 to +66

Action required

1. Wrong decoy cli flags 🐞 Bug ✓ Correctness

README Quick Start uses `--input`/`--output` for `pgatk generate-decoy`, but the CLI only defines
`--input_database`/`--output_database` (and `-in`/`-out`). Following the README fails with a Click
"No such option" error, which blocks the Quick Start.
Agent Prompt
## Issue description
README Quick Start documents `pgatk generate-decoy` with `--input` and `--output`, but the CLI only supports `--input_database` / `--output_database` (and `-in` / `-out`). Users will hit a Click error and cannot complete the Quick Start.

## Issue Context
The Click command is defined in `pgatk/commands/proteindb_decoy.py` and does not include `--input`/`--output` aliases.

## Fix Focus Areas
- README.md[63-67]
- pgatk/commands/proteindb_decoy.py[12-17] (optional: if adding aliases)


--method decoypyrat
```
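
The review comment on step 2 notes that the downloader saves the genome as a versioned `*.dna_sm.toplevel.fa.gz` file rather than `genome.fa`. One possible workaround, sketched below, is to decompress that file to the name the Quick Start expects. The helper name `prepare_genome` is hypothetical and not part of pgatk; the glob pattern is taken from the review comment, and the sketch assumes exactly one matching file in the download directory.

```python
import gzip
import shutil
from pathlib import Path


def prepare_genome(download_dir, target_name="genome.fa"):
    """Decompress the single *.dna_sm.toplevel.fa.gz file to <download_dir>/<target_name>.

    Hypothetical helper, not part of pgatk. The filename pattern comes from
    the review comment on the Quick Start.
    """
    directory = Path(download_dir)
    matches = sorted(directory.glob("*.dna_sm.toplevel.fa.gz"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one genome archive in {directory}, found {len(matches)}"
        )
    target = directory / target_name
    # Stream the decompression so a multi-gigabyte genome is never held in memory.
    with gzip.open(matches[0], "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `prepare_genome("ensembl_human")` after step 1 would then leave an `ensembl_human/genome.fa` for the gffread command in step 2.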

## Commands

### Data Downloaders

| Command | Description |
|---------|-------------|
| `ensembl-downloader` | Download ENSEMBL reference data (GTF, FASTA, VCF) for any species by taxonomy ID |
| `ncbi-downloader` | Download NCBI RefSeq annotations and ClinVar VCF |
| `cosmic-downloader` | Download COSMIC somatic mutation data (requires account) |
| `cbioportal-downloader` | Download cBioPortal cancer genomics studies |

### Variant-to-Protein Translation

| Command | Description |
|---------|-------------|
| `vcf-to-proteindb` | Translate VCF variants (ENSEMBL, gnomAD, patient WES/WGS) to protein sequences |
| `clinvar-to-proteindb` | Translate ClinVar clinical variants (no VEP required) |
| `cosmic-to-proteindb` | Translate COSMIC somatic mutations, with optional tissue-type splitting |
| `cbioportal-to-proteindb` | Translate cBioPortal study mutations to protein sequences |

### Sequence Translation

| Command | Description |
|---------|-------------|
| `dnaseq-to-proteindb` | Translate DNA sequences with biotype filtering, multi-frame ORFs, and expression thresholds |
| `threeframe-translation` | Three-frame translation of transcript sequences |
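
To illustrate what three-frame translation means, here is a minimal standard-library sketch. It is not pgatk's implementation: it assumes the standard genetic code, translates only the forward strand, and emits `X` for any codon containing a non-ACGT base. The function names `translate` and `three_frame_translation` are illustrative only.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(sequence, frame=0):
    """Translate a DNA sequence starting at offset `frame` (0, 1, or 2)."""
    sequence = sequence.upper()
    protein = []
    # Stop two bases before the end so only complete codons are read.
    for start in range(frame, len(sequence) - 2, 3):
        codon = sequence[start:start + 3]
        protein.append(CODON_TABLE.get(codon, "X"))  # X marks an ambiguous codon
    return "".join(protein)


def three_frame_translation(sequence):
    """Return the forward-strand translations for frames 0, 1, and 2."""
    return [translate(sequence, frame) for frame in range(3)]
```

Six-frame translation would apply the same function to the reverse complement as well; stop codons appear as `*` and are left for downstream filtering.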

### Database Processing

| Command | Description |
|---------|-------------|
| `generate-decoy` | Generate decoy sequences (methods: `decoypyrat`, `protein-reverse`, `protein-shuffle`, `pgdbdeep`) |
| `ensembl-check` | Validate protein database -- filter short sequences, handle stop codons |

### Post-Processing

| Command | Description |
|---------|-------------|
| `digest-mutant-protein` | In silico digest of variant proteins, filter against canonical proteome to extract unique peptides |
| `map-peptide2genome` | Map identified peptides to genomic coordinates (GFF3 output) |
| `spectrumai` | Inspect MS2 spectra of peptide identifications |
| `blast_get_position` | BLAST peptides against a reference database |

## Supported Variant Sources

| Source | Command | Description |
|--------|---------|-------------|
| ENSEMBL | `vcf-to-proteindb` | Population variants (SNPs, indels) for any ENSEMBL species |
| gnomAD | `vcf-to-proteindb` | Ancestry-stratified population variants (AF_afr, AF_eas, AF_nfe, etc.) |
| ClinVar | `clinvar-to-proteindb` | Clinically annotated pathogenic/benign variants |
| COSMIC | `cosmic-to-proteindb` | Somatic cancer mutations, per tissue type or cell line |
| cBioPortal | `cbioportal-to-proteindb` | Cancer study mutations from TCGA, METABRIC, etc. |
| Custom VCF | `vcf-to-proteindb` | Patient WGS/WES variants from any variant caller (GATK, Strelka, MuTect2) |

## Use Cases

Detailed end-to-end workflows are available in [docs/use-cases.md](docs/use-cases.md):

1. **Cell-type specific non-canonical peptide discovery** -- Reproduce the analysis from Umer et al. 2022
2. **Human variant protein database** -- Standard ENSEMBL-based variant proteogenomics
3. **Population-specific databases** -- gnomAD ancestry-stratified variant databases
4. **ClinVar clinical variants** -- Clinical variant detection at the protein level
5. **Cancer proteogenomics** -- COSMIC, cBioPortal, and patient-specific tumor databases
6. **Novel ORF and micropeptide discovery** -- lncRNA, pseudogene, and alternative ORF translation
7. **Genome annotation refinement** -- Six-frame translation and peptide-to-genome mapping
8. **Metaproteomics** -- Six-frame translation of metagenome assemblies
9. **Long-read transcriptomics** -- Isoform-resolved protein databases from PacBio/ONT data
10. **Plant and non-model organisms** -- Proteogenomics for any ENSEMBL species

## Project Structure

```
pgatk/
├── commands/ # CLI command definitions (Click)
├── ensembl/ # ENSEMBL data download and VCF translation
├── cgenomes/ # COSMIC and cBioPortal handling
├── clinvar/ # ClinVar variant translation
├── proteogenomics/ # Spectral validation tools
├── proteomics/ # Protein database utilities (decoy generation)
├── db/ # Peptide digestion and genome mapping
├── config/ # YAML configuration files
└── toolbox/ # Shared utilities
```

## Full Documentation

[https://pgatk.quantms.org](https://pgatk.quantms.org)

## Cite as
## Cite

If you use pgatk in your research, please cite:

> Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
> **Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.**
> *Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470--1472.
> [https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)

## Contributing

```bash
git clone https://github.com/bigbio/pgatk.git
cd pgatk
pip install -e ".[dev]"
pytest
```

## License

Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.
*Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470-1472.
[https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
Apache License 2.0
96 changes: 96 additions & 0 deletions docs/plans/2026-03-03-protein-accession-design.md
@@ -0,0 +1,96 @@
# Protein Accession and FASTA Header Design

Issue: https://github.com/bigbio/pgatk/issues/18
Branch: `feature/protein-accession-design`
Date: 2026-03-03

## Problem

Current pgatk FASTA headers are inconsistent across variant sources (VCF, COSMIC, ClinVar) and incompatible with major search engines like SearchGUI, which cannot parse ENSEMBL-style IDs.

## Design

### Two protein categories, two prefix strategies

| Category | Prefix | Accession | Description |
|----------|--------|-----------|-------------|
| Canonical (reference) | Keep original (`sp\|`, `tr\|`, `ensp\|`) | Original accession | Untouched from source database |
| Variant (mutated) | `pgvar\|` | `{TRANSCRIPT_ID}-{INDEX}` | pgatk-generated variant protein |

### Variant header format

```
>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {key=value metadata}
```

**Fields:**

- `pgvar` -- database tag identifying pgatk-generated variant proteins.
- `{TRANSCRIPT_ID}-{INDEX}` -- accession composed of parent transcript ID and a dash-separated 1-based index (per transcript, per run). Mirrors UniProt isoform convention (`P12345-2`).
- `{GENE_SYMBOL}` -- gene name, first token after the second pipe.
- Metadata key=value pairs in the description field:

| Key | Description | Example |
|-----|-------------|---------|
| `VariantSource` | Origin database | `COSMIC`, `ClinVar`, `gnomAD`, `dbSNP` |
| `GenomicCoord` | `chr:pos:ref:alt` | `12:25245347:C:G` |
| `AAChange` | HGVS protein notation | `p.G13R` |
| `MutationType` | SO term or short label | `missense_variant` |
| `dbSNP` | rsID if available | `rs121913529` |
| `ORF` | Reading frame number (only when multi-ORF) | `1`, `2`, `3` |
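
The header format above can be sketched as a small formatting function. This is an illustration of the proposed layout, not code from pgatk: the name `format_pgvar_header` and its signature are assumptions, and metadata entries whose value is `None` are simply omitted.

```python
PGVAR_PREFIX = "pgvar"  # database tag proposed in this design


def format_pgvar_header(transcript_id, index, gene_symbol, metadata=None):
    """Build a '>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} key=value ...' header.

    Hypothetical helper; `metadata` is a mapping such as
    {"VariantSource": "COSMIC", "AAChange": "p.G13R"}.
    """
    accession = f"{transcript_id}-{index}"
    header = f">{PGVAR_PREFIX}|{accession}|{gene_symbol}"
    pairs = [
        f"{key}={value}"
        for key, value in (metadata or {}).items()
        if value is not None  # skip optional fields such as dbSNP or ORF when absent
    ]
    return f"{header} {' '.join(pairs)}" if pairs else header
```

Because Python dictionaries preserve insertion order, the metadata keys appear in the header in the order the caller supplies them.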

### Examples

```fasta
# Canonical proteins -- untouched from source databases
>sp|P01112|RASH_HUMAN GTPase HRas OS=Homo sapiens
>ensp|ENSP00000309845|BRCA1

# Variant proteins -- unified pgvar| prefix regardless of source
>pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R MutationType=missense_variant GenomicCoord=12:25245347:C:G
>pgvar|ENST00000311189-2|HRAS VariantSource=COSMIC AAChange=p.Q61L MutationType=missense_variant GenomicCoord=12:25245350:A:T
>pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar AAChange=p.R1699Q MutationType=missense_variant GenomicCoord=17:43094464:G:A dbSNP=rs41293455

# Multiple ORFs -- each ORF gets its own index, ORF number in metadata
>pgvar|ENST00000311189-3|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=1
>pgvar|ENST00000311189-4|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=2
>pgvar|ENST00000311189-5|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=3
```
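
For downstream tools that need to read these headers back, a parser could look like the sketch below. It assumes the layout shown in the examples (three pipe-delimited fields before the first space, then whitespace-separated `key=value` tokens); `parse_pgvar_header` is a hypothetical name, not an existing pgatk function.

```python
def parse_pgvar_header(line):
    """Split a pgvar FASTA header into accession parts and metadata.

    Hypothetical helper illustrating the proposed format.
    """
    if not line.startswith(">pgvar|"):
        raise ValueError(f"not a pgvar header: {line!r}")
    # Everything before the first space is 'db|accession|gene'.
    identifier, _, description = line[1:].strip().partition(" ")
    _db, accession, gene_symbol = identifier.split("|", 2)
    # The index is whatever follows the last dash in the accession.
    transcript_id, _, index = accession.rpartition("-")
    metadata = dict(
        token.split("=", 1) for token in description.split() if "=" in token
    )
    return {
        "accession": accession,
        "transcript_id": transcript_id,
        "index": int(index),
        "gene_symbol": gene_symbol,
        "metadata": metadata,
    }
```

Splitting each token on the first `=` only keeps values such as `17:43094464:G:A` intact, since colons and dots are not treated as separators.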

### Indexing logic

The `-{INDEX}` suffix is assigned per transcript within a single file-generation run:

- First variant on `ENST00000311189` gets `-1`
- Second variant on same transcript gets `-2`
- Multi-ORF outputs each consume an index (3 ORFs = 3 indices)
- First variant on a different transcript resets to `-1`
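
The rules above amount to one counter per transcript, held for the lifetime of a run. A minimal sketch, assuming a hypothetical `VariantIndexer` class (the design does not name one):

```python
from collections import defaultdict


class VariantIndexer:
    """Hand out 1-based, per-transcript indices for one generation run."""

    def __init__(self):
        # Each transcript starts at 0 and is incremented before first use.
        self._counters = defaultdict(int)

    def next_accession(self, transcript_id):
        """Return the next '{TRANSCRIPT_ID}-{INDEX}' accession for a transcript."""
        self._counters[transcript_id] += 1
        return f"{transcript_id}-{self._counters[transcript_id]}"
```

A multi-ORF variant would call `next_accession` once per ORF, so three ORFs consume three indices. Creating a new `VariantIndexer` for each output file gives the per-run reset.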

### Search engine compatibility

| Engine | Compatible | Notes |
|--------|-----------|-------|
| SearchGUI / PeptideShaker | Yes | Matches UniProt-like `db\|acc\|name` pattern |
| MaxQuant | Yes | Default UniProt parse rule works |
| MSFragger / FragPipe | Yes | Reads full header, splits on first whitespace |
| Comet | Yes | Parses `>db\|acc\|` natively |
| DIA-NN | Yes | Follows UniProt-style parsing |
| Proteome Discoverer | Yes | Supports pipe-delimited headers |

### Files to modify

| File | Change |
|------|--------|
| `pgatk/ensembl/ensembl.py` | Refactor `vcf_to_proteindb()` header construction (lines 661-664) |
| `pgatk/clinvar/clinvar_service.py` | Refactor header construction (lines 554-560) |
| `pgatk/cgenomes/cgenomes_proteindb.py` | Refactor COSMIC header (line 317), cBioPortal header |
| `pgatk/toolbox/vcf_utils.py` | Update `write_output()` to handle new format cleanly |
| `pgatk/config/` | Add constants for `PGVAR_PREFIX`, metadata keys |

### Design decisions

1. **Dash separator** (`-`) between transcript and index, consistent with UniProt isoform convention.
2. **No ORF suffix in accession** -- the ORF number is stored as metadata (`ORF=N`), and each ORF gets its own index.
3. **Canonical proteins are pass-through** -- pgatk does not reformat existing database headers.
4. **Unified format across all sources** -- COSMIC, ClinVar, VCF variants all use `pgvar|` regardless of origin.
5. **Key=value metadata** in description field for structured downstream parsing.