bigbio · ypriverol · Mar 4, 2026
diff --git a/README.md b/README.md
@@ -1,26 +1,178 @@
-# ProteoGenomics Analysis Toolkit
+# pgatk -- ProteoGenomics Analysis Toolkit
 
 ![Python application](https://github.com/bigbio/pgatk/workflows/Python%20application/badge.svg)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pgatk/README.html)
 [![Codacy Badge](https://app.codacy.com/project/badge/Grade/f6d030fd7d69413987f7265a01193324)](https://www.codacy.com/gh/bigbio/pgatk/dashboard?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=bigbio/pgatk&amp;utm_campaign=Badge_Grade)
 [![PyPI version](https://badge.fury.io/py/pgatk.svg)](https://badge.fury.io/py/pgatk)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/pgatk)
 
-**pgatk** is a Python library for proteogenomics data analysis. It provides bioinformatics tools to download, translate, and generate protein sequence databases from reference and mutation genome databases.
+**pgatk** is a Python toolkit for building proteogenomics protein sequence databases. It downloads, translates, and combines variant and non-canonical sequences from multiple genomic sources into search-ready FASTA databases compatible with major proteomics search engines.
 
-## Quick Install
+## Key Features
+
+- **Multi-source variant integration** -- Translate variants from ENSEMBL, VCF files, COSMIC, cBioPortal, ClinVar, and gnomAD into protein sequences
+- **Non-canonical ORF discovery** -- Three-frame and six-frame translation of lncRNAs, pseudogenes, antisense transcripts, and alternative reading frames
+- **Any species** -- Supports all organisms available in ENSEMBL (human, mouse, rice, wheat, etc.)
+- **Search engine compatible** -- Output FASTA files work with MaxQuant, SearchGUI, MSFragger, Comet, DIA-NN, and Proteome Discoverer
+- **Decoy generation** -- Multiple target-decoy strategies (DecoyPYrat, protein-reverse, protein-shuffle)
+- **Peptide-to-genome mapping** -- Map identified peptides back to genomic coordinates (GFF3) for genome browser visualization
+- **ClinVar without VEP** -- The ClinVar pipeline uses BEDTools interval overlap, so no VEP annotation is required
+
+## Installation
+
+### pip (recommended)
 
 ```bash
 pip install pgatk
 ```
 
+### Bioconda
+
+```bash
+conda install -c bioconda pgatk
+```
+
+### From source
+
+```bash
+git clone https://github.com/bigbio/pgatk.git
+cd pgatk
+pip install .
+```
+
+## Quick Start
+
+Build a human variant protein database in four commands:
+
+```bash
+# 1. Download ENSEMBL data for human
+pgatk ensembl-downloader -t 9606 -o ensembl_human
+
+# 2. Extract transcript sequences (requires gffread)
+gffread -F -w ensembl_human/transcripts.fa \
+    -g ensembl_human/genome.fa \
+    ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz
+
+# 3. Translate variants to protein sequences
+pgatk vcf-to-proteindb \
+    --vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \
+    --input_fasta ensembl_human/transcripts.fa \
+    --gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \
+    --output_proteindb variant_proteins.fa
+
+# 4. Generate target-decoy database
+pgatk generate-decoy \
+    --input variant_proteins.fa \
+    --output target_decoy.fa \
+    --method decoypyrat
+```
+
+## Commands
+
+### Data Downloaders
+
+| Command | Description |
+|---------|-------------|
+| `ensembl-downloader` | Download ENSEMBL reference data (GTF, FASTA, VCF) for any species by taxonomy ID |
+| `ncbi-downloader` | Download NCBI RefSeq annotations and ClinVar VCF |
+| `cosmic-downloader` | Download COSMIC somatic mutation data (requires account) |
+| `cbioportal-downloader` | Download cBioPortal cancer genomics studies |
+
+### Variant-to-Protein Translation
+
+| Command | Description |
+|---------|-------------|
+| `vcf-to-proteindb` | Translate VCF variants (ENSEMBL, gnomAD, patient WES/WGS) to protein sequences |
+| `clinvar-to-proteindb` | Translate ClinVar clinical variants (no VEP required) |
+| `cosmic-to-proteindb` | Translate COSMIC somatic mutations, with optional tissue-type splitting |
+| `cbioportal-to-proteindb` | Translate cBioPortal study mutations to protein sequences |
+
+### Sequence Translation
+
+| Command | Description |
+|---------|-------------|
+| `dnaseq-to-proteindb` | Translate DNA sequences with biotype filtering, multi-frame ORFs, and expression thresholds |
+| `threeframe-translation` | Three-frame translation of transcript sequences |
+
+### Database Processing
+
+| Command | Description |
+|---------|-------------|
+| `generate-decoy` | Generate decoy sequences (methods: `decoypyrat`, `protein-reverse`, `protein-shuffle`, `pgdbdeep`) |
+| `ensembl-check` | Validate protein database -- filter short sequences, handle stop codons |
+
+### Post-Processing
+
+| Command | Description |
+|---------|-------------|
+| `digest-mutant-protein` | In silico digestion of variant proteins, filtered against the canonical proteome to extract unique peptides |
+| `map-peptide2genome` | Map identified peptides to genomic coordinates (GFF3 output) |
+| `spectrumai` | Inspect MS2 spectra of peptide identifications |
+| `blast_get_position` | BLAST peptides against a reference database |
+
+## Supported Variant Sources
+
+| Source | Command | Description |
+|--------|---------|-------------|
+| ENSEMBL | `vcf-to-proteindb` | Population variants (SNPs, indels) for any ENSEMBL species |
+| gnomAD | `vcf-to-proteindb` | Ancestry-stratified population variants (AF_afr, AF_eas, AF_nfe, etc.) |
+| ClinVar | `clinvar-to-proteindb` | Clinically annotated pathogenic/benign variants |
+| COSMIC | `cosmic-to-proteindb` | Somatic cancer mutations, per tissue type or cell line |
+| cBioPortal | `cbioportal-to-proteindb` | Cancer study mutations from TCGA, METABRIC, etc. |
+| Custom VCF | `vcf-to-proteindb` | Patient WGS/WES variants from any variant caller (GATK, Strelka, Mutect2) |
+
+## Use Cases
+
+Detailed end-to-end workflows are available in [docs/use-cases.md](docs/use-cases.md):
+
+1. **Cell-type-specific non-canonical peptide discovery** -- Reproduce the analysis from Umer et al. 2022
+2. **Human variant protein database** -- Standard ENSEMBL-based variant proteogenomics
+3. **Population-specific databases** -- gnomAD ancestry-stratified variant databases
+4. **ClinVar clinical variants** -- Clinical variant detection at the protein level
+5. **Cancer proteogenomics** -- COSMIC, cBioPortal, and patient-specific tumor databases
+6. **Novel ORF and micropeptide discovery** -- lncRNA, pseudogene, and alternative ORF translation
+7. **Genome annotation refinement** -- Six-frame translation and peptide-to-genome mapping
+8. **Metaproteomics** -- Six-frame translation of metagenome assemblies
+9. **Long-read transcriptomics** -- Isoform-resolved protein databases from PacBio/ONT data
+10. **Plant and non-model organisms** -- Proteogenomics for any ENSEMBL species
+
+## Project Structure
+
+```
+pgatk/
+├── commands/           # CLI command definitions (Click)
+├── ensembl/            # ENSEMBL data download and VCF translation
+├── cgenomes/           # COSMIC and cBioPortal handling
+├── clinvar/            # ClinVar variant translation
+├── proteogenomics/     # Spectral validation tools
+├── proteomics/         # Protein database utilities (decoy generation)
+├── db/                 # Peptide digestion and genome mapping
+├── config/             # YAML configuration files
+└── toolbox/            # Shared utilities
+```
+
 ## Full Documentation
 
 [https://pgatk.quantms.org](https://pgatk.quantms.org)
 
-## Cite as
+## Cite
+
+If you use pgatk in your research, please cite:
+
+> Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
+> **Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.**
+> *Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470-1472.
+> [https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
+
+## Contributing
+
+```bash
+git clone https://github.com/bigbio/pgatk.git
+cd pgatk
+pip install -e ".[dev]"
+pytest
+```
+
+## License
 
-Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol.
-Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.
-*Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470-1472.
-[https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838)
+Apache License 2.0
diff --git a/docs/plans/2026-03-03-protein-accession-design.md b/docs/plans/2026-03-03-protein-accession-design.md
@@ -0,0 +1,96 @@
+# Protein Accession and FASTA Header Design
+
+Issue: https://github.com/bigbio/pgatk/issues/18
+Branch: `feature/protein-accession-design`
+Date: 2026-03-03
+
+## Problem
+
+Current pgatk FASTA headers are inconsistent across variant sources (VCF, COSMIC, ClinVar) and incompatible with major search engines like SearchGUI, which cannot parse ENSEMBL-style IDs.
+
+## Design
+
+### Two protein categories, two prefix strategies
+
+| Category | Prefix | Accession | Description |
+|----------|--------|-----------|-------------|
+| Canonical (reference) | Keep original (`sp\|`, `tr\|`, `ensp\|`) | Original accession | Untouched from source database |
+| Variant (mutated) | `pgvar\|` | `{TRANSCRIPT_ID}-{INDEX}` | pgatk-generated variant protein |
+
+### Variant header format
+
+```
+>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {key=value metadata}
+```
+
+**Fields:**
+
+- `pgvar` -- database tag identifying pgatk-generated variant proteins.
+- `{TRANSCRIPT_ID}-{INDEX}` -- accession composed of parent transcript ID and a dash-separated 1-based index (per transcript, per run). Mirrors UniProt isoform convention (`P12345-2`).
+- `{GENE_SYMBOL}` -- gene name, first token after the second pipe.
+- Metadata key=value pairs in the description field:
+
+| Key | Description | Example |
+|-----|-------------|---------|
+| `VariantSource` | Origin database | `COSMIC`, `ClinVar`, `gnomAD`, `dbSNP` |
+| `GenomicCoord` | `chr:pos:ref:alt` | `12:25245347:C:G` |
+| `AAChange` | HGVS protein notation | `p.G13R` |
+| `MutationType` | SO term or short label | `missense_variant` |
+| `dbSNP` | rsID if available | `rs121913529` |
+| `ORF` | Reading frame number (only when multi-ORF) | `1`, `2`, `3` |
+
+### Examples
+
+```fasta
+# Canonical proteins -- untouched from source databases
+>sp|P01112|RASH_HUMAN GTPase HRas OS=Homo sapiens
+>ensp|ENSP00000309845|BRCA1
+
+# Variant proteins -- unified pgvar| prefix regardless of source
+>pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R MutationType=missense_variant GenomicCoord=12:25245347:C:G
+>pgvar|ENST00000311189-2|HRAS VariantSource=COSMIC AAChange=p.Q61L MutationType=missense_variant GenomicCoord=12:25245350:A:T
+>pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar AAChange=p.R1699Q MutationType=missense_variant GenomicCoord=17:43094464:G:A dbSNP=rs41293455
+
+# Multiple ORFs -- each ORF gets its own index, ORF number in metadata
+>pgvar|ENST00000311189-3|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=1
+>pgvar|ENST00000311189-4|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=2
+>pgvar|ENST00000311189-5|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=3
+```
+
+### Indexing logic
+
+The `-{INDEX}` suffix is assigned per transcript, within a single file-generation run:
+
+- First variant on `ENST00000311189` gets `-1`
+- Second variant on same transcript gets `-2`
+- Multi-ORF outputs each consume an index (3 ORFs = 3 indices)
+- First variant on a different transcript resets to `-1`
+
+### Search engine compatibility
+
+| Engine | Compatible | Notes |
+|--------|-----------|-------|
+| SearchGUI / PeptideShaker | Yes | Matches UniProt-like `db\|acc\|name` pattern |
+| MaxQuant | Yes | Default UniProt parse rule works |
+| MSFragger / FragPipe | Yes | Reads full header, splits on first whitespace |
+| Comet | Yes | Parses `>db\|acc\|` natively |
+| DIA-NN | Yes | Follows UniProt-style parsing |
+| Proteome Discoverer | Yes | Supports pipe-delimited headers |
+
+### Files to modify
+
+| File | Change |
+|------|--------|
+| `pgatk/ensembl/ensembl.py` | Refactor `vcf_to_proteindb()` header construction (lines 661-664) |
+| `pgatk/clinvar/clinvar_service.py` | Refactor header construction (lines 554-560) |
+| `pgatk/cgenomes/cgenomes_proteindb.py` | Refactor COSMIC header (line 317), cBioPortal header |
+| `pgatk/toolbox/vcf_utils.py` | Update `write_output()` to handle new format cleanly |
+| `pgatk/config/` | Add constants for `PGVAR_PREFIX`, metadata keys |
+
+### Design decisions
+
+1. **Dash separator** (`-`) between transcript and index, consistent with UniProt isoform convention.
+2. **No ORF suffix in accession** -- ORF number is metadata (`ORF=N`), each ORF gets its own index.
+3. **Canonical proteins are pass-through** -- pgatk does not reformat existing database headers.
+4. **Unified format across all sources** -- COSMIC, ClinVar, VCF variants all use `pgvar|` regardless of origin.
+5. **Key=value metadata** in description field for structured downstream parsing.
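
The variant header format and indexing logic in the design doc above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pgatk implementation: `VariantHeaderBuilder` and `parse_variant_header` are hypothetical names, and `"pgvar"` is an assumed value for the `PGVAR_PREFIX` constant the design proposes adding under `pgatk/config/`. The sketch also assumes that metadata values never contain whitespace.

```python
from collections import defaultdict

# Assumed value for the PGVAR_PREFIX constant proposed in the design doc.
PGVAR_PREFIX = "pgvar"


class VariantHeaderBuilder:
    """Hypothetical helper (not part of pgatk) that numbers variants per transcript."""

    def __init__(self):
        # One counter per transcript; a new builder corresponds to a new generation run.
        self._next_index = defaultdict(int)

    def build(self, transcript_id, gene_symbol, **metadata):
        """Return a header such as '>pgvar|ENST...-1|GENE Key=value ...'."""
        self._next_index[transcript_id] += 1
        accession = f"{transcript_id}-{self._next_index[transcript_id]}"
        header = f">{PGVAR_PREFIX}|{accession}|{gene_symbol}"
        # Drop keys whose value is unknown (for example, no rsID available).
        pairs = [f"{key}={value}" for key, value in metadata.items() if value is not None]
        return " ".join([header, *pairs])


def parse_variant_header(header):
    """Split a pgvar header back into its fields; assumes values contain no whitespace."""
    database, accession, rest = header.lstrip(">").split("|", 2)
    if database != PGVAR_PREFIX:
        raise ValueError(f"not a {PGVAR_PREFIX} header: {header!r}")
    gene_symbol, *tokens = rest.split()
    # rsplit keeps any dashes inside the transcript ID intact.
    transcript_id, index = accession.rsplit("-", 1)
    return {
        "transcript_id": transcript_id,
        "index": int(index),
        "gene_symbol": gene_symbol,
        "metadata": dict(token.split("=", 1) for token in tokens),
    }


builder = VariantHeaderBuilder()
first = builder.build("ENST00000311189", "HRAS", VariantSource="COSMIC", AAChange="p.G13R")
second = builder.build("ENST00000311189", "HRAS", VariantSource="COSMIC", AAChange="p.Q61L")
other = builder.build("ENST00000357654", "BRCA1", VariantSource="ClinVar", dbSNP=None)
print(first)   # >pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R
print(second)  # >pgvar|ENST00000311189-2|HRAS VariantSource=COSMIC AAChange=p.Q61L
print(other)   # >pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar
print(parse_variant_header(first)["metadata"])
```

Keeping the counter inside a builder object means each generation run starts from `-1` for every transcript, which matches the "per transcript, per run" rule. A multi-ORF output would call `build()` once per ORF with `ORF=N` in the metadata, so each ORF consumes its own index.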