Update README #97
Merged
3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,26 +1,178 @@ | ||
| # ProteoGenomics Analysis Toolkit | ||
| # pgatk -- ProteoGenomics Analysis Toolkit | ||
|
|
||
|  | ||
| [](http://bioconda.github.io/recipes/pgatk/README.html) | ||
| [](https://www.codacy.com/gh/bigbio/pgatk/dashboard?utm_source=github.com&utm_medium=referral&utm_content=bigbio/pgatk&utm_campaign=Badge_Grade) | ||
| [](https://badge.fury.io/py/pgatk) | ||
|  | ||
|
|
||
| **pgatk** is a Python library for proteogenomics data analysis. It provides bioinformatics tools to download, translate, and generate protein sequence databases from reference and mutation genome databases. | ||
| **pgatk** is a Python toolkit for building proteogenomics protein sequence databases. It downloads, translates, and combines variant and non-canonical sequences from multiple genomic sources into search-ready FASTA databases compatible with all major proteomics search engines. | ||
|
|
||
| ## Quick Install | ||
| ## Key Features | ||
|
|
||
| - **Multi-source variant integration** -- Translate variants from ENSEMBL, VCF files, COSMIC, cBioPortal, ClinVar, and gnomAD into protein sequences | ||
| - **Non-canonical ORF discovery** -- Three-frame and six-frame translation of lncRNAs, pseudogenes, antisense transcripts, and alternative reading frames | ||
| - **Any species** -- Supports all organisms available in ENSEMBL (human, mouse, rice, wheat, etc.) | ||
| - **Search engine compatible** -- Output FASTA files work with MaxQuant, SearchGUI, MSFragger, Comet, DIA-NN, and Proteome Discoverer | ||
| - **Decoy generation** -- Multiple target-decoy strategies (DecoyPYrat, protein-reverse, protein-shuffle) | ||
| - **Peptide-to-genome mapping** -- Map identified peptides back to genomic coordinates (GFF3) for genome browser visualization | ||
| - **ClinVar without VEP** -- ClinVar pipeline uses BedTools interval overlap, no VEP annotation required | ||
|
|
||
| ## Installation | ||
|
|
||
| ### pip (recommended) | ||
|
|
||
| ```bash | ||
| pip install pgatk | ||
| ``` | ||
|
|
||
| ### Bioconda | ||
|
|
||
| ```bash | ||
| conda install -c bioconda pgatk | ||
| ``` | ||
|
|
||
| ### From source | ||
|
|
||
| ```bash | ||
| git clone https://github.com/bigbio/pgatk.git | ||
| cd pgatk | ||
| pip install . | ||
| ``` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| Build a human variant protein database in four commands: | ||
|
|
||
| ```bash | ||
| # 1. Download ENSEMBL data for human | ||
| pgatk ensembl-downloader -t 9606 -o ensembl_human | ||
|
|
||
| # 2. Extract transcript sequences (requires gffread) | ||
| gffread -F -w ensembl_human/transcripts.fa \ | ||
| -g ensembl_human/genome.fa \ | ||
| ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz | ||
|
|
||
| # 3. Translate variants to protein sequences | ||
| pgatk vcf-to-proteindb \ | ||
| --vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \ | ||
| --input_fasta ensembl_human/transcripts.fa \ | ||
| --gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \ | ||
| --output_proteindb variant_proteins.fa | ||
|
|
||
| # 4. Generate target-decoy database | ||
| pgatk generate-decoy \ | ||
| --input variant_proteins.fa \ | ||
| --output target_decoy.fa \ | ||
|
Comment on lines +64 to +66:
**1. Wrong decoy CLI flags.** The README Quick Start uses `--input`/`--output` for `pgatk generate-decoy`, but the CLI only defines `--input_database`/`--output_database` (and `-in`/`-out`). Following the README fails with Click "No such option" errors, which blocks the Quick Start.
|
||
| --method decoypyrat | ||
| ``` | ||
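The review comment above can be made concrete with a small sketch that assembles the step 4 invocation using the option names the reviewer cites. The `--input_database` and `--output_database` flags come from that comment. `--method decoypyrat` is carried over unchanged from the README and has not been checked against the CLI. The helper name `build_generate_decoy_argv` is hypothetical.

```python
import shlex


def build_generate_decoy_argv(input_db, output_db, method="decoypyrat"):
    """Return the argv list for the decoy step of the Quick Start.

    Flag names follow the review comment; --method is an unverified
    carry-over from the README.
    """
    return [
        "pgatk", "generate-decoy",
        "--input_database", input_db,
        "--output_database", output_db,
        "--method", method,
    ]


argv = build_generate_decoy_argv("variant_proteins.fa", "target_decoy.fa")
# Print a copy-pasteable shell command; pass `argv` to subprocess.run to execute it.
print(shlex.join(argv))
```

Building the command as a list rather than a string keeps file names with spaces intact if the list is later handed to `subprocess.run`.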
|
|
||
| ## Commands | ||
|
|
||
| ### Data Downloaders | ||
|
|
||
| | Command | Description | | ||
| |---------|-------------| | ||
| | `ensembl-downloader` | Download ENSEMBL reference data (GTF, FASTA, VCF) for any species by taxonomy ID | | ||
| | `ncbi-downloader` | Download NCBI RefSeq annotations and ClinVar VCF | | ||
| | `cosmic-downloader` | Download COSMIC somatic mutation data (requires account) | | ||
| | `cbioportal-downloader` | Download cBioPortal cancer genomics studies | | ||
|
|
||
| ### Variant-to-Protein Translation | ||
|
|
||
| | Command | Description | | ||
| |---------|-------------| | ||
| | `vcf-to-proteindb` | Translate VCF variants (ENSEMBL, gnomAD, patient WES/WGS) to protein sequences | | ||
| | `clinvar-to-proteindb` | Translate ClinVar clinical variants (no VEP required) | | ||
| | `cosmic-to-proteindb` | Translate COSMIC somatic mutations, with optional tissue-type splitting | | ||
| | `cbioportal-to-proteindb` | Translate cBioPortal study mutations to protein sequences | | ||
|
|
||
| ### Sequence Translation | ||
|
|
||
| | Command | Description | | ||
| |---------|-------------| | ||
| | `dnaseq-to-proteindb` | Translate DNA sequences with biotype filtering, multi-frame ORFs, and expression thresholds | | ||
| | `threeframe-translation` | Three-frame translation of transcript sequences | | ||
|
|
||
| ### Database Processing | ||
|
|
||
| | Command | Description | | ||
| |---------|-------------| | ||
| | `generate-decoy` | Generate decoy sequences (methods: `decoypyrat`, `protein-reverse`, `protein-shuffle`, `pgdbdeep`) | | ||
| | `ensembl-check` | Validate protein database -- filter short sequences, handle stop codons | | ||
|
|
||
| ### Post-Processing | ||
|
|
||
| | Command | Description | | ||
| |---------|-------------| | ||
| | `digest-mutant-protein` | In silico digest of variant proteins, filter against canonical proteome to extract unique peptides | | ||
| | `map-peptide2genome` | Map identified peptides to genomic coordinates (GFF3 output) | | ||
| | `spectrumai` | Inspect MS2 spectra of peptide identifications | | ||
| | `blast_get_position` | BLAST peptides against a reference database | | ||
|
|
||
| ## Supported Variant Sources | ||
|
|
||
| | Source | Command | Description | | ||
| |--------|---------|-------------| | ||
| | ENSEMBL | `vcf-to-proteindb` | Population variants (SNPs, indels) for any ENSEMBL species | | ||
| | gnomAD | `vcf-to-proteindb` | Ancestry-stratified population variants (AF_afr, AF_eas, AF_nfe, etc.) | | ||
| | ClinVar | `clinvar-to-proteindb` | Clinically annotated pathogenic/benign variants | | ||
| | COSMIC | `cosmic-to-proteindb` | Somatic cancer mutations, per tissue type or cell line | | ||
| | cBioPortal | `cbioportal-to-proteindb` | Cancer study mutations from TCGA, METABRIC, etc. | | ||
| | Custom VCF | `vcf-to-proteindb` | Patient WGS/WES variants from any variant caller (GATK, Strelka, MuTect2) | | ||
|
|
||
| ## Use Cases | ||
|
|
||
| Detailed end-to-end workflows are available in [docs/use-cases.md](docs/use-cases.md): | ||
|
|
||
| 1. **Cell-type specific non-canonical peptide discovery** -- Reproduce the analysis from Umer et al. 2022 | ||
| 2. **Human variant protein database** -- Standard ENSEMBL-based variant proteogenomics | ||
| 3. **Population-specific databases** -- gnomAD ancestry-stratified variant databases | ||
| 4. **ClinVar clinical variants** -- Clinical variant detection at the protein level | ||
| 5. **Cancer proteogenomics** -- COSMIC, cBioPortal, and patient-specific tumor databases | ||
| 6. **Novel ORF and micropeptide discovery** -- lncRNA, pseudogene, and alternative ORF translation | ||
| 7. **Genome annotation refinement** -- Six-frame translation and peptide-to-genome mapping | ||
| 8. **Metaproteomics** -- Six-frame translation of metagenome assemblies | ||
| 9. **Long-read transcriptomics** -- Isoform-resolved protein databases from PacBio/ONT data | ||
| 10. **Plant and non-model organisms** -- Proteogenomics for any ENSEMBL species | ||
|
|
||
| ## Project Structure | ||
|
|
||
| ``` | ||
| pgatk/ | ||
| ├── commands/ # CLI command definitions (Click) | ||
| ├── ensembl/ # ENSEMBL data download and VCF translation | ||
| ├── cgenomes/ # COSMIC and cBioPortal handling | ||
| ├── clinvar/ # ClinVar variant translation | ||
| ├── proteogenomics/ # Spectral validation tools | ||
| ├── proteomics/ # Protein database utilities (decoy generation) | ||
| ├── db/ # Peptide digestion and genome mapping | ||
| ├── config/ # YAML configuration files | ||
| └── toolbox/ # Shared utilities | ||
| ``` | ||
|
|
||
| ## Full Documentation | ||
|
|
||
| [https://pgatk.quantms.org](https://pgatk.quantms.org) | ||
|
|
||
| ## Cite as | ||
| ## Cite | ||
|
|
||
| If you use pgatk in your research, please cite: | ||
|
|
||
| > Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol. | ||
| > **Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.** | ||
| > *Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470--1472. | ||
| > [https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838) | ||
|
|
||
| ## Contributing | ||
|
|
||
| ```bash | ||
| git clone https://github.com/bigbio/pgatk.git | ||
| cd pgatk | ||
| pip install -e ".[dev]" | ||
| pytest | ||
| ``` | ||
|
|
||
| ## License | ||
|
|
||
| Husen M Umer, Enrique Audain, Yafeng Zhu, Julianus Pfeuffer, Timo Sachsenberg, Janne Lehtiö, Rui M Branca, Yasset Perez-Riverol. | ||
| Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. | ||
| *Bioinformatics*, Volume 38, Issue 5, 1 March 2022, Pages 1470-1472. | ||
| [https://doi.org/10.1093/bioinformatics/btab838](https://doi.org/10.1093/bioinformatics/btab838) | ||
| Apache License 2.0 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # Protein Accession and FASTA Header Design | ||
|
|
||
| Issue: https://github.com/bigbio/pgatk/issues/18 | ||
| Branch: `feature/protein-accession-design` | ||
| Date: 2026-03-03 | ||
|
|
||
| ## Problem | ||
|
|
||
| Current pgatk FASTA headers are inconsistent across variant sources (VCF, COSMIC, ClinVar) and incompatible with major search engines like SearchGUI, which cannot parse ENSEMBL-style IDs. | ||
|
|
||
| ## Design | ||
|
|
||
| ### Two protein categories, two prefix strategies | ||
|
|
||
| | Category | Prefix | Accession | Description | | ||
| |----------|--------|-----------|-------------| | ||
| | Canonical (reference) | Keep original (`sp\|`, `tr\|`, `ensp\|`) | Original accession | Untouched from source database | | ||
| | Variant (mutated) | `pgvar\|` | `{TRANSCRIPT_ID}-{INDEX}` | pgatk-generated variant protein | | ||
|
|
||
| ### Variant header format | ||
|
|
||
| ``` | ||
| >pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {key=value metadata} | ||
| ``` | ||
|
|
||
| **Fields:** | ||
|
|
||
| - `pgvar` -- database tag identifying pgatk-generated variant proteins. | ||
| - `{TRANSCRIPT_ID}-{INDEX}` -- accession composed of parent transcript ID and a dash-separated 1-based index (per transcript, per run). Mirrors UniProt isoform convention (`P12345-2`). | ||
| - `{GENE_SYMBOL}` -- gene name, first token after the second pipe. | ||
| - Metadata key=value pairs in the description field: | ||
|
|
||
| | Key | Description | Example | | ||
| |-----|-------------|---------| | ||
| | `VariantSource` | Origin database | `COSMIC`, `ClinVar`, `gnomAD`, `dbSNP` | | ||
| | `GenomicCoord` | `chr:pos:ref:alt` | `12:25245347:C:G` | | ||
| | `AAChange` | HGVS protein notation | `p.G13R` | | ||
| | `MutationType` | SO term or short label | `missense_variant` | | ||
| | `dbSNP` | rsID if available | `rs121913529` | | ||
| | `ORF` | Reading frame number (only when multi-ORF) | `1`, `2`, `3` | | ||
|
|
||
| ### Examples | ||
|
|
||
| ```fasta | ||
| # Canonical proteins -- untouched from source databases | ||
| >sp|P01112|RASH_HUMAN GTPase HRas OS=Homo sapiens | ||
| >ensp|ENSP00000309845|BRCA1 | ||
|
|
||
| # Variant proteins -- unified pgvar| prefix regardless of source | ||
| >pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R MutationType=missense_variant GenomicCoord=12:25245347:C:G | ||
| >pgvar|ENST00000311189-2|HRAS VariantSource=COSMIC AAChange=p.Q61L MutationType=missense_variant GenomicCoord=12:25245350:A:T | ||
| >pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar AAChange=p.R1699Q MutationType=missense_variant GenomicCoord=17:43094464:G:A dbSNP=rs41293455 | ||
|
|
||
| # Multiple ORFs -- each ORF gets its own index, ORF number in metadata | ||
| >pgvar|ENST00000311189-3|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=1 | ||
| >pgvar|ENST00000311189-4|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=2 | ||
| >pgvar|ENST00000311189-5|HRAS VariantSource=COSMIC AAChange=p.G13R ORF=3 | ||
| ``` | ||
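As an illustration of the header format above, the following minimal sketch assembles a variant header from its parts. The function name `format_pgvar_header` and its signature are hypothetical and are not taken from the pgatk code base. It assumes metadata values contain no whitespace.

```python
def format_pgvar_header(transcript_id, index, gene_symbol, metadata=None):
    """Build a '>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} key=value ...' line.

    `metadata` is a dict; its insertion order is preserved in the output.
    """
    accession = f"{transcript_id}-{index}"
    header = f">pgvar|{accession}|{gene_symbol}"
    description = " ".join(f"{key}={value}" for key, value in (metadata or {}).items())
    # Only append the description when there is metadata to write.
    return f"{header} {description}" if description else header


example_header = format_pgvar_header(
    "ENST00000311189",
    1,
    "HRAS",
    {
        "VariantSource": "COSMIC",
        "AAChange": "p.G13R",
        "MutationType": "missense_variant",
        "GenomicCoord": "12:25245347:C:G",
    },
)
print(example_header)
```

With these inputs the sketch reproduces the first `pgvar` example listed above.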
|
|
||
| ### Indexing logic | ||
|
|
||
| The `-{INDEX}` suffix is counted per transcript, per file-generation run: | ||
|
|
||
| - First variant on `ENST00000311189` gets `-1` | ||
| - Second variant on same transcript gets `-2` | ||
| - Multi-ORF outputs each consume an index (3 ORFs = 3 indices) | ||
| - First variant on a different transcript resets to `-1` | ||
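The indexing rules above can be sketched with one counter per transcript. This is an assumption about how the logic could be implemented, not the code in the branch. The class name `VariantIndexer` is hypothetical, and one instance is meant to live for exactly one file-generation run.

```python
from collections import defaultdict


class VariantIndexer:
    """Hand out 1-based, per-transcript indices for one generation run."""

    def __init__(self):
        # Number of accessions already issued for each transcript ID.
        self._issued = defaultdict(int)

    def next_accession(self, transcript_id):
        self._issued[transcript_id] += 1
        return f"{transcript_id}-{self._issued[transcript_id]}"


indexer = VariantIndexer()
first = indexer.next_accession("ENST00000311189")    # first variant
second = indexer.next_accession("ENST00000311189")   # second variant, same transcript
# A multi-ORF variant consumes one index per ORF (three ORFs, three indices).
orfs = [indexer.next_accession("ENST00000311189") for _ in range(3)]
other = indexer.next_accession("ENST00000357654")    # a different transcript starts at 1
print(first, second, orfs, other)
```

Creating a fresh `VariantIndexer` for each output file gives the per-run behaviour. The ORF number itself would go into the `ORF=N` metadata rather than into the accession.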
|
|
||
| ### Search engine compatibility | ||
|
|
||
| | Engine | Compatible | Notes | | ||
| |--------|-----------|-------| | ||
| | SearchGUI / PeptideShaker | Yes | Matches UniProt-like `db\|acc\|name` pattern | | ||
| | MaxQuant | Yes | Default UniProt parse rule works | | ||
| | MSFragger / FragPipe | Yes | Reads full header, splits on first whitespace | | ||
| | Comet | Yes | Parses `>db\|acc\|` natively | | ||
| | DIA-NN | Yes | Follows UniProt-style parsing | | ||
| | Proteome Discoverer | Yes | Supports pipe-delimited headers | | ||
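The compatibility claims above rest on the headers having a `db|acc|name` shape. The sketch below is a quick sanity check of that shape. The regular expression is an assumption written for this example and does not reproduce the parse rule of any listed search engine.

```python
import re

# '>' + db + '|' + accession + '|' + name, optionally followed by a description.
HEADER_PATTERN = re.compile(
    r"^>(?P<db>[^|\s]+)\|(?P<acc>[^|\s]+)\|(?P<name>[^|\s]+)(?:\s+(?P<desc>.*))?$"
)


def split_header(header):
    """Return (db, accession, name) or None when the shape does not match."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        return None
    return match.group("db"), match.group("acc"), match.group("name")


canonical = split_header(">sp|P01112|RASH_HUMAN GTPase HRas OS=Homo sapiens")
variant = split_header(">pgvar|ENST00000311189-1|HRAS VariantSource=COSMIC AAChange=p.G13R")
malformed = split_header(">ENST00000311189 no pipes here")
print(canonical, variant, malformed)
```

Canonical and `pgvar` headers pass the same check, which is the point of the unified prefix. Real engine compatibility still has to be confirmed against each tool.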
|
|
||
| ### Files to modify | ||
|
|
||
| | File | Change | | ||
| |------|--------| | ||
| | `pgatk/ensembl/ensembl.py` | Refactor `vcf_to_proteindb()` header construction (lines 661-664) | | ||
| | `pgatk/clinvar/clinvar_service.py` | Refactor header construction (lines 554-560) | | ||
| | `pgatk/cgenomes/cgenomes_proteindb.py` | Refactor COSMIC header (line 317), cBioPortal header | | ||
| | `pgatk/toolbox/vcf_utils.py` | Update `write_output()` to handle new format cleanly | | ||
| | `pgatk/config/` | Add constants for `PGVAR_PREFIX`, metadata keys | | ||
|
|
||
| ### Design decisions | ||
|
|
||
| 1. **Dash separator** (`-`) between transcript and index, consistent with UniProt isoform convention. | ||
| 2. **No ORF suffix in accession** -- ORF number is metadata (`ORF=N`), each ORF gets its own index. | ||
| 3. **Canonical proteins are pass-through** -- pgatk does not reformat existing database headers. | ||
| 4. **Unified format across all sources** -- COSMIC, ClinVar, VCF variants all use `pgvar|` regardless of origin. | ||
| 5. **Key=value metadata** in description field for structured downstream parsing. |
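To show what decision 5 enables downstream, here is a minimal parsing sketch. The function name `parse_pgvar_header` is hypothetical. It assumes metadata values contain no spaces and that the index is the part after the last dash of the accession.

```python
def parse_pgvar_header(header):
    """Split a pgvar FASTA header into accession parts and a metadata dict."""
    if not header.startswith(">pgvar|"):
        raise ValueError(f"not a pgvar header: {header!r}")
    # Everything up to the first space is 'db|accession|gene'.
    identifiers, _, description = header[1:].partition(" ")
    _db, accession, gene_symbol = identifiers.split("|")
    # rpartition keeps transcript IDs that themselves contain dashes intact.
    transcript_id, _, index = accession.rpartition("-")
    metadata = dict(
        token.split("=", 1) for token in description.split() if "=" in token
    )
    return {
        "accession": accession,
        "transcript_id": transcript_id,
        "index": int(index),
        "gene_symbol": gene_symbol,
        "metadata": metadata,
    }


record = parse_pgvar_header(
    ">pgvar|ENST00000357654-1|BRCA1 VariantSource=ClinVar AAChange=p.R1699Q "
    "MutationType=missense_variant GenomicCoord=17:43094464:G:A dbSNP=rs41293455"
)
print(record["transcript_id"], record["index"], record["metadata"]["dbSNP"])
```

Because every source uses the same keys, one parser like this would cover COSMIC, ClinVar, and VCF-derived entries alike.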
2. Genome.fa not produced
Bug · Reliability