Update README by ypriverol · Pull Request #97 · bigbio/pgatk

ypriverol · 2026-03-04T08:01:31Z

Summary by CodeRabbit

Documentation
- Significantly enhanced README with comprehensive toolkit description, detailed feature set overview, multiple installation options, quick-start workflow guide, structured command groups, supported variant sources, and use cases.
- Added design documentation for protein accession and FASTA header format specifications with indexing logic and search engine compatibility information.

Design for issue #18: unified pgvar|transcript-index|gene format for variant proteins, compatible with all major search engines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Expands the minimal README with key features, installation methods, quick start example, full command reference table, supported variant sources, use case index, and project structure overview. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update README

qodo-code-review · 2026-03-04T08:01:47Z


Review Summary by Qodo

Expand README with comprehensive documentation and protein accession design

📝 Documentation ✨ Enhancement

Walkthroughs

Description

• Comprehensive README expansion with features, installation, quick start guide
• Added detailed command reference tables for all CLI tools and workflows
• Documented supported variant sources and 10 end-to-end use cases
• Added protein accession design document for unified FASTA header format
• Included project structure overview and contribution guidelines

Diagram

flowchart LR
  A["README.md<br/>Minimal content"] -->|"Add features,<br/>installation,<br/>quick start"| B["README.md<br/>Comprehensive guide"]
  B -->|"Include command<br/>reference tables"| C["README.md<br/>Full documentation"]
  C -->|"Add use cases<br/>and structure"| D["Complete<br/>documentation"]
  E["Design doc<br/>Issue #18"] -->|"Protein accession<br/>and FASTA header<br/>format"| F["protein-accession-design.md<br/>Unified variant format"]

File Changes

1. README.md 📝 Documentation +160/-8

Comprehensive README with features and command reference

• Expanded from minimal 26 lines to comprehensive 178-line documentation
• Added Key Features section highlighting multi-source variant integration, non-canonical ORF
 discovery, and search engine compatibility
• Included three installation methods (pip, bioconda, from source) with code examples
• Added Quick Start section with four-command workflow for building human variant protein database
• Created command reference tables organized by category (Data Downloaders, Variant-to-Protein
 Translation, Sequence Translation, Database Processing, Post-Processing)
• Added Supported Variant Sources table covering ENSEMBL, gnomAD, ClinVar, COSMIC, cBioPortal, and
 custom VCF
• Included Use Cases section with 10 detailed end-to-end workflows and link to docs/use-cases.md
• Added Project Structure section showing directory organization
• Expanded citation section with full bibliographic details
• Added Contributing and License sections

README.md

2. docs/plans/2026-03-03-protein-accession-design.md 📝 Documentation +96/-0

Protein accession and FASTA header design specification

• New design document addressing issue #18 for unified protein accession format
• Defined two-category prefix strategy: canonical proteins keep original headers, variant proteins
 use pgvar| prefix
• Specified variant header format as >pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {metadata}
• Documented metadata key-value pairs (VariantSource, GenomicCoord, AAChange, MutationType, dbSNP,
 ORF)
• Provided concrete examples of canonical and variant protein headers across COSMIC, ClinVar, and
 multi-ORF scenarios
• Explained per-transcript, per-run indexing logic for variant protein numbering
• Verified compatibility with major search engines (SearchGUI, MaxQuant, MSFragger, Comet, DIA-NN,
 Proteome Discoverer)
• Listed specific files requiring modification and design rationale for implementation decisions

docs/plans/2026-03-03-protein-accession-design.md
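
To make the header format above concrete, here is a minimal Python sketch that builds and parses a `>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {metadata}` header. It is not taken from the pgatk codebase. The function names are hypothetical, and three things are assumptions because this summary does not specify them: metadata is serialized as space-separated `Key=Value` pairs, values contain no spaces, and the per-transcript index starts at 1. The transcript, gene, and variant values are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PREFIX = "pgvar"


def number_variants(transcript_ids: Iterable[str]) -> List[str]:
    """Assign a per-transcript running index within one run (assumed 1-based)."""
    counters: Dict[str, int] = defaultdict(int)
    accessions = []
    for transcript_id in transcript_ids:
        counters[transcript_id] += 1
        accessions.append(f"{transcript_id}-{counters[transcript_id]}")
    return accessions


def format_variant_header(transcript_id: str, index: int, gene_symbol: str,
                          metadata: Dict[str, str]) -> str:
    """Build '>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {metadata}'."""
    header = f">{PREFIX}|{transcript_id}-{index}|{gene_symbol}"
    # Assumption: metadata is written as space-separated Key=Value pairs.
    meta = " ".join(f"{key}={value}" for key, value in metadata.items())
    return f"{header} {meta}" if meta else header


def parse_variant_header(header: str) -> Tuple[str, int, str, Dict[str, str]]:
    """Split a pgvar header back into its parts."""
    if not header.startswith(f">{PREFIX}|"):
        raise ValueError("not a pgvar header")
    # Everything before the first space is the identifier search engines read.
    identifier, _, description = header[1:].partition(" ")
    _, accession, gene_symbol = identifier.split("|")
    # rpartition keeps any hyphens inside the transcript ID intact.
    transcript_id, _, index = accession.rpartition("-")
    metadata = dict(pair.split("=", 1) for pair in description.split())
    return transcript_id, int(index), gene_symbol, metadata


example = format_variant_header(
    "ENST00000269305", 1, "TP53",
    {"VariantSource": "COSMIC", "AAChange": "p.R175H"},
)
print(example)
```

Keeping the accession and gene symbol in the pipe-delimited identifier, with metadata after the first space, matches the layout described in the design summary; the exact metadata syntax should be checked against the design document itself.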

qodo-code-review · 2026-03-04T08:01:48Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. Wrong decoy CLI flags 🐞 Bug ✓ Correctness

Description

README Quick Start uses --input/--output for pgatk generate-decoy, but the CLI only defines
--input_database/--output_database (and -in/-out). Following the README fails with a Click “No
such option” error, which blocks the Quick Start.

Code

README.md[R64-66]

+pgatk generate-decoy \
+    --input variant_proteins.fa \
+    --output target_decoy.fa \

Evidence
The README documents flags that do not exist on the actual Click command; the real CLI options are
--input_database and --output_database.
README.md[63-67]
pgatk/commands/proteindb_decoy.py[12-17]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
README Quick Start documents `pgatk generate-decoy` with `--input` and `--output`, but the CLI only supports `--input_database` / `--output_database` (and `-in` / `-out`). Users will hit a Click error and cannot complete the Quick Start.

## Issue Context
The Click command is defined in `pgatk/commands/proteindb_decoy.py` and does not include `--input`/`--output` aliases.

## Fix Focus Areas
- README.md[63-67]
- pgatk/commands/proteindb_decoy.py[12-17] (optional: if adding aliases)
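
One way to apply the README side of this fix is sketched below: a small Python helper that rewrites the unsupported `--input`/`--output` flags to the `--input_database`/`--output_database` names the review says the CLI defines. The helper name `fix_decoy_flags` is hypothetical, and the sketch assumes it is applied only to the `pgatk generate-decoy` snippet, not to other commands.

```python
import re

# Supported long options, per the review's reading of proteindb_decoy.py.
FLAG_FIXES = {
    "--input": "--input_database",
    "--output": "--output_database",
}

# Match the bare flags only; the lookahead skips names such as
# --input_database or --input_fasta that merely start with the same text.
_BARE_FLAG = re.compile(r"--(?:input|output)(?![\w-])")


def fix_decoy_flags(command: str) -> str:
    """Rewrite bare --input/--output flags in a generate-decoy command."""
    return _BARE_FLAG.sub(lambda match: FLAG_FIXES[match.group(0)], command)


readme_snippet = (
    "pgatk generate-decoy \\\n"
    "    --input variant_proteins.fa \\\n"
    "    --output target_decoy.fa"
)
print(fix_decoy_flags(readme_snippet))
```

The rewrite is idempotent, so running it over an already corrected snippet leaves it unchanged.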


2. genome.fa not produced 🐞 Bug ⛯ Reliability

Description

README Quick Start tells users to run gffread with -g ensembl_human/genome.fa, but
ensembl-downloader downloads the genome as a versioned *.dna_sm.toplevel.fa.gz file and the
codebase does not create a genome.fa convenience file. The Quick Start will fail unless the user
manually renames/decompresses, which is not documented.

Code

README.md[R52-54]

+gffread -F -w ensembl_human/transcripts.fa \
+    -g ensembl_human/genome.fa \
+    ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz

Evidence

The README references ensembl_human/genome.fa, but the downloader writes the genome assembly as
{Species}.{Assembly}.dna_sm.toplevel.fa.gz into the output directory; there is no code path that
creates genome.fa.

README.md[51-54]
pgatk/ensembl/data_downloader.py[474-483]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
README Quick Start uses `ensembl_human/genome.fa`, but `ensembl-downloader` saves the genome as a versioned `*.dna_sm.toplevel.fa.gz`. Users following the README will not find `genome.fa`.

## Issue Context
Downloader code constructs the genome filename as `{Species}.{Assembly}.dna_sm.toplevel.fa.gz` and downloads it directly.

## Fix Focus Areas
- README.md[51-55]
- pgatk/ensembl/data_downloader.py[474-483] (reference for actual filename pattern)
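
If the README keeps the `genome.fa` path, the missing step could look like the following Python sketch, which decompresses the downloaded `*.dna_sm.toplevel.fa.gz` file into `genome.fa`. This is not part of pgatk. The function name `prepare_genome_fasta` is hypothetical, and the sketch assumes exactly one matching genome file exists in the download directory and that gffread needs an uncompressed FASTA at that path.

```python
import glob
import gzip
import os
import shutil


def prepare_genome_fasta(download_dir: str, target_name: str = "genome.fa") -> str:
    """Decompress the single downloaded genome FASTA into a stable file name."""
    pattern = os.path.join(download_dir, "*.dna_sm.toplevel.fa.gz")
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file matching {pattern}, found {len(matches)}"
        )
    target = os.path.join(download_dir, target_name)
    # Stream the copy so a multi-gigabyte genome is never held in memory.
    with gzip.open(matches[0], "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `prepare_genome_fasta("ensembl_human")` after the download step would produce `ensembl_human/genome.fa`, the path the Quick Start passes to `gffread -g`. The alternative fix is to change the README to reference the versioned file name directly.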


3. Human VCF may be per-chrom 🐞 Bug ⛯ Reliability

Description

README Quick Start assumes a single homo_sapiens_incl_consequences.vcf.gz file exists after
ensembl-downloader, but the downloader has a fallback that downloads per-chromosome VCFs when the
combined file is unavailable and does not combine them. In that case, the README’s VCF path won’t
exist and the workflow breaks.

Code

README.md[R57-60]

+pgatk vcf-to-proteindb \
+    --vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \
+    --input_fasta ensembl_human/transcripts.fa \
+    --gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \

Evidence

Downloader first tries to fetch a combined VCF; if that download fails for Homo sapiens, it
downloads chromosome-specific VCFs (-chr1, -chr2, …) and returns them without any combine step,
while README always references the combined filename.

README.md[56-61]
pgatk/ensembl/data_downloader.py[418-454]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
README assumes `ensembl_human/homo_sapiens_incl_consequences.vcf.gz` exists. But `ensembl-downloader` may download per-chromosome VCFs for Homo sapiens when the combined file is unavailable, and it does not combine them.

## Issue Context
The downloader comment says it will combine, but no combine step is present in the shown logic.

## Fix Focus Areas
- README.md[56-61]
- pgatk/ensembl/data_downloader.py[418-454]
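
A possible workaround for the per-chromosome fallback is sketched below in Python: use the combined VCF if it exists, otherwise collect the per-chromosome files and concatenate them, keeping the header from the first file only. This is not the downloader's code. The per-chromosome glob pattern is an assumption based on the `-chr1`, `-chr2` suffixes mentioned in the evidence, and the sketch assumes all files share compatible headers. The output is plain gzip rather than bgzip, so tools that need an indexed VCF would require recompression.

```python
import glob
import gzip
import os
from typing import List

COMBINED = "homo_sapiens_incl_consequences.vcf.gz"
# Assumed naming for the fallback files; check against the downloader output.
PER_CHROM_GLOB = "homo_sapiens_incl_consequences-chr*.vcf.gz"


def resolve_vcf_inputs(download_dir: str) -> List[str]:
    """Return the combined VCF if present, else the per-chromosome VCFs."""
    combined = os.path.join(download_dir, COMBINED)
    if os.path.exists(combined):
        return [combined]
    per_chrom = sorted(glob.glob(os.path.join(download_dir, PER_CHROM_GLOB)))
    if not per_chrom:
        raise FileNotFoundError(f"no combined or per-chromosome VCF in {download_dir}")
    return per_chrom


def concat_vcfs(paths: List[str], out_path: str) -> str:
    """Concatenate VCFs, writing header lines from the first file only."""
    with gzip.open(out_path, "wt") as out:
        for position, path in enumerate(paths):
            with gzip.open(path, "rt") as src:
                for line in src:
                    # Skip '#' header lines in every file after the first.
                    if position > 0 and line.startswith("#"):
                        continue
                    out.write(line)
    return out_path
```

Note that `sorted` orders the files lexicographically (chr1, chr10, chr11, ...), not in natural chromosome order, which matters only if a downstream tool expects sorted input.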


coderabbitai · 2026-03-04T08:01:56Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a6c6c75-239b-4114-9b47-9ab2f72f4b0a

📥 Commits

Reviewing files that changed from the base of the PR and between 39fee83 and 484690d.

📒 Files selected for processing (2)

README.md
docs/plans/2026-03-03-protein-accession-design.md

📝 Walkthrough


Updated README to redefine pgatk as a Python toolkit for building proteogenomics protein sequence databases with features, installation options, and quick start workflow. Added new design documentation specifying protein accession formats and FASTA header conventions with variant prefix strategies and metadata requirements.

Changes

| Cohort / File(s) | Summary |
|---|---|
| README Documentation `README.md` | Comprehensive restructuring: redefined toolkit description, added feature set (multi-source variant integration, non-canonical ORFs, search engine compatibility), expanded installation instructions, introduced quick start workflow, structured command groups, and added supported variant sources and use cases sections. |
| Design Specification `docs/plans/2026-03-03-protein-accession-design.md` | New design documentation for protein accession and FASTA header format, specifying variant prefix strategy (`pgvar` prefix), metadata fields (VariantSource, GenomicCoord, AAChange, MutationType, dbSNP, ORF), indexing logic per transcript and per ORF, and search engine compatibility details with concrete header examples. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐰 With whiskers twitching, I review with glee,
A toolkit for proteogenomics spree,
Variants and sequences, headers redesigned,
Documentation flows with structured mind,
CodeRabbit's edits make knowledge shine,
Building tools that work just fine! 🧬



ypriverol and others added 3 commits March 4, 2026 07:04

Add protein accession and FASTA header design doc

f3dbdc5

Design for issue #18: unified pgvar|transcript-index|gene format for variant proteins, compatible with all major search engines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #96 from bigbio/feature/protein-accession-design

484690d

Update README

ypriverol merged commit b9f8d17 into master Mar 4, 2026
1 of 5 checks passed

Update README #97: ypriverol merged 3 commits into master from dev
