Update README #97

Merged: ypriverol merged 3 commits into master from dev on Mar 4, 2026

ypriverol (Member) commented Mar 4, 2026

Summary by CodeRabbit

  • Documentation
    • Significantly enhanced README with comprehensive toolkit description, detailed feature set overview, multiple installation options, quick-start workflow guide, structured command groups, supported variant sources, and use cases.
    • Added design documentation for protein accession and FASTA header format specifications with indexing logic and search engine compatibility information.

ypriverol and others added 3 commits March 4, 2026 07:04
Design for issue #18: unified pgvar|transcript-index|gene format
for variant proteins, compatible with all major search engines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expands the minimal README with key features, installation methods,
quick start example, full command reference table, supported variant
sources, use case index, and project structure overview.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@qodo-code-review

Review Summary by Qodo

Expand README with comprehensive documentation and protein accession design

📝 Documentation ✨ Enhancement

Walkthroughs

Description
• Comprehensive README expansion with features, installation, quick start guide
• Added detailed command reference tables for all CLI tools and workflows
• Documented supported variant sources and 10 end-to-end use cases
• Added protein accession design document for unified FASTA header format
• Included project structure overview and contribution guidelines
Diagram
flowchart LR
  A["README.md<br/>Minimal content"] -->|"Add features,<br/>installation,<br/>quick start"| B["README.md<br/>Comprehensive guide"]
  B -->|"Include command<br/>reference tables"| C["README.md<br/>Full documentation"]
  C -->|"Add use cases<br/>and structure"| D["Complete<br/>documentation"]
  E["Design doc<br/>Issue #18"] -->|"Protein accession<br/>and FASTA header<br/>format"| F["protein-accession-design.md<br/>Unified variant format"]

File Changes

1. README.md 📝 Documentation +160/-8

Comprehensive README with features and command reference

• Expanded from minimal 26 lines to comprehensive 178-line documentation
• Added Key Features section highlighting multi-source variant integration, non-canonical ORF
 discovery, and search engine compatibility
• Included three installation methods (pip, bioconda, from source) with code examples
• Added Quick Start section with four-command workflow for building human variant protein database
• Created command reference tables organized by category (Data Downloaders, Variant-to-Protein
 Translation, Sequence Translation, Database Processing, Post-Processing)
• Added Supported Variant Sources table covering ENSEMBL, gnomAD, ClinVar, COSMIC, cBioPortal, and
 custom VCF
• Included Use Cases section with 10 detailed end-to-end workflows and link to docs/use-cases.md
• Added Project Structure section showing directory organization
• Expanded citation section with full bibliographic details
• Added Contributing and License sections

2. docs/plans/2026-03-03-protein-accession-design.md 📝 Documentation +96/-0

Protein accession and FASTA header design specification

• New design document addressing issue #18 for unified protein accession format
• Defined two-category prefix strategy: canonical proteins keep original headers, variant proteins
 use pgvar| prefix
• Specified variant header format as >pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {metadata}
• Documented metadata key-value pairs (VariantSource, GenomicCoord, AAChange, MutationType, dbSNP,
 ORF)
• Provided concrete examples of canonical and variant protein headers across COSMIC, ClinVar, and
 multi-ORF scenarios
• Explained per-transcript, per-run indexing logic for variant protein numbering
• Verified compatibility with major search engines (SearchGUI, MaxQuant, MSFragger, Comet, DIA-NN,
 Proteome Discoverer)
• Listed specific files requiring modification and design rationale for implementation decisions
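
The header layout and the per-transcript, per-run indexing summarized above can be sketched in Python. This is an illustrative sketch, not pgatk's implementation: the class name `VariantHeaderBuilder`, the `key=value` rendering of the metadata, the 1-based index, and the example transcript and gene identifiers are all assumptions. Only the `>pgvar|{TRANSCRIPT_ID}-{INDEX}|{GENE_SYMBOL} {metadata}` layout and the metadata key names come from the design summary.

```python
from collections import defaultdict


class VariantHeaderBuilder:
    """Hypothetical helper: numbers variant proteins per transcript within one run."""

    def __init__(self):
        # One counter per transcript; a new builder (a new run) restarts at 1.
        self._next_index = defaultdict(int)

    def header(self, transcript_id, gene_symbol, **metadata):
        self._next_index[transcript_id] += 1
        index = self._next_index[transcript_id]
        accession = f"pgvar|{transcript_id}-{index}|{gene_symbol}"
        # Assumption: metadata is rendered as space-separated key=value pairs.
        fields = " ".join(f"{key}={value}" for key, value in metadata.items())
        return f">{accession} {fields}".rstrip()


builder = VariantHeaderBuilder()
# Placeholder identifiers, not real Ensembl transcripts.
first = builder.header("ENST00000000001", "GENE1", VariantSource="COSMIC", AAChange="p.A10V")
second = builder.header("ENST00000000001", "GENE1", VariantSource="ClinVar", AAChange="p.G20D")
other = builder.header("ENST00000000002", "GENE2", VariantSource="ClinVar")
print(first)
print(second)
print(other)
```

Keeping the counter inside the builder means the index depends only on the order in which variants of a transcript are emitted during a run, which matches the "per-transcript, per-run" wording but also means accessions are not stable across runs.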

qodo-code-review Bot commented Mar 4, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0)

Action required

1. Wrong decoy CLI flags 🐞 Bug ✓ Correctness
Description
README Quick Start uses --input/--output for pgatk generate-decoy, but the CLI only defines
--input_database/--output_database (and -in/-out). Following the README fails with a Click “No
such option” error, which blocks the Quick Start.
Code

README.md[R64-66]

+pgatk generate-decoy \
+    --input variant_proteins.fa \
+    --output target_decoy.fa \
Evidence
The README documents flags that do not exist on the actual Click command; the real CLI options are
--input_database and --output_database.

README.md[63-67]
pgatk/commands/proteindb_decoy.py[12-17]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
README Quick Start documents `pgatk generate-decoy` with `--input` and `--output`, but the CLI only supports `--input_database` / `--output_database` (and `-in` / `-out`). Users will hit a Click error and cannot complete the Quick Start.

## Issue Context
The Click command is defined in `pgatk/commands/proteindb_decoy.py` and does not include `--input`/`--output` aliases.

## Fix Focus Areas
- README.md[63-67]
- pgatk/commands/proteindb_decoy.py[12-17] (optional: if adding aliases)
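
One way to catch this class of mismatch is to compare the long options used in a documented command against the options the CLI is known to accept. The sketch below is a hypothetical check, not part of pgatk: `unknown_long_options` and `KNOWN_DECOY_OPTIONS` are made-up names, and the allowed set lists only the two long options named in this review, so a real check would need the full option list from `pgatk/commands/proteindb_decoy.py`. The snippets omit the options that the README command continues with.

```python
import re

# Assumption: only the long options named in the review; the real command defines more.
KNOWN_DECOY_OPTIONS = {"--input_database", "--output_database"}


def unknown_long_options(command_text, known_options):
    """Return long options in command_text that are not in known_options, in order."""
    # Match "--name" tokens that start a whitespace-separated word.
    used = re.findall(r"(?<!\S)--[A-Za-z][\w-]*", command_text)
    return [flag for flag in used if flag not in known_options]


readme_snippet = """pgatk generate-decoy \\
    --input variant_proteins.fa \\
    --output target_decoy.fa"""

fixed_snippet = """pgatk generate-decoy \\
    --input_database variant_proteins.fa \\
    --output_database target_decoy.fa"""

print(unknown_long_options(readme_snippet, KNOWN_DECOY_OPTIONS))
print(unknown_long_options(fixed_snippet, KNOWN_DECOY_OPTIONS))
```

The first call reports `--input` and `--output`; the second reports nothing, which is the state the README fix should reach.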


2. genome.fa not produced 🐞 Bug ⛯ Reliability
Description
README Quick Start tells users to run gffread with -g ensembl_human/genome.fa, but
ensembl-downloader downloads the genome as a versioned *.dna_sm.toplevel.fa.gz file and the
codebase does not create a genome.fa convenience file. The Quick Start will fail unless the user
manually renames/decompresses, which is not documented.
Code

README.md[R52-54]

+gffread -F -w ensembl_human/transcripts.fa \
+    -g ensembl_human/genome.fa \
+    ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz
Evidence
The README references ensembl_human/genome.fa, but the downloader writes the genome assembly as
{Species}.{Assembly}.dna_sm.toplevel.fa.gz into the output directory; there is no code path that
creates genome.fa.

README.md[51-54]
pgatk/ensembl/data_downloader.py[474-483]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
README Quick Start uses `ensembl_human/genome.fa`, but `ensembl-downloader` saves the genome as a versioned `*.dna_sm.toplevel.fa.gz`. Users following the README will not find `genome.fa`.

## Issue Context
Downloader code constructs the genome filename as `{Species}.{Assembly}.dna_sm.toplevel.fa.gz` and downloads it directly.

## Fix Focus Areas
- README.md[51-55]
- pgatk/ensembl/data_downloader.py[474-483] (reference for actual filename pattern)
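
A possible user-side remediation is to decompress the downloaded assembly to the path the Quick Start expects. The sketch below assumes the downloader's output directory contains exactly one file matching `*.dna_sm.toplevel.fa.gz`; the function name `materialize_genome_fasta` is illustrative, the default target name `genome.fa` is taken from the README command, and none of this is pgatk code.

```python
import gzip
import shutil
from pathlib import Path


def materialize_genome_fasta(download_dir, target_name="genome.fa"):
    """Decompress the single *.dna_sm.toplevel.fa.gz in download_dir to target_name."""
    download_dir = Path(download_dir)
    candidates = sorted(download_dir.glob("*.dna_sm.toplevel.fa.gz"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected one *.dna_sm.toplevel.fa.gz in {download_dir}, "
            f"found {len(candidates)}"
        )
    target = download_dir / target_name
    # Stream the decompression so a large genome never has to fit in memory.
    with gzip.open(candidates[0], "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `materialize_genome_fasta("ensembl_human")` after the download step would make the `-g ensembl_human/genome.fa` argument in the README command resolve. Documenting the real filename pattern in the README is the alternative that needs no extra step.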


Remediation recommended

3. Human VCF may be per-chromosome 🐞 Bug ⛯ Reliability
Description
README Quick Start assumes a single homo_sapiens_incl_consequences.vcf.gz file exists after
ensembl-downloader, but the downloader has a fallback that downloads per-chromosome VCFs when the
combined file is unavailable and does not combine them. In that case, the README’s VCF path won’t
exist and the workflow breaks.
Code

README.md[R57-60]

+pgatk vcf-to-proteindb \
+    --vcf ensembl_human/homo_sapiens_incl_consequences.vcf.gz \
+    --input_fasta ensembl_human/transcripts.fa \
+    --gene_annotations_gtf ensembl_human/Homo_sapiens.GRCh38.*.gtf.gz \
Evidence
Downloader first tries to fetch a combined VCF; if that download fails for Homo sapiens, it
downloads chromosome-specific VCFs (-chr1, -chr2, …) and returns them without any combine step,
while README always references the combined filename.

README.md[56-61]
pgatk/ensembl/data_downloader.py[418-454]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
README assumes `ensembl_human/homo_sapiens_incl_consequences.vcf.gz` exists. But `ensembl-downloader` may download per-chromosome VCFs for Homo sapiens when the combined file is unavailable, and it does not combine them.

## Issue Context
The downloader comment says it will combine, but no combine step is present in the shown logic.

## Fix Focus Areas
- README.md[56-61]
- pgatk/ensembl/data_downloader.py[418-454]
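
Until the downloader combines the files itself, a workflow can handle both outcomes by checking for the combined VCF first and falling back to the per-chromosome files. The sketch below is illustrative only: `resolve_vcf_inputs` is a hypothetical helper, and the `*-chr*.vcf.gz` glob is an assumption based on the `-chr1`, `-chr2` suffixes mentioned in the evidence, not the downloader's confirmed naming.

```python
from pathlib import Path

COMBINED_VCF = "homo_sapiens_incl_consequences.vcf.gz"


def resolve_vcf_inputs(download_dir, combined_name=COMBINED_VCF):
    """Return [combined VCF] if it exists, else the sorted per-chromosome VCFs."""
    download_dir = Path(download_dir)
    combined = download_dir / combined_name
    if combined.exists():
        return [combined]
    # Assumption: per-chromosome files carry a "-chr<name>" suffix before .vcf.gz.
    per_chromosome = sorted(download_dir.glob("*-chr*.vcf.gz"))
    if not per_chromosome:
        raise FileNotFoundError(
            f"no combined or per-chromosome VCF found in {download_dir}"
        )
    return per_chromosome
```

Each returned path could then be passed to its own `pgatk vcf-to-proteindb --vcf ...` run, or the files could be merged beforehand with a tool such as `bcftools concat`. Which of the two fits better depends on how the downstream steps expect their input, which this review does not settle.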

coderabbitai Bot commented Mar 4, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a6c6c75-239b-4114-9b47-9ab2f72f4b0a

📥 Commits

Reviewing files that changed from the base of the PR and between 39fee83 and 484690d.

📒 Files selected for processing (2)
  • README.md
  • docs/plans/2026-03-03-protein-accession-design.md

Walkthrough

Updated README to redefine pgatk as a Python toolkit for building proteogenomics protein sequence databases with features, installation options, and quick start workflow. Added new design documentation specifying protein accession formats and FASTA header conventions with variant prefix strategies and metadata requirements.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| README Documentation: `README.md` | Comprehensive restructuring: redefined toolkit description, added feature set (multi-source variant integration, non-canonical ORFs, search engine compatibility), expanded installation instructions, introduced quick start workflow, structured command groups, and added supported variant sources and use cases sections. |
| Design Specification: `docs/plans/2026-03-03-protein-accession-design.md` | New design documentation for protein accession and FASTA header format, specifying variant prefix strategy (pgvar prefix), metadata fields (VariantSource, GenomicCoord, AAChange, MutationType, dbSNP, ORF), indexing logic per transcript and per ORF, and search engine compatibility details with concrete header examples. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐰 With whiskers twitching, I review with glee,
A toolkit for proteogenomics spree,
Variants and sequences, headers redesigned,
Documentation flows with structured mind,
CodeRabbit's edits make knowledge shine,
Building tools that work just fine! 🧬

@ypriverol ypriverol merged commit b9f8d17 into master Mar 4, 2026
1 of 5 checks passed