cmzmasek/blastp-search

blastp_search

A command-line tool for running per-taxonomy BLASTP searches against NCBI, designed to ensure fair representation of hits across multiple taxonomic groups.

Why per-taxonomy?

Standard BLASTP returns a single pooled hit list, which tends to be dominated by well-annotated organisms. This tool runs one BLAST search per taxonomy so that each taxon gets its own hit pool, then draws a random selection from each pool, so abundant taxa cannot crowd out rarer ones.
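The per-taxon draw described above can be illustrated with a minimal sketch. This is not the tool's actual code: the function name `draw_per_taxon`, the shape of `pools`, and the hit identifiers are hypothetical and only show the idea of sampling from each taxon's own pool.

```python
import random


def draw_per_taxon(pools, max_seqs, seed=None):
    """Pick up to max_seqs hits at random from each taxon's own pool.

    pools is a hypothetical dict mapping taxon name -> list of hit IDs.
    """
    rng = random.Random(seed)
    selected = {}
    # Iterate in sorted order so a seeded draw is repeatable.
    for taxon in sorted(pools):
        hits = pools[taxon]
        k = min(max_seqs, len(hits))  # a small pool yields fewer hits
        selected[taxon] = rng.sample(hits, k)
    return selected


# Made-up pools: one abundant taxon, one rare taxon.
pools = {
    "Bacillus subtilis": [f"bsu_{i}" for i in range(400)],
    "Streptomyces coelicolor": ["sco_1", "sco_2", "sco_3"],
}
picked = draw_per_taxon(pools, max_seqs=2, seed=42)
```

Because each taxon is sampled separately, the 400-hit pool and the 3-hit pool both contribute two sequences here, which a single pooled hit list would not guarantee.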

Features

  • Per-taxonomy BLASTP searches run in parallel
  • Filters by E-value, query coverage (merged multi-HSP), and percent identity
  • Random draw ensures diversity within each taxon
  • BLAST XML result caching (skip redundant NCBI round-trips on reruns)
  • Checkpoint/resume support for interrupted runs
  • Optional TSV matrix outputs: hit counts, median identity, hit stats, majority protein name
  • Server-side taxonomy exclusion via --exclude-taxonomies
  • NCBI API key support for higher rate limits
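To make the "merged multi-HSP" coverage filter concrete, here is a rough sketch of one way such a filter could work: overlapping HSP ranges on the query are merged before the covered fraction is computed, so overlapping HSPs are not counted twice. The function names and the hit dictionary keys (`hsp_ranges`, `evalue`, `identity_pct`) are assumptions for illustration, not the tool's internals.

```python
def merged_query_coverage(hsp_ranges, query_length):
    """Percent of the query covered by the union of HSP ranges.

    hsp_ranges holds (start, end) query positions, 1-based and inclusive.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(hsp_ranges):
        if cur_end is None or start > cur_end + 1:
            # Gap before this range: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_length


def passes_filters(hit, query_length, max_evalue=1e-5,
                   min_coverage=50.0, min_identity=0.0):
    """Apply the three thresholds; defaults mirror the documented ones."""
    coverage = merged_query_coverage(hit["hsp_ranges"], query_length)
    return (hit["evalue"] <= max_evalue
            and coverage >= min_coverage
            and hit["identity_pct"] >= min_identity)


# Two overlapping HSPs cover positions 1-100 of a 200-residue query.
example_hit = {"hsp_ranges": [(1, 50), (40, 100)],
               "evalue": 1e-20, "identity_pct": 65.0}
```

Summing the two HSP lengths naively would give 111 residues; merging first gives the correct 100, which is 50% of the query.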

Requirements

pip install biopython tqdm

Usage

blastp_search.py -q QUERIES.fasta -t "Bacillus subtilis" "Streptomyces coelicolor" \
    --email you@example.com -o results/

Required arguments

| Argument | Description |
| --- | --- |
| -q / --query FASTA | Input FASTA file with one or more protein sequences |
| --email EMAIL | Your email address (required by NCBI) |
| -t / --taxonomies NAME [NAME ...] | One or more taxonomy names (may be replaced or supplemented by a taxonomy file via -T) |

At least one taxonomy must be provided via -t or -T (or both).

Taxonomy input

Taxonomies can be supplied on the command line or from a file:

# File with one name per line; optionally add a tab + integer to override -n for that taxon
Bacillus subtilis
Streptomyces coelicolor    5
Mycobacterium tuberculosis

blastp_search.py -q query.fasta -T taxa.txt --email you@example.com
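As an illustration of the taxonomy file format, the sketch below parses such lines into (name, per-taxon limit) pairs. The function name `parse_taxonomy_lines` is hypothetical. Skipping blank lines and lines starting with `#` is an assumption based on the comment line in the example file; the tool's real parser may behave differently.

```python
def parse_taxonomy_lines(lines, default_max_seqs=10):
    """Return (taxon_name, max_seqs) pairs from taxonomy-file lines.

    A tab followed by an integer overrides default_max_seqs for that taxon.
    """
    taxa = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, sep, count = line.partition("\t")
        if sep and count.strip():
            limit = int(count.strip())
        else:
            limit = default_max_seqs
        taxa.append((name.strip(), limit))
    return taxa


example_lines = [
    "# one name per line\n",
    "Bacillus subtilis\n",
    "Streptomyces coelicolor\t5\n",
    "Mycobacterium tuberculosis\n",
]
parsed = parse_taxonomy_lines(example_lines, default_max_seqs=10)
```

In this sketch only Streptomyces coelicolor gets a limit of 5; the other two taxa fall back to the default that stands in for -n.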

Key options

| Option | Default | Description |
| --- | --- | --- |
| -e / --evalue | 1e-5 | Maximum E-value |
| -c / --coverage | 50.0 | Minimum query coverage (%) |
| -i / --identity | 0.0 | Minimum percent identity (0 = no filter) |
| -n / --max-seqs | 10 | Max sequences returned per taxonomy per query |
| --hitlist-size | 500 | BLAST hit pool size per search |
| --workers | 3 | Parallel taxonomy searches per query |
| --api-key KEY | | NCBI API key (raises rate limit to 10 req/s; use with --workers 10) |
| --seed INT | | Random seed for reproducible draws (use --workers 1 for full reproducibility) |
| --no-cache | off | Disable BLAST XML caching |
| --cache-dir DIR | <outdir>/cache | Cache directory |
| --resume | off | Skip completed (query, taxonomy) pairs from a previous run |
| --no-multispecies | off | Exclude MULTISPECIES hits |
| --exclude-taxonomies NAME [NAME ...] | | Exclude taxa from all searches (server-side) |
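Server-side restriction and exclusion by taxon is commonly expressed as an Entrez query string, which Biopython's `NCBIWWW.qblast` accepts through its `entrez_query` parameter. The sketch below shows how such a string might be assembled. It is an assumption about the general approach, not a description of how this tool builds its queries, and the helper name `build_entrez_query` is hypothetical.

```python
def build_entrez_query(taxonomy, excluded=()):
    """Build an Entrez query limiting hits to one taxon, minus exclusions.

    Uses the Entrez [Organism] field and the NOT operator.
    """
    query = f'"{taxonomy}"[Organism]'
    for name in excluded:
        query += f' NOT "{name}"[Organism]'
    return query


# Search within Bacillus but leave out one well-sampled species.
example_query = build_entrez_query("Bacillus", ["Bacillus subtilis"])
```

Because the exclusion is part of the query sent to NCBI, excluded taxa never occupy slots in the returned hit pool, which is the practical benefit of filtering server-side rather than after download.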

Optional TSV matrix outputs

Each flag writes a continuously updated TSV matrix (rows = taxonomies, columns = query IDs, with a lineage column):

| Option | Content |
| --- | --- |
| --hit-counts FILE | Number of hits passing all filters (before draw) |
| --selected-hit-median-identity FILE | Median percent identity of selected hits |
| --selected-hit-stats FILE | E-value range, identity range+median, coverage range+mean |
| --selected-hit-majority-name FILE | Most common protein name among selected hits |
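For a sense of what one cell of the median-identity and majority-name matrices represents, here is a small sketch that computes both values for the selected hits of a single (taxonomy, query) pair. The hit dictionary keys `identity_pct` and `name`, the function name, and the sample values are made up for illustration.

```python
from collections import Counter
from statistics import median


def summarize_selected(hits):
    """Return median identity and most common protein name for one cell."""
    if not hits:
        # No selected hits: nothing to summarize for this cell.
        return {"median_identity": None, "majority_name": None}
    identities = [h["identity_pct"] for h in hits]
    names = Counter(h["name"] for h in hits)
    return {
        "median_identity": median(identities),
        "majority_name": names.most_common(1)[0][0],
    }


# Made-up selected hits for one taxonomy and one query.
cell_hits = [
    {"identity_pct": 61.0, "name": "DNA gyrase subunit A"},
    {"identity_pct": 72.3, "name": "DNA gyrase subunit A"},
    {"identity_pct": 88.5, "name": "hypothetical protein"},
]
cell = summarize_selected(cell_hits)
```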

Output

results/
├── <query_id>.fasta       # selected hit sequences (one file per query)
├── summary.tsv            # one row per selected hit across all queries and taxa
├── checkpoint.txt         # completed (query, taxonomy) pairs for --resume
└── cache/                 # cached BLAST XML results

FASTA sequence headers are annotated with E-value, percent identity, and query coverage:

>sp|P12345|PROT_BACS evalue=1.5e-42 id=72.3% qcov=95.0% | ...original description...
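The annotations in a header like the one above can be recovered for downstream analysis. The following sketch parses the `evalue=`, `id=`, and `qcov=` fields with a regular expression. The pattern is inferred from the single example header shown here, so treat it as an assumption about the format rather than a specification.

```python
import re

# Pattern inferred from the example header; other layouts are not handled.
HEADER_RE = re.compile(
    r"^>(?P<id>\S+)\s+"
    r"evalue=(?P<evalue>\S+)\s+"
    r"id=(?P<identity>[\d.]+)%\s+"
    r"qcov=(?P<qcov>[\d.]+)%"
)


def parse_header(line):
    """Extract sequence ID and hit statistics, or None if no match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return {
        "id": match["id"],
        "evalue": float(match["evalue"]),
        "identity_pct": float(match["identity"]),
        "query_coverage_pct": float(match["qcov"]),
    }


header = (">sp|P12345|PROT_BACS evalue=1.5e-42 id=72.3% qcov=95.0% "
          "| ...original description...")
fields = parse_header(header)
```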

summary.tsv columns: query_id, query_length, taxonomy, accession, description, evalue, identity_pct, query_coverage_pct, output_fasta.
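Since summary.tsv holds one row per selected hit, it is straightforward to tally how many hits each (query, taxonomy) pair contributed. The sketch below assumes the file is tab-delimited with a header row using exactly the column names listed above. The sample rows are invented and stand in for a real file.

```python
import csv
import io
from collections import Counter

COLUMNS = ["query_id", "query_length", "taxonomy", "accession", "description",
           "evalue", "identity_pct", "query_coverage_pct", "output_fasta"]


def hits_per_pair(handle):
    """Count selected hits per (query_id, taxonomy) in a summary TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row["query_id"], row["taxonomy"]) for row in reader)


# Invented rows in the documented column order.
rows = [
    ["q1", "300", "Bacillus subtilis", "ACC_1", "made-up protein",
     "1e-40", "72.3", "95.0", "q1.fasta"],
    ["q1", "300", "Bacillus subtilis", "ACC_2", "made-up protein",
     "3e-35", "68.1", "90.2", "q1.fasta"],
    ["q1", "300", "Streptomyces coelicolor", "ACC_3", "made-up protein",
     "2e-12", "41.7", "77.5", "q1.fasta"],
]
text = "\n".join("\t".join(r) for r in [COLUMNS] + rows) + "\n"
counts = hits_per_pair(io.StringIO(text))
```

With a real run, the in-memory text would be replaced by something like `open("results/summary.tsv", newline="")`.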

Example

# Basic run with two taxa
blastp_search.py \
    -q proteins.fasta \
    -t "Bacillus subtilis" "Streptomyces coelicolor" \
    --email you@example.com \
    -n 5 \
    -o results/

# With NCBI API key, more workers, and all matrix outputs
blastp_search.py \
    -q proteins.fasta \
    -T taxa.txt \
    --email you@example.com \
    --api-key YOUR_KEY \
    --workers 10 \
    --hit-counts results/hit_counts.tsv \
    --selected-hit-stats results/hit_stats.tsv \
    --selected-hit-majority-name results/protein_names.tsv \
    -o results/

# Resume an interrupted run
blastp_search.py \
    -q proteins.fasta \
    -T taxa.txt \
    --email you@example.com \
    --resume \
    -o results/

License

Copyright (C) 2026 Christian M. Zmasek. Licensed under the GNU General Public License v3.0.
