A command-line tool for running per-taxonomy BLASTP searches against NCBI, designed to ensure fair representation of hits across multiple taxonomic groups.
Standard BLASTP returns a single pooled hit list that well-annotated organisms tend to dominate. This tool runs one BLAST search per taxonomy so that each taxon gets its own hit pool, then draws a random selection from each pool, which prevents abundant taxa from crowding out rarer ones.
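The per-taxon draw can be illustrated with a small, self-contained sketch. The hit pools and the `draw_per_taxon` helper are hypothetical and only mimic the idea described above; they are not taken from the tool's code.

```python
import random

def draw_per_taxon(hits_by_taxon, max_seqs, seed=None):
    """Draw up to max_seqs hits at random from each taxon's own pool."""
    rng = random.Random(seed)
    selected = {}
    # Iterating in sorted order keeps seeded draws repeatable.
    for taxon in sorted(hits_by_taxon):
        pool = hits_by_taxon[taxon]
        selected[taxon] = rng.sample(pool, min(max_seqs, len(pool)))
    return selected

# Hypothetical pools: one abundant taxon and one sparse taxon.
pools = {
    "Bacillus subtilis": [f"BSU_{i}" for i in range(400)],
    "Streptomyces coelicolor": ["SCO_1", "SCO_2", "SCO_3"],
}
picked = draw_per_taxon(pools, max_seqs=2, seed=42)
for taxon, hits in picked.items():
    print(taxon, len(hits))
```

In a single pooled list of these 403 hits, a random draw of four would almost always consist of the abundant taxon alone; drawing from separate pools gives each taxon the same number of slots.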
- Per-taxonomy BLASTP searches run in parallel
- Filters by E-value, query coverage (merged multi-HSP), and percent identity
- Random draw ensures diversity within each taxon
- BLAST XML result caching (skip redundant NCBI round-trips on reruns)
- Checkpoint/resume support for interrupted runs
- Optional TSV matrix outputs: hit counts, median identity, hit stats, majority protein name
- Server-side taxonomy exclusion via `--exclude-taxonomies`
- NCBI API key support for higher rate limits
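One common way to restrict an NCBI BLAST search to a taxon, and to exclude others on the server side, is an Entrez query built from `"name"[Organism]` terms. The sketch below only builds such a query string. Whether `blastp_search.py` constructs its queries this way is an assumption, and `build_entrez_query` is a hypothetical helper.

```python
def build_entrez_query(taxon, excluded=()):
    """Build an Entrez query limiting hits to `taxon` and dropping `excluded` taxa."""
    query = f'"{taxon}"[Organism]'
    for name in excluded:
        # Each excluded taxon is removed with a boolean NOT term.
        query += f' NOT "{name}"[Organism]'
    return query

print(build_entrez_query("Bacillus", excluded=["Bacillus subtilis"]))
```

With Biopython, a string like this could be passed as the `entrez_query` argument of `Bio.Blast.NCBIWWW.qblast`, alongside `expect` and `hitlist_size`.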
pip install biopython tqdm
blastp_search.py -q QUERIES.fasta -t "Bacillus subtilis" "Streptomyces coelicolor" \
--email you@example.com -o results/
| Argument | Description |
|---|---|
| `-q` / `--query FASTA` | Input FASTA file with one or more protein sequences |
| `--email EMAIL` | Your email address (required by NCBI) |
| `-t` / `--taxonomies NAME [NAME ...]` | One or more taxonomy names |
At least one taxonomy must be provided via -t or -T (or both).
Taxonomies can be supplied on the command line or from a file:
# File with one name per line; optionally add a tab + integer to override -n for that taxon
Bacillus subtilis
Streptomyces coelicolor 5
Mycobacterium tuberculosis
blastp_search.py -q query.fasta -T taxa.txt --email you@example.com
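The file format above can be read with a few lines of Python. This is a minimal sketch, assuming that lines starting with `#` are comments and that blank lines are ignored; `parse_taxa_file` is a hypothetical helper, not the tool's own parser.

```python
def parse_taxa_file(lines, default_max_seqs=10):
    """Return (taxonomy_name, max_seqs) pairs from the lines of a taxa file."""
    taxa = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        # An optional tab separates the name from a per-taxon override of -n.
        name, sep, count = line.partition("\t")
        max_seqs = int(count) if sep and count.strip() else default_max_seqs
        taxa.append((name.strip(), max_seqs))
    return taxa

example = [
    "# one name per line\n",
    "Bacillus subtilis\n",
    "Streptomyces coelicolor\t5\n",
    "Mycobacterium tuberculosis\n",
]
for name, max_seqs in parse_taxa_file(example):
    print(name, max_seqs)
```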
| Option | Default | Description |
|---|---|---|
| `-e` / `--evalue` | `1e-5` | Maximum E-value |
| `-c` / `--coverage` | `50.0` | Minimum query coverage (%) |
| `-i` / `--identity` | `0.0` | Minimum percent identity (0 = no filter) |
| `-n` / `--max-seqs` | `10` | Max sequences returned per taxonomy per query |
| `--hitlist-size` | `500` | BLAST hit pool size per search |
| `--workers` | `3` | Parallel taxonomy searches per query |
| `--api-key KEY` | none | NCBI API key (raises rate limit to 10 req/s; use with `--workers 10`) |
| `--seed INT` | none | Random seed for reproducible draws (use `--workers 1` for full reproducibility) |
| `--no-cache` | off | Disable BLAST XML caching |
| `--cache-dir DIR` | `<outdir>/cache` | Cache directory |
| `--resume` | off | Skip completed (query, taxonomy) pairs from a previous run |
| `--no-multispecies` | off | Exclude MULTISPECIES hits |
| `--exclude-taxonomies NAME [NAME ...]` | none | Exclude taxa from all searches (server-side) |
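Query coverage for a hit with several HSPs is usually computed over the union of the HSP ranges, so that overlapping HSPs are not counted twice. The following is a minimal sketch of that interval merge, combined with the three thresholds at their default values. The function names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
def merged_query_coverage(hsp_ranges, query_length):
    """Percent of the query covered by the union of HSP ranges.

    hsp_ranges holds 1-based, inclusive (start, end) positions on the query.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(hsp_ranges):
        if cur_end is None or start > cur_end + 1:
            # Gap before this HSP: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent HSP: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_length

def passes_filters(evalue, coverage, identity,
                   max_evalue=1e-5, min_coverage=50.0, min_identity=0.0):
    """Apply the E-value, coverage, and identity thresholds together."""
    return evalue <= max_evalue and coverage >= min_coverage and identity >= min_identity

cov = merged_query_coverage([(1, 60), (50, 120), (150, 180)], query_length=200)
print(cov, passes_filters(1.5e-42, cov, 72.3))
```

Here the first two HSPs overlap on positions 50 to 60, so the merged coverage is 151 of 200 residues rather than the 162 obtained by summing HSP lengths.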
Each flag writes a continuously updated TSV matrix (rows = taxonomies, columns = query IDs, with a lineage column):
| Option | Content |
|---|---|
| `--hit-counts FILE` | Number of hits passing all filters (before draw) |
| `--selected-hit-median-identity FILE` | Median percent identity of selected hits |
| `--selected-hit-stats FILE` | E-value range, identity range+median, coverage range+mean |
| `--selected-hit-majority-name FILE` | Most common protein name among selected hits |
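The matrix layout (taxonomies as rows, query IDs as columns, plus a lineage column) can be sketched as follows for the median-identity case. The column header names, the position of the lineage column, the empty-cell convention, and the identity values are all assumptions made for illustration.

```python
import csv
import io
from statistics import median

def write_matrix(rows, query_ids, lineages, out):
    """Write a taxonomy-by-query TSV matrix; empty cells mean no value."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["taxonomy", "lineage", *query_ids])
    for taxon, per_query in rows.items():
        cells = [per_query.get(q, "") for q in query_ids]
        writer.writerow([taxon, lineages.get(taxon, ""), *cells])

# Hypothetical percent identities of selected hits per (taxonomy, query).
identities = {
    "Bacillus subtilis": {"q1": [72.3, 65.0, 80.1], "q2": [55.5]},
    "Streptomyces coelicolor": {"q1": [40.0, 42.0]},
}
medians = {
    taxon: {q: median(values) for q, values in per_query.items()}
    for taxon, per_query in identities.items()
}
lineages = {
    "Bacillus subtilis": "Bacteria; Bacillota",
    "Streptomyces coelicolor": "Bacteria; Actinomycetota",
}

buf = io.StringIO()
write_matrix(medians, ["q1", "q2"], lineages, buf)
print(buf.getvalue())
```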
results/
├── <query_id>.fasta # selected hit sequences (one file per query)
├── summary.tsv # one row per selected hit across all queries and taxa
├── checkpoint.txt # completed (query, taxonomy) pairs for --resume
└── cache/ # cached BLAST XML results
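Checkpoint and resume logic of this kind typically appends each completed (query, taxonomy) pair to a file and skips those pairs on the next run. The sketch below assumes one tab-separated pair per line; the actual format of `checkpoint.txt` is not documented here, and both helper functions are hypothetical.

```python
import os
import tempfile

def load_checkpoint(path):
    """Return the set of completed (query_id, taxonomy) pairs recorded in path."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                query_id, sep, taxonomy = line.rstrip("\n").partition("\t")
                if sep:
                    done.add((query_id, taxonomy))
    return done

def mark_done(path, query_id, taxonomy):
    """Append one completed pair so progress survives an interruption."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{query_id}\t{taxonomy}\n")

all_pairs = [("q1", "Bacillus subtilis"), ("q1", "Streptomyces coelicolor")]
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.txt")
    mark_done(checkpoint, "q1", "Bacillus subtilis")
    done = load_checkpoint(checkpoint)
    # On resume, only pairs missing from the checkpoint are searched again.
    pending = [pair for pair in all_pairs if pair not in done]
print(pending)
```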
FASTA sequence headers are annotated with E-value, percent identity, and query coverage:
>sp|P12345|PROT_BACS evalue=1.5e-42 id=72.3% qcov=95.0% | ...original description...
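For downstream scripts, the annotations can be pulled back out of a header with a regular expression. This is a minimal sketch based only on the example header above; `parse_header` and the returned key names are hypothetical.

```python
import re

HEADER_RE = re.compile(
    r"^>(?P<seq_id>\S+)\s+evalue=(?P<evalue>\S+)"
    r"\s+id=(?P<identity>[\d.]+)%\s+qcov=(?P<qcov>[\d.]+)%"
)

def parse_header(header):
    """Extract the sequence ID and the three annotations from a FASTA header."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return {
        "seq_id": match["seq_id"],
        "evalue": float(match["evalue"]),
        "identity_pct": float(match["identity"]),
        "query_coverage_pct": float(match["qcov"]),
    }

info = parse_header(
    ">sp|P12345|PROT_BACS evalue=1.5e-42 id=72.3% qcov=95.0% | ...original description..."
)
print(info)
```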
summary.tsv columns: query_id, query_length, taxonomy, accession, description, evalue, identity_pct, query_coverage_pct, output_fasta.
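A file with these columns can be summarized with the standard `csv` module. The sketch below counts selected hits per (query, taxonomy) pair. It assumes the file starts with a header row holding the column names listed above, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

COLUMNS = ["query_id", "query_length", "taxonomy", "accession", "description",
           "evalue", "identity_pct", "query_coverage_pct", "output_fasta"]

def hits_per_taxonomy(handle):
    """Count rows per (query_id, taxonomy) in a summary.tsv-style file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row["query_id"], row["taxonomy"]) for row in reader)

# Made-up rows standing in for a real summary.tsv.
rows = [
    ["q1", "200", "Bacillus subtilis", "ACC_0001", "example protein A",
     "1.5e-42", "72.3", "95.0", "results/q1.fasta"],
    ["q1", "200", "Bacillus subtilis", "ACC_0002", "example protein B",
     "3.0e-30", "65.0", "88.0", "results/q1.fasta"],
    ["q1", "200", "Streptomyces coelicolor", "ACC_0003", "example protein C",
     "2.0e-12", "41.0", "70.5", "results/q1.fasta"],
]
sample = "\n".join("\t".join(r) for r in [COLUMNS, *rows]) + "\n"

counts = hits_per_taxonomy(io.StringIO(sample))
for (query_id, taxonomy), n in sorted(counts.items()):
    print(query_id, taxonomy, n)
```

To read a real file, replace the `io.StringIO` object with `open("results/summary.tsv", newline="")`.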
# Basic run with two taxa
blastp_search.py \
-q proteins.fasta \
-t "Bacillus subtilis" "Streptomyces coelicolor" \
--email you@example.com \
-n 5 \
-o results/
# With NCBI API key, more workers, and all matrix outputs
blastp_search.py \
-q proteins.fasta \
-T taxa.txt \
--email you@example.com \
--api-key YOUR_KEY \
--workers 10 \
--hit-counts results/hit_counts.tsv \
--selected-hit-stats results/hit_stats.tsv \
--selected-hit-majority-name results/protein_names.tsv \
-o results/
# Resume an interrupted run
blastp_search.py \
-q proteins.fasta \
-T taxa.txt \
--email you@example.com \
--resume \
-o results/

Copyright (C) 2026 Christian M. Zmasek. Licensed under the GNU General Public License v3.0.