blastp_search

A command-line tool for running per-taxonomy BLASTP searches against NCBI, designed to ensure fair representation of hits across multiple taxonomic groups.

Why per-taxonomy?

Standard BLASTP returns a single pooled hit list dominated by well-annotated organisms. This tool runs one BLAST search per taxonomy so each taxon gets its own hit pool, then draws a random selection from each pool, so abundant taxa cannot crowd out rarer ones.
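
The per-taxon draw can be illustrated with a minimal sketch. This is not the tool's actual code: the function name `draw_per_taxon`, the hit labels, and the choice to iterate taxa in sorted order are assumptions made for illustration.

```python
import random


def draw_per_taxon(hits_by_taxon, max_seqs, seed=None):
    """Draw up to max_seqs hits from each taxon's own pool (hypothetical helper)."""
    rng = random.Random(seed)
    selected = {}
    # Iterate in sorted order so a seeded run draws the same hits every time.
    for taxon in sorted(hits_by_taxon):
        pool = hits_by_taxon[taxon]
        k = min(max_seqs, len(pool))  # a small pool simply yields fewer hits
        selected[taxon] = rng.sample(pool, k)
    return selected


# Made-up pools: one abundant taxon, one rare taxon.
pools = {
    "Bacillus subtilis": [f"bsu_hit_{i}" for i in range(200)],
    "Streptomyces coelicolor": ["sco_hit_0", "sco_hit_1", "sco_hit_2"],
}
picked = draw_per_taxon(pools, max_seqs=10, seed=42)
print({taxon: len(hits) for taxon, hits in picked.items()})
# {'Bacillus subtilis': 10, 'Streptomyces coelicolor': 3}
```

Because each taxon is sampled from its own pool, the 200 hits of the abundant taxon never compete with the 3 hits of the rare one.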

Features

Per-taxonomy BLASTP searches run in parallel
Filters by E-value, query coverage (merged multi-HSP), and percent identity
Random draw ensures diversity within each taxon
BLAST XML result caching (skip redundant NCBI round-trips on reruns)
Checkpoint/resume support for interrupted runs
Optional TSV matrix outputs: hit counts, median identity, hit stats, majority protein name
Server-side taxonomy exclusion via --exclude-taxonomies
NCBI API key support for higher rate limits
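
"Merged multi-HSP" coverage means that overlapping HSPs on the query are counted once rather than summed. The sketch below shows one common way to compute this by merging intervals; the function name `merged_query_coverage` and the 1-based inclusive coordinate convention are assumptions, and the tool's own implementation may differ.

```python
def merged_query_coverage(hsp_ranges, query_length):
    """Percent of the query covered by the union of HSP ranges.

    hsp_ranges: iterable of (start, end) query coordinates, 1-based inclusive
    (an assumption of this sketch).
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(hsp_ranges):
        if current_end is None or start > current_end + 1:
            # Gap before this HSP: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent HSP: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / query_length


# Two overlapping HSPs plus one separate HSP on a 200-residue query.
print(merged_query_coverage([(1, 50), (40, 100), (150, 200)], 200))  # 75.5
```

Summing the three HSP lengths naively would give 162 residues (81%), which overstates coverage because positions 40 to 50 are counted twice.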

Requirements

Python 3.9+
Biopython
tqdm

pip install biopython tqdm

Usage

blastp_search.py -q QUERIES.fasta -t "Bacillus subtilis" "Streptomyces coelicolor" \
    --email you@example.com -o results/

Required arguments

Argument	Description
`-q / --query FASTA`	Input FASTA file with one or more protein sequences
`--email EMAIL`	Your email address (required by NCBI)
`-t / --taxonomies NAME [NAME ...]`	One or more taxonomy names

At least one taxonomy must be provided, either on the command line with -t or from a file with -T (or both).

Taxonomy input

Taxonomies can be supplied on the command line or from a file:

# File with one name per line; optionally add a tab + integer to override -n for that taxon
Bacillus subtilis
Streptomyces coelicolor    5
Mycobacterium tuberculosis

blastp_search.py -q query.fasta -T taxa.txt --email you@example.com
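
As a rough illustration of the file format, the sketch below parses taxonomy lines into (name, max sequences) pairs. The function name `parse_taxonomy_lines` is hypothetical, and skipping blank lines and lines starting with `#` is an assumption of this sketch rather than documented behaviour.

```python
def parse_taxonomy_lines(lines, default_max_seqs=10):
    """Parse 'name' or 'name<TAB>integer' lines into (name, max_seqs) pairs."""
    taxa = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, sep, override = line.partition("\t")
        if sep and override.strip():
            max_seqs = int(override.strip())  # per-taxon override of -n
        else:
            max_seqs = default_max_seqs
        taxa.append((name.strip(), max_seqs))
    return taxa


example = [
    "# one name per line\n",
    "Bacillus subtilis\n",
    "Streptomyces coelicolor\t5\n",
    "Mycobacterium tuberculosis\n",
]
print(parse_taxonomy_lines(example))
# [('Bacillus subtilis', 10), ('Streptomyces coelicolor', 5), ('Mycobacterium tuberculosis', 10)]
```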

Key options

Option	Default	Description
`-e / --evalue`	`1e-5`	Maximum E-value
`-c / --coverage`	`50.0`	Minimum query coverage (%)
`-i / --identity`	`0.0`	Minimum percent identity (0 = no filter)
`-n / --max-seqs`	`10`	Max sequences returned per taxonomy per query
`--hitlist-size`	`500`	BLAST hit pool size per search
`--workers`	`3`	Parallel taxonomy searches per query
`--api-key KEY`	—	NCBI API key (raises rate limit to 10 req/s; use with `--workers 10`)
`--seed INT`	—	Random seed for reproducible draws (use `--workers 1` for full reproducibility)
`--no-cache`	off	Disable BLAST XML caching
`--cache-dir DIR`	`<outdir>/cache`	Cache directory
`--resume`	off	Skip completed (query, taxonomy) pairs from a previous run
`--no-multispecies`	off	Exclude MULTISPECIES hits
`--exclude-taxonomies NAME [NAME ...]`	—	Exclude taxa from all searches (server-side)
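
Server-side taxonomy restriction and exclusion is typically expressed as an Entrez query that NCBI applies before hits are returned. The sketch below only builds such a string; the function name `build_entrez_query` is hypothetical, and whether the tool composes its query in exactly this form is an assumption.

```python
def build_entrez_query(taxon, excluded=()):
    """Build an Entrez query limiting a search to one taxon, minus exclusions."""
    query = f'"{taxon}"[Organism]'
    for name in excluded:
        # Each excluded taxon is removed with a NOT clause.
        query += f' NOT "{name}"[Organism]'
    return query


print(build_entrez_query("Bacillus", excluded=["Bacillus subtilis"]))
# "Bacillus"[Organism] NOT "Bacillus subtilis"[Organism]
```

A string like this could be passed as the `entrez_query` argument of Biopython's `NCBIWWW.qblast`, so that excluded taxa never occupy slots in the hit pool.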

Optional TSV matrix outputs

Each flag writes a continuously updated TSV matrix (rows = taxonomies, columns = query IDs, with a lineage column):

Option	Content
`--hit-counts FILE`	Number of hits passing all filters (before draw)
`--selected-hit-median-identity FILE`	Median percent identity of selected hits
`--selected-hit-stats FILE`	E-value range, identity range+median, coverage range+mean
`--selected-hit-majority-name FILE`	Most common protein name among selected hits
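
To make the matrix layout concrete, here is a minimal sketch that computes median identities and writes them as a taxonomy-by-query TSV. The helper names, the one-decimal formatting, and the empty cell for missing pairs are assumptions, the identity values are made up, and the lineage column is omitted for brevity.

```python
import csv
import io
import statistics


def median_identity_cells(identities):
    """Map (taxonomy, query_id) to the median identity, as a one-decimal string."""
    return {
        key: f"{statistics.median(values):.1f}"
        for key, values in identities.items()
        if values
    }


def write_matrix(cells, taxa, query_ids, handle):
    """Write rows = taxonomies, columns = query IDs; missing pairs stay empty."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["taxonomy"] + list(query_ids))
    for taxon in taxa:
        writer.writerow([taxon] + [cells.get((taxon, q), "") for q in query_ids])


# Made-up identities for two taxa and one query; q2 has no selected hits.
identities = {
    ("Bacillus subtilis", "q1"): [72.3, 65.0, 80.1],
    ("Streptomyces coelicolor", "q1"): [55.0, 61.0],
}
buffer = io.StringIO()
write_matrix(
    median_identity_cells(identities),
    ["Bacillus subtilis", "Streptomyces coelicolor"],
    ["q1", "q2"],
    buffer,
)
print(buffer.getvalue())
```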

Output

results/
├── <query_id>.fasta       # selected hit sequences (one file per query)
├── summary.tsv            # one row per selected hit across all queries and taxa
├── checkpoint.txt         # completed (query, taxonomy) pairs for --resume
└── cache/                 # cached BLAST XML results

FASTA sequence headers are annotated with E-value, percent identity, and query coverage:

>sp|P12345|PROT_BACS evalue=1.5e-42 id=72.3% qcov=95.0% | ...original description...
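
For downstream scripts, the annotations can be recovered from a header with a regular expression. This is a sketch based only on the example header above; the pattern and the function name `parse_header` are assumptions, and headers in a different layout will simply return `None`.

```python
import re

# Pattern derived from the example header; not taken from the tool's source.
HEADER_RE = re.compile(
    r"^>?(?P<seq_id>\S+)\s+"
    r"evalue=(?P<evalue>\S+)\s+"
    r"id=(?P<identity>[\d.]+)%\s+"
    r"qcov=(?P<coverage>[\d.]+)%"
)


def parse_header(line):
    """Extract sequence ID, E-value, identity and coverage from a header line."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return {
        "seq_id": match.group("seq_id"),
        "evalue": float(match.group("evalue")),
        "identity_pct": float(match.group("identity")),
        "query_coverage_pct": float(match.group("coverage")),
    }


header = ">sp|P12345|PROT_BACS evalue=1.5e-42 id=72.3% qcov=95.0% | description"
print(parse_header(header))
# {'seq_id': 'sp|P12345|PROT_BACS', 'evalue': 1.5e-42, 'identity_pct': 72.3, 'query_coverage_pct': 95.0}
```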

summary.tsv columns: query_id, query_length, taxonomy, accession, description, evalue, identity_pct, query_coverage_pct, output_fasta.
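
A quick way to check how evenly hits are spread is to count rows per (query, taxonomy) pair. The sketch below assumes summary.tsv is tab-separated with a header row using the column names listed above; the function name `hits_per_pair` is hypothetical and the sample rows are made up purely for illustration.

```python
import csv
import io
from collections import Counter


def hits_per_pair(handle):
    """Count selected hits per (query_id, taxonomy) pair in a summary.tsv stream."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row["query_id"], row["taxonomy"]) for row in reader)


# Made-up sample with only the columns this sketch needs.
sample = io.StringIO(
    "query_id\ttaxonomy\taccession\n"
    "q1\tBacillus subtilis\tACC_1\n"
    "q1\tBacillus subtilis\tACC_2\n"
    "q1\tStreptomyces coelicolor\tACC_3\n"
)
counts = hits_per_pair(sample)
print(counts[("q1", "Bacillus subtilis")])  # 2
```

With a real file, replace the `io.StringIO` sample with `open("results/summary.tsv", newline="")`.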

Example

# Basic run with two taxa
blastp_search.py \
    -q proteins.fasta \
    -t "Bacillus subtilis" "Streptomyces coelicolor" \
    --email you@example.com \
    -n 5 \
    -o results/

# With NCBI API key, more workers, and all matrix outputs
blastp_search.py \
    -q proteins.fasta \
    -T taxa.txt \
    --email you@example.com \
    --api-key YOUR_KEY \
    --workers 10 \
    --hit-counts results/hit_counts.tsv \
    --selected-hit-stats results/hit_stats.tsv \
    --selected-hit-majority-name results/protein_names.tsv \
    -o results/

# Resume an interrupted run
blastp_search.py \
    -q proteins.fasta \
    -T taxa.txt \
    --email you@example.com \
    --resume \
    -o results/
