Sledge splits protein sequence databases into train / test / (optional) validation sets (sledge_splitter), scores and filters sequence pairs with a pHMMER-based engine (phmmer_filter), and orchestrates multi-tool filtering pipelines (sledge_filter: pHMMER, MMseqs2, BLAST, FASTA ssearch36). Sledge is released under the MIT License (see License).
To cite Sledge, please use:
[SLEDGE_CITATION_PLACEHOLDER]
[SLEDGE_PAPER_LINK_PLACEHOLDER]
We would like to thank the developers of HMMER3 1 for the EASEL tools and base pHMMER pipeline that have been customized for sledge_splitter and sledge_filter.
| OS | Build support | sledge_filter support |
External tools support | Notes |
|---|---|---|---|---|
| Linux (x86_64) | Official | Official | Official | Primary supported platform. |
| macOS (Apple Silicon / Intel) | Official | Official | Official | Use Homebrew/system packages for dependencies. |
| Windows (WSL2) | Official (via Linux in WSL2) | Official (via Bash in WSL2) | Official (via Linux in WSL2) | Recommended Windows path. |
| Windows (native) | Planned | Planned | Planned | Not currently first-class; requires separate installers and shell adaptations. |
Core build tools (all supported platforms)
- C compiler (
gcc/clang) - GNU
make ar,ranlib- Bash (for
sledge_filterand install helpers) pthreadand math library (-lm)
Runtime dependencies for sledge_filter
- MMseqs2
- NCBI BLAST+ (
makeblastdb,blastp) - FASTA package (
ssearch36)
Refer to the external tools section for installation through make.
git clone https://github.com/sledgeseq/sledge.git
cd sledge/ # adjust if your repo root differs
./configure # detects CPU (SSE on x86_64, NEON on aarch64/arm64); writes build-config.mk
makeShell (bash vs zsh): It does not matter whether you use bash or zsh as your login shell. ./configure has a #!/usr/bin/env bash, so it always runs under bash. The same commands work in Terminal with zsh.
If you see “Permission denied” when running scripts: your clone may be missing executable bits. Re-enable with:
chmod +x configure install/install_external.sh install/macos/install_external_macos.shor
chmod +x configure install/install_external.sh install/linux/install_external_linux.shYou can also run scripts via bash <script> as a fallback (for example, bash configure).
Re-run ./configure after changing machines or if you want to force a backend (./configure --with-impl=sse or --with-impl=neon). make distclean removes build-config.mk as well as build products.
Executables land in bin/:
| Binary | Role |
|---|---|
sledge_splitter |
Profmark-style split of one FASTA DB into train / test / val |
phmmer_filter |
Pairwise sequence similarity filter using HMMER |
sledge_filter |
Shell pipeline: configurable order of p / m / b / s steps |
Add bin to your PATH or call tools with full paths:
export PATH="/path/to/sledge/bin:$PATH"Install external tools through the Makefile:
make install-externalThis installs MMseqs2, BLAST+, and FASTA tools into sledge/external_tools by default.
Linux (install/linux/install_external_linux.sh) selects downloads from uname -m: x86_64 uses MMseqs2 sse2 binaries and NCBI x64-linux BLAST; aarch64/arm64 uses MMseqs2 arm64 and NCBI aarch64-linux BLAST, and builds FASTA36 with non-SSE makefiles. Override MMseqs flavor with MMSEQS_ARCH (e.g. avx2 on capable x86_64) or NCBI tarball name with BLAST_PLATFORM if needed.
If external tools are already installed, point sledge_filter to them in your config (see install/test_filter.config):
| Config key | Set this to |
|---|---|
MMSEQS |
Full path to mmseqs executable |
BLAST_DIR |
Directory containing makeblastdb and blastp |
FASTA_DIR |
Directory containing ssearch36 |
sledge_filter can also run on a subset of tools. Only the tools used in --order need to be installed.
make install-external uses the cross-platform installer dispatcher under install/.
You can still call the dispatcher directly if needed:
./install/install_external.sh /path/to/sledgeThe fasta36 checkout used by the installer needs compatibility patches to build with current GCC and Clang; see FASTA36 GCC patches near the end of this README.
macOS: install/macos/install_external_macos.sh installs MMseqs2 and BLAST+ via Homebrew (brew, native for Intel vs Apple Silicon) when available, and otherwise falls back to direct upstream tarball downloads (mmseqs-osx-universal, ncbi-blast-<version>+-x64-macosx by default). FASTA36 is built from source, choosing FASTA makefiles by uname -m (ARM Macs no longer prefer x86-only makefiles first). Optional overrides: MMSEQS_TAG, MMSEQS_MACOS_ASSET, BLAST_VERSION, BLAST_PLATFORM.
On Windows, run the PowerShell entrypoint (WSL2-backed flow):
powershell -ExecutionPolicy Bypass -File .\install\windows\install_external_windows.ps1 -SledgeDir C:\path\to\sledgeAfter building binaries and configuring/installing external tools:
make test-installmake test-install auto-detects OS:
- Linux/macOS: runs
install/test_installation.sh - Windows: routes to
install/windows/test_installation_windows.ps1(WSL2-backed)
Direct Windows entrypoint is still available if needed:
powershell -ExecutionPolicy Bypass -File .\install\windows\test_installation_windows.ps1 -SledgeDir C:\path\to\sledgeSplitter — one input DB, written train/test/val under the current directory
sledge_splitter --dbblock 100 --test_limit 20 --val_limit 10 -o stats --output_dir results seqDB.fastaMulti-tool pipeline — order pmbs (pHMMER → MMseqs2 → BLAST → Smith–Waterman)
sledge_filter --order pmbs --config pipeline.config --fixed-file test.fasta --db-file train_candidates.fasta| Argument | Description |
|---|---|
<seqdb> |
One input protein sequence file (positional; must be last). |
| Option | Default | Description |
|---|---|---|
-o <prefix> |
- |
Prefix for stats / summary output files (e.g. stats → stats_0.txt). |
-Z <n> |
inferred from --dbblock | Effective database size for E-value calculation. |
--cpu <n> |
1 |
Worker threads. |
--dbblock <n> |
1000 |
Number of sequences in database. |
--test_limit <n> |
75000 |
Stop once at least this many sequences are assigned to test. |
--val_limit <n> |
10000 |
Stop once at least this many are assigned to validation (unless disabled). |
--init_chunk <n> |
10 |
Sequences considered for assignment at one go |
--seed <n> |
42 |
RNG seed (0 = one-time random seed). |
--suppress |
off | Disable progress bar. |
--task_id <id> |
0 |
Suffix for output files (*_0.fasta, etc.). |
| Option | Description |
|---|---|
--output_dir <dir> |
Write train, test, val, discard under <dir> (with default basenames unless overridden). |
--train_path, --test_path, --val_path, --discard_path |
Basename prefixes for the four FASTA streams (defaults: train, test, val, discard). |
| Option | Description |
|---|---|
-E <x> |
Treat hits with E-value ≤ x as significant. |
--plow, --phigh |
PID window for accepting a sequence into the train/test/val set (see -h for interaction with the algorithm). |
| Option | Description |
|---|---|
--load_tr, --load_te, --load_val |
Resume from existing train/test/val FASTA (- = none). |
--halt <n> |
Process at most n sequences (debug). |
--resume <n> |
Start after sequence index n. |
--freq <n> |
Progress update frequency. |
For complete set of options: run sledge_splitter -h.
| Option | Description |
|---|---|
--order <string> |
Tool order: p = pHMMER (phmmer_filter), m = MMseqs2, b = BLAST+, s = Smith–Waterman (ssearch36). Example: pmbs, spm. |
--config <file> |
Bash-sourced config (see below). |
--fixed-file <fasta> |
“Fixed” side (e.g. reference or test set). |
--db-file <fasta> |
Database to filter against the fixed set. |
| Option | Description |
|---|---|
--out-suffix <name> |
Suffix for per-tool output dirs (phmmer_<suffix>, …); defaults to TASK_ID from config. |
OUT_DIR must be set (base directory for outputs). Typically also set:
| Variable | Role |
|---|---|
PHMMER_FILTER |
Path to phmmer_filter binary. |
MMSEQS |
MMseqs2 executable. |
BLAST_DIR |
Directory containing makeblastdb and blastp. |
FASTA_DIR |
Directory containing ssearch36. |
TASK_ID |
Integer or SLURM (uses SLURM_ARRAY_TASK_ID). |
REMOVE_TARGET |
db or fixed — which side sequences are removed from after hits. |
E_VALUE |
E-value threshold (shared across tools where applicable). |
Z_SIZE |
Positive integer; used for pHMMER / SW calibration. |
SLEDGE_CORES, MMSEQS_CORES, BLAST_CORES, SW_CORES |
Thread counts for each tool. |
BLAST_DBSIZE, BLAST_MAX_TARGET_SEQS, MMSEQS_MAX_SEQS |
BLAST / MMseqs2 tuning. |
Optional: KEEP_INTERMEDIATES, USE_LONG_ID, etc. (see script header in src/sledge_filter.sh).
Minimal example config
OUT_DIR="./filter_run_out"
PHMMER_FILTER="/path/to/sledge_minimal/bin/phmmer_filter"
MMSEQS="mmseqs"
BLAST_DIR="/path/to/ncbi-blast/bin"
FASTA_DIR="/path/to/fasta/bin"
TASK_ID=0If you are not interested in using a suite of tools to filter and want to use just the optimized PHMMER filter for different functionalities, refer to the documentation below.
| Argument | Description |
|---|---|
<qdb> |
Query sequence database (first positional). |
<tdb> |
Target sequence database (second positional). |
One of qdb or tdb may be - (stdin), not both.
| Option | Default | Description |
|---|---|---|
-o <prefix> |
- |
Output prefix for result files. |
-Z <n> |
inferred from database by default | Database size for E-value calibration. |
-E <x> |
10.0 |
Reporting E-value threshold. |
--cpu <n> |
1 |
Threads. |
--qsize <n> |
1 |
Queries per thread per batch. |
--qblock <n> |
10 |
Query block size. |
--tblock <n> |
1000 |
Target block size. |
--format <n> |
1 |
Output format (see below). |
--all_hits |
off | Do not stop early on first failing comparison (when applicable). |
--task_id <id> |
0 |
Shard id in output filenames. |
--seed <n> |
42 |
RNG seed. |
--suppress |
off | Disable progress bar. |
| Value | Meaning |
|---|---|
0 |
Per-sequence ACCEPT/REJECT string. |
1 |
Full information for rejected hits (default). |
2 |
IDs of accepted sequences. |
3 |
IDs of rejected queries. |
4 |
IDs of rejected targets (e.g. with --all_hits). |
--qformat / --tformat force Easel format names (e.g. fasta) and skip autodetection.
-
Splitter — train/val/test from a single database
Build disjoint splits for benchmarking: tune--test_limit,--val_limit,-Z, and--cpufor large DBs.sledge_splitter -Z 1e6 --cpu 16 --test_limit 5000 --val_limit 1000 -o run stats_db.fasta
-
Filter — large dissimilar training set given a fixed test set
Runsledge_filterwithREMOVE_TARGET=dband strictE_VALUE/Z_SIZEso the “db” FASTA loses sequences similar to the test set (after pHMMER / MMseqs / BLAST / SW as ordered). -
Filter — remove overlap between any two FASTA sets
Point--fixed-fileand--db-fileat the two pools; chooseREMOVE_TARGETto drop hits from either side. -
Splitter / filter — out-of-distribution (OOD) evaluation
Use the splitter to hold out a test set; usesledge_filterorphmmer_filterto strip training candidates that match the test set, giving a cleaner OOD evaluation. Alternatively, prune the test set with respect to the train set to get a subset of OOD sequences. You can also deduplicate a set of sequences by removing ones that have high sequence similarity to each other within a set (this would involve filtering a set with itself). -
Strictness, order, and tool subset
- Strictness: lower
E_VALUE, appropriateZ_SIZE, andBLAST_DBSIZE/ MMseqs e-value scaling in the script. - Order: e.g.
ponly for fast homology removal;pmbsfor a long cascade. - Speed MMSeqs2 is the fastest of the tools, followed by BLAST and PHMMER (our modified implementation) in similar order of magnitude. Smith-Waterman based filtering by ssearch is the slowest part of this process. User can choose the tools and order as required.
- Subset: omit letters (e.g.
pmskips BLAST and SW).
- Strictness: lower
-
Stratification / reporting
Usephmmer_filter--formatmodes and-E/-Zto export accept/reject IDs or full hit tables; combine with downstream scripts to bin by PID or E-value from the tabular output. Our customized pHMMER filter allows reporting results in different forms. We have not modified the source code of MMSeqs2, BLAST and ssearch36.
sledge_splitter -h
phmmer_filter -hFor sledge_filter pipeline behavior and defaults, read the comments at the top of src/sledge_filter.sh.
Upstream wrpearson/fasta36 predates strict ISO C defaults and C23 keywords in recent compilers. Sledge ships a unified diff so the cloned sources compile cleanly on typical Linux and macOS toolchains.
| Item | Location / note |
|---|---|
| Patch | install/patches/fasta36-gcc-prototypes.patch |
| Applied by | Linux/macOS installers after git clone (see install/linux/install_external_linux.sh, install/macos/install_external_macos.sh) |
| System dependency | The patch command must be on PATH (install the patch package on minimal images if needed). The patch file is part of this repo; only the patch binary is external. |
Patches are cumulative: partial fixes (e.g. only mmap prototypes) are not enough. Grouped by theme:
- Headers / mmap:
altlib.h,mm_file.h,mmgetaa.c— correct prototypes and function-pointer types for library and mmap readers. - C23 / syntax:
pssm_asn_subs.c— avoid theboolkeyword clash; fullparse_pssm_*prototypes.lsim4.h—#include <stdbool.h>instead oftypedef int bool. - Tool front-ends:
map_db.c— typed function pointers forget_entry/get_ent_arr.initfa.c— ANSIsortbest, saferget_lambdaallocation.compacc2e.c— boundedcallocfor annotation tables; on UNIX,mktempreplaced withmkstempfor the temp library DB path. - FastA/FastX drops:
dropfx2.c,dropfz3.c— K&R-style forward declarations,lx_band/ckalloc,kssort,global/small_global/global_up/global_down(includingconstsequence pointers where needed),fatal.dropfs2.c,dropff2.c,dropnfa.c— ANSI prototypes for shell-sort helpers (kssort,kpsort,krsort) andsavemaxwhere applicable.
- Installers default to
FASTA36_REF=master(seeinstall/install_external.sh). The patch is generated againstmaster. If GitHub’s default branch or your chosen tag diverges, line numbers may not match—regenerate the diff from a clean checkout of that revision, or usemaster. - If upstream edits the same lines on
master, refresh the patch from a fresh clone and re-run the install build.
Nothing in the repo vendors a patch binary today. For air-gapped or minimal hosts, you could ship a static GNU patch (e.g. under install/bin/) and invoke it from the installers.
- Eddy SR (2011) Accelerated Profile HMM Searches. PLOS Computational Biology 7(10): e1002195. https://doi.org/10.1371/journal.pcbi.1002195
This project is licensed under the MIT License. See LICENSE for the full text.