WizardEye is a Python tool that filters aligned reads based on the risk of ambiguous alignment to a reference genome.
WizardEye is currently in beta and under active development. Future updates will bring new features, improve stability, and ensure robustness through comprehensive unit testing.
To achieve this, WizardEye first identifies all positions in your reference genome that could be targeted by reads from known ambiguous sources using your alignment parameters. For example, it can filter out reads that map to regions conserved between your reference genome and potentially contaminating organisms.
The tool both creates a database of reference genomes and their associated risky sources and filters BAM files using that information. By directly using bwa, WizardEye empirically identifies risky reads based on your alignment parameters.
This tool is directly adapted from the script generate_cross_mappability_filter_bwa.sh.
WizardEye first splits the potential contaminant source into -k-mers with a sliding window of -w. It then aligns each produced unique k-mer using bwa aln with your parameters on the target reference to highlight ambiguous regions of that sequence. As this step can be computationally intensive for a complete genome, it stores the computed cross-mappability track in a database.
You can then use these cross-mappability tracks to filter your alignment. For example, if you are studying the evolution of Hominidae and align your reads to the human genome, you can use WizardEye to remove reads that could also come from other mammalian sources, such as hyena or deer, using several cross-mappability tracks generated from non-Hominidae mammalian genomes.
Since WizardEye relies only on alignment and prior knowledge, it is recommended for use when you have an idea of the potential contamination types in your sample. The more exhaustive the database, the more effective (and restrictive) the filtering.
Note that WizardEye only identifies ambiguous regions that can align. It cannot filter sequences that are very distant from the target. Therefore, it is highly recommended as a complementary step to traditional large-scale filtering. Finally, note that WizardEye filtering can be very strict, leading to the loss of substantial information, as it prioritizes risk avoidance over the statistical power of post-hoc analyses.
The latest WizardEye version can be installed using conda:
conda install -c bioconda wizardeye
You can also install WizardEye by cloning this repository and running the following commands from the main folder:
python3 -m venv .wizardeye
source .wizardeye/bin/activate
python -m pip install -U pip
python -m pip install -e .
The following command-line tools must be installed and available in your PATH:
bwa— for read alignment and reference indexingsamtools— for SAM/BAM file manipulation (conversion, sorting, indexing, concatenation)seqkit— for k-mer generation, deduplication, and FASTA chunkingbedtools— for interval operations (genomecov, merge)bedGraphToBigWig— UCSC tool for bedGraph to BigWig conversion (package:ucsc-bedgraphtobigwig)bigWigToBedGraph— UCSC tool for BigWig to bedGraph conversion (package:ucsc-bigwigtobedgraph)parallel— GNU parallel for parallel job executionawk,cat,sort— standard Unix utilities
The following Python packages are automatically installed via pip or conda:
typer>=0.9— for building the command-line interfacePyYAML>=6.0— for YAML configuration file handlingpysam>=0.22— for BAM file parsing and interval extractionpyBigWig>=0.3.25— for BigWig file parsing and k-mer overlap computationnumpy— for numerical operations
To start, create a WizardEye database using the following command:
wizardeye database init -d /path/to/database
This generates a /database/ directory compatible with WizardEye. You can move it elsewhere or rename the folder after initialization.
The database is a directory containing subdirectories for targets and processed tracks.
database/
├── hg19/ # reference
│ ├── sus_scrofa_k35_bwa8651fd65/ # a track aligned to the reference
│ │ ├── map_all.bw # all overlapping k-mers
│ │ ├── map_uniq.bw # unique overlapping k-mers
│ │ └── info.yaml # information about the track
...
│ ├── ref.md5 # reference md5
│ └── ref.yaml # information about the reference
└── info.yaml # information about the database
Every subdirectory contains two BigWig files that can be opened in a traditional genome browser. The map_all.bw file, for every position in the reference genome, represents the number of overlapping k-mers from the risky sequences, while map_uniq.bw contains only unique k-mers (i.e., k-mers overlapping only that genomic region). These two files enable stringency computation.
You can update tags for an existing track using the database command by providing the full track-defining parameters and replacement tags:
wizardeye database update track-tags -d /path/to/database \
-r hg19 -q bos_taurus -k 35 -w 1 -bn 0.01 -bo 2 -bl 16500 -bN false -bj 1 -bR 30 -bsn 2000000000 \
--tags Mammalia,Ruminantia
Note that masks generated during export are cached in the database to speed up subsequent exports. You can avoid this by specifying --no-cache before export. You can delete the cache using the following command:
wizardeye database clean -d /path/to/database
Version 0.1.3 introduced new track naming. To migrate an existing database from an older version to 0.1.3, use the following command:
wizardeye database migrate -d /path/to/database --from 0.1.2 --to 0.1.3 -bR 30 -bsn 2000000000
You can compute ambiguous regions of reference.fa that can be targeted by reads from risky.fa using specific BWA parameters:
wizardeye align -i /path/to/risky.fa -r /path/to/reference.fa -d /path/to/database \
-k 35 -w 1 \
-bn 0.01 -bo 2 -bl 16500
You can provide tags to describe risky.fa (e.g., phylogeny and/or environment: -t Mammalia,Carnivora,Felis,Cave). This enables direct BAM file filtering based on specific tags. You can manually set a track identifier with --track_ID to store it in the database metadata. Note that to prevent misuse, a new track cannot be added if the reference alignment file differs, including in sequence names. This is enforced by an MD5-based check to prevent silently corrupted analyses.
As WizardEye uses seqkit split2 to distribute alignment computation across threads, several parameters can speed this up:
| Parameters | Default | Definition |
|---|---|---|
--tmp_dir |
TMPDIR |
Temporary directory for processing chunks and alignments |
--chunk_size |
2000000 |
Number of sequences per chunk |
--jobs |
1 |
Number of threads for generating chunks and running parallel bwa instances |
--bwa_threads |
1 |
Number of threads per bwa instance |
Note
With default parameters (-n 0.01 -o 2 -l 16500 -k 35 -w 1), a cross-mappability track based on hg19 coordinates is generated from a complete mammalian genome in ~48 hours using 128 threads (--chunk_jobs 128 --bwa_threads 1). Such computation requires ~200GB of free storage in the temporary directory due to the large number of k-mers generated before alignment.
Several bwa parameters control alignment exhaustiveness and, consequently, track exhaustiveness.
| Parameters | Default | Definition |
|---|---|---|
-bR |
30 |
Number of hits to reach before stopping suboptimal hit search |
-bsn |
2000000000 |
Limit of alternative alignments in the SAM file |
-bN |
False |
Should all alternative alignments be identified? |
The -R parameter is the most important. It defines the number of hits to reach before stopping the search for suboptimal hits. Hits are processed in score batches; the search only stops if this threshold is reached AND all hits with the same score have been processed. For example, with -R=30 (the default for bwa and WizardEye), in a scenario with 50 best hits and 100 suboptimal hits, all 50 best hits are included in the final cross-mappability track, while the others are not.
Importantly, in classic bwa aln usage, the suboptimal search is also limited by the mismatch difference between the best hit and suboptimal hits. Thus, even with a large -R, the search identifies best hits and suboptimal hits with up to 1 mismatch compared to the best ones, regardless of -n.
If you want to avoid this behavior and be truly exhaustive in your cross-mappability track, you can specify the -bN parameter, which is passed as -N to bwa aln. This parameter forces BWA to save all alternative alignments, regardless of their quality and mismatches (within the -n (bwa_missing_prob_err_rate) limit). This parameter allows your cross-mappability track to be exhaustive at the expense of higher coverage. It can be used for extreme filtering but reduces the ability to retain informative reads. Be careful with this option: for complete genomes with repetitions and allowed mismatches, it can lead to biased representation due to bwa aln's memory limit (alternative mappings are truncated). See issue #3 for more details.
Finally, the bwa samse -n parameter (accessible via -bsn in WizardEye) limits the number of alternative alignments retained in the SAM file. If there are more alternative alignments than this limit, bwa removes them all and sets MAPQ to 0. This value is set to a large number in WizardEye to avoid losing information and should not be modified.
Warning
Due to BWA's behavior, coverage and intervals are not perfectly deterministic across systems and alignment chunk sizes. In fact, BWA randomly selects the primary alignment among best hits using a random number generator and deletes secondary alignments starting at the exact same position. This can lead to tiny coverage differences and, in some intervals, a 1 bp difference at the end position. However, track generation is consistent and deterministic if WizardEye is run on the same system with the same alignment chunk, regardless of the number of threads used. See issue #1 for more details.
Once you have a representative database, you can filter reads from your alignment:
wizardeye filter -i alignment.bam -r hg19 --exclude-tags Cave -k 35 -w 1 -bn 0.01 -bo 2 -bl 16500 -d /path/to/database
This filters out all reads that could be ambiguously aligned to your target if they come from a source with a tag in --exclude-tags.
For filter, -r must be a reference identifier already in your WizardEye database (e.g., hg19), and -d/--db-root is mandatory.
By default, WizardEye considers all reads that can align to the reference. However, to only consider reads aligning to a single position (MAPQ>0, no XA tag), specify --only-unique.
You can also filter out reads based on specific tracks:
wizardeye filter -i alignment.bam -r hg19 --exclude-tracks myotis_alcathoe,ursus_arctos -k 35 -w 1 -bn 0.01 -bo 2 -bl 16500 -d /path/to/database
Warning
If sequence naming differs between BAM and tracks (e.g., chr1 vs 1), filtering stops with an explicit error. Harmonize contig names beforehand.
To balance sensitivity and specificity, you can specify a stringency value during filtering. This criterion is defined as follows.
For a position P in the target, there are exactly -k/-w different possible k-mers overlapping this position perfectly. After mapping, n k-mers can overlap the position (maximum -k/-w if no mismatches are allowed, otherwise more). The position is then highlighted as ambiguous if n/(-k/-w) >= -rc; i.e., if the proportion of overlapping k-mers relative to the total possible k-mers exceeds the stringency.
If --only-unique is used, the same behavior and formula apply, but only uniquely aligned k-mers are considered (i.e., k-mers without BWA's XA tag and with MAPQ>0). For example, with one mismatch allowed, -k=40, and -rc=0.25, if a position is overlapped by 10 exact k-mers and 10 k-mers with one mismatch, the position is retained only if 10 of these k-mers are unique, regardless of exactness. With -rc=0.50, the region is highlighted only if all 20 k-mers map uniquely to it.
wizardeye filter -i alignment.bam -r hg19 --exclude-tracks myotis_alcathoe,ursus_arctos -k 35 -w 1 -bn 0.01 -bo 2 -bl 16500 -p 0.25 -d /path/to/database --only-unique
Note that the ratio is not weighted by depth or mismatches. In another case with a repetitive region, if a position is overlapped by 3000 k-mers, the ratio remains the same.
This behavior aims to reproduce Heng Li's seqbility logic, which is not directly usable in a cross-mappability context.
Using a frequency threshold, a read is filtered out only if a position is at least overlapped by minimum -mf tracks. This threshold is computed after the stringency. Then, a read is excluded if one of its position is covered by at least -mf tracks that exceeds the stringency threshold at that same position. You can apply such threshold using:
wizardeye filter -i alignment.bam -r hg19 --exclude-tracks myotis_alcathoe,ursus_arctos -mf 3 -d /path/to/database
Note that the -mf parameter is also available in the export command to generate masks with frequency filtering applied. This will remove lines with less than -mf from the final bed file.
WizardEye produces a tabulation-separated report containing, for each read, the decision and overlapping tracks, as follows:
| read_id | filtered_out | associated_tracks |
|---|---|---|
| read_1 | true | sus_scrofa,bos_taurus |
| read_2 | false | |
| read_3 | true | felis_catus |
| read_4 | true | bos_taurus |
| read_5 | true | bos_taurus,ovis_aries |
| read_6 | false |
With --export-bam, WizardEye produces two additional files:
excluded.bam: reads excluded by the filtration,filtered.bam: reads retained by the filtration.
If you prefer to use your own filter, you can export a per-read report summarizing the number of k-mers overlapping the reference per read interval using the count command.
This command accepts the same parameters as filter, plus a --mode parameter to specify the statistical summary type (sum, max, min, cov, mean, or std). Statistics are computed per interval. For example, if max is specified, for each track, the maximum number of overlapping k-mers from that track at any position in the read's interval is reported.
wizardeye count -i alignment.bam -r hg19 --exclude-tags Farm -k 35 -w 1 -bn 0.01 -bo 2 -bl 16500 -d /path/to/database -m max
In this example, tracks not reported by filter with -r=0.01 have a 0 in their column, meaning the maximum number of overlapping k-mers in the interval is 0.
WizardEye produces a tabulation-separated report containing, for each read, the requested statistics:
| read_id | max_sus_scrofa | max_bos_taurus | max_ovis_aries |
|---|---|---|---|
| read_1 | 0 | 4524 | 0 |
| read_2 | 422 | 0 | 0 |
| read_3 | 151 | 1254 | 1515 |
| read_4 | 0 | 0 | 0 |
| read_5 | 0 | 0 | 311 |
| read_6 | 122 | 0 | 111 |
If you plan to use the same configuration frequently (e.g., in a pipeline), you can export a mask to avoid recomputing it continuously:
wizardeye export -r hg19 --exclude-tags Cave -k 35 -w 1 -bn 0.01 -bo 2 -bl 16500 -d /path/to/database -o mask.bed
If you computed a track outside WizardEye, you can import it manually by providing both BigWig files and the parameters used to generate them:
wizardeye import -d /path/to/database -r ref -i input -k 35 -w 20 \
--map-all-bw /path/to/map_all.bw \
--map-uniq-bw /path/to/map_uniq.bw \
--reference-fasta /path/to/reference.fa \
--input-fasta /path/to/input.fa \
-bn 0.01 -bo 2 -bl 16500 -j 8 \
-t Mammalia,Carnivora
This command creates the target/track directory, copies the two BigWig files as map_all.bw and map_uniq.bw, and writes a param.yaml file with the provided generation metadata.
It is recommended to complement your filtering with an evolutionarily-aware method such as Kraken2. This combination is useful for removing both reads from completely different organisms and reads that may be ambiguous between closely related organisms.
Last documentation update: 0.1.3.
