Hi @rsharris
this is not actually an issue, but a request to add a new feature to HowDeSBT if possible.
I’m using your software as a dependency of MetaSBT where I’m building SBTs starting from genomes, microbial genomes in particular.
It would be very useful if HowDeSBT could provide a way to compute the Average Nucleotide Identity (ANI) measure between two bloom filters with the bfdistance subcommand.
It would be very useful if it could also compute the ANI between filters while running a query. Let’s say that I have an SBT built with genomes. Now, I have a new genome (its bloom filter representation) and I want to establish which is the closest one in the tree according to the ANI distance.
Computing the ANI usually means performing alignment, but we could use the following formula as a very good estimation of ANI based on the number of active bits in the union and intersection between two bloom filters:
1 - (1 + (1/kmer_size) * log((2*jaccard_index) / (1+jaccard_index)))
where kmer_size is the size of the k-mers used to build the bloom filters, and jaccard_index is the number of active bits in the intersection of the two bloom filters divided by the number of active bits in their union (intersection/union).
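To make the request concrete, here is a minimal Python sketch of the formula above, evaluated exactly as written. The function name `ani_distance` and its parameters are hypothetical and not part of HowDeSBT. It assumes `log` is the natural logarithm, and it assumes that a pair of filters with no shared active bits should be reported at the maximum distance of 1; both are my assumptions. Note that the expression simplifies to `-(1/kmer_size) * log(2*j / (1+j))`, so it is 0 for identical filters and grows as they diverge, i.e. it behaves as a distance, and the corresponding identity would be 1 minus this value.

```python
import math


def ani_distance(intersection_bits: int, union_bits: int, kmer_size: int) -> float:
    """Evaluate the formula from this issue exactly as written.

    intersection_bits: active bits in the AND of the two bloom filters
    union_bits:        active bits in the OR of the two bloom filters
    kmer_size:         k-mer size used to build both filters
    """
    if kmer_size <= 0:
        raise ValueError("kmer_size must be positive")
    if union_bits <= 0:
        raise ValueError("the union must contain at least one active bit")

    jaccard_index = intersection_bits / union_bits
    if jaccard_index == 0:
        # Assumption: log(0) is undefined, so report the maximum distance.
        return 1.0

    # Assumption: natural logarithm.
    return 1 - (1 + (1 / kmer_size) * math.log((2 * jaccard_index) / (1 + jaccard_index)))
```

For example, `ani_distance(50, 100, 21)` uses a Jaccard index of 0.5 with 21-mers and gives a value of roughly 0.019.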
I’m currently computing this measure within MetaSBT by running the bfdistance subcommand twice with --show:intersect and --show:union. It works as expected, but of course it’s very inefficient. It would be much faster if HowDeSBT could provide this feature natively.
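To illustrate why a native implementation could be faster, both counts can be obtained from a single read of the two bit vectors. The sketch below is only a toy illustration on raw, uncompressed bit vectors of equal length held in memory. It does not read HowDeSBT's filter files, and the function name `intersect_union_counts` is hypothetical.

```python
def intersect_union_counts(bits_a: bytes, bits_b: bytes) -> tuple[int, int]:
    """Return (active bits in A AND B, active bits in A OR B).

    Assumption: both inputs are raw bit vectors of the same length,
    not HowDeSBT's on-disk bloom filter format.
    """
    if len(bits_a) != len(bits_b):
        raise ValueError("bit vectors must have the same length")

    # Treat each byte string as one large integer so that the bitwise
    # AND and OR are applied to the whole vector at once.
    a = int.from_bytes(bits_a, "big")
    b = int.from_bytes(bits_b, "big")

    intersection = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return intersection, union
```

With both counts available from one load of the filters, the Jaccard index and the formula above need only a few extra arithmetic operations per pair.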
Let me know if it makes sense and if there is any chance we could see this feature implemented in the near future.
Thanks!