Like the Greek god of artisans and blacksmiths forged the weapons of gods, this software forges workloads for approximate nearest neighbor search algorithms.
Consider the classic fashion-mnist dataset (which can also be downloaded in HDF5 format from ann-benchmarks). It contains 28x28 images of apparel items, that can be compared by the Euclidean distance.
Using hephaestus.py, we can create three queries with different levels of hardness
(more on this below):
The top row of plots reports the synthesized queries, from left to right the grow harder to answer. The tiny gray images below each query report the 10 nearest neighbors of each query. You can see that harder queries have a more diversified set of answers.
We consider two different ways of measuring hardness: explicative hardness measures try to capture the geometric relationship between a query and the dataset, while empirical hardness measures capture the effort invested by an index data structure to answer a query.
For instance, the Relative Constrast is an explicative difficulty measure defined as the ratio between the average
distance of a query point with the dataset with the distance between the query and its
where
Empirical hardness measures are defined in terms of a specific index data structure (e.g. faiss's IVF or HNSW) and a
target recall value. In particular, the empirical hardness
Going back to the figure above you can see that the easiest query has a hardness of about 4, and indeed the IVF index has an empirical hardness of about 4.6%, while HNSW computes only 0.3% of the total number of distances. Moving to the hardest query of the three (relative contrast 1.09) the indices have a much harder time: IVF computes 45.1% of the distances, HNSW 12.8%.
To install hephaestus as a CLI application, simply run the following:
pip install git+https://github.com/cecca/hephaestus/
Then, you can get some information about its options by running
hephaestus --help
The program accepts HDF5 files as input (organized as in ann-benchmarks and VIBE, that is we expect to have a "train" dataset in the file), and produces HDF5 files as output. To specify the number of queries to generate and the target relative contrast for them, you should use the --queries parameter, that should be followed by a string with the format N:RC, where N is the number of queries to generate, and RC is the target relative contrast. The --queries parameter can be issued multiple times.
Therefore, the command
hephaestus \
--dataset fashion-mnist-784-euclidean.hdf5 \
--output queries.hdf5 \
--queries 10:2.7 \
--queries 10:1.5 \
-k 10
generates a queries.hdf5 file containing 20 queries: 10 having a relative contrast of about 2.7, and the other 10 with a relative contrast of about 1.5.
For all queries, the relative contrast is computed with respect to
