WoW: A Window-to-Window Incremental Index for Range-Filtering Approximate Nearest Neighbor Search, SIGMOD 2026 (Round 2)
This is the header-only library of WoW, for the benchmarking repository, please check here.
Despite for range-filtering scenarios, it can also be used as a generic predicate-filtering index.
- A modern C++ compiler that supports C++20
- CMake 3.20 or higher
- OpenMP
- Dependencies: hnswlib for fast distance computation, ygg for WBT definition.
This is a header-only library, just include the header wow/index.hh in your project.
We provide a python binding for WoW index to support a variety of attribute types, which is implemented using pybind11. Please refer to the README in the python folder for more details.
We provide a python example to build and search the WoW index. The example is in the example/ folder: example_sequential.py and example_arbitrary.py.
These examples require no dataset configuration by creating random vectors and attribute values and are good for a quick start.
We provide a list of detailed shell scripts to run the WoW index on benchmarked datasets using real-world datasets. The dataset format is described in the following section.
-
example/build_wow_bench.sh: build a WoW index taken the vector ID as the attribute. -
example/build_wow_common.sh: build a WoW index for arbitrary attributes (allow duplicated attribute values). -
example/search_wow_bench.sh: search the top-k nearest neighbors with range filter using WoW index for benchmarking, given the ranges with fractions varying from$2^{-16}-2^0$ including a mixed range fraction workload. Refer to the paper for the detailed introduction. -
example/search_wow_common.sh: search the top-k nearest neighbors with range filter using WoW index for arbitrary attributes (allow duplicated attribute values). -
example/search_wow_generic.sh: we also notice that WoW can be used as a generic predicate-filtering index for ANN search (just like ACORN). However, the performance is not guaranteed to be the best but it is still competitive (10% points to pass the filter, reach 97% recall), this will use the index generated byexample/build_wow_bench.sh.
If you want to test with your own datasets using our scripts for C++ usage, please follow the instructions below.
We provide the urls of the benchmarked vector datasets used in the paper:
- SIFT1M: SIFT1M dataset, the synthetic attribute is generated by the authors.
- GIST1M: GIST1M dataset, the synthetic attribute is generated by the authors.
- ArXiv: a recent hybrid dataset released by QDrant. It involves vectors embedded from over two million paper titles using the all-MiniLM-L6 encoder, along with release time as the attribute.
- Wikidata: a multilingual dataset released by Cohere, and the English version subset is used. Two sizes of the dataset are used in the paper: 4M and 41M (complete dataset). The attribute is the entity title, ranked by the dictionary order.
- Deep: obtained from the last fully-connected layer of the GoogleNet model with image size as the attribute. Two sizes of the dataset are used in the paper: 10M and 100M.
For more details, please refer to the paper. The vector file format is .fvecs.
- Base attribute is a .bin file with
nb * sizeof(int)bytes, wherenbis the number of vectors in the dataset. It is required byexample/build_wow_common.shandexample/search_wow_common.sh. - Query range is a .bin file with
nq * sizeof(int) * 2bytes, wherenqis the number of queries. It is required by all range search scripts. - For generic predicate-filtering script (
example/search_wow_generic.sh), no attribute file or query range file is required. It will use the vector ID as the attribute (please build the index using vector IDs) and randomly generate the bitmap for allowed vector IDs. The--npassparameter is used to specify how many IDs can pass the filter.
The search_wow_common and search_wow_bench will generate the ground truth file if it can not be found in the provided path. You can also provide your own ground truth file in .ivecs format.