fix(vector): Fix similarity-based HNSW search for cosine and dot product metrics #9559
Conversation
The previous implementation assumed lower values were better (distance semantics), which broke similarity-based search (cosine, dot product).
@joelamming As someone who's crawled around in this tokenizer, would you be willing to take a look at this PR for correctness/completeness?
@matthewmcneely -- this is a great find. I ran some side-by-side synthetic self-retrieval tests with very tight
Cosine/dot product are effectively inverted in both candidate exploration and construction pruning. This looks like a construction correctness bug, so cosine/dot product deployments will need to rebuild/reindex their HNSW indexes to see the benefit. I've also spotted a couple of small possible completeness/correctness issues around the pruning heap usage/tests. Going to validate a minimal revision tomorrow and will come back with a diff + results if it holds up.
…9563)

**Description**

This PR is a small follow-up to @matthewmcneely's `matthewmcneely/fix-similarity-based-hnsw-indexing` that tightens the construction-time pruning path in `tok/hnsw/helper.go:addNeighbors`.

- Makes construction pruning explicitly maintain the best-to-worst top-k neighbour list capped at `efConstruction` (rather than relying on heap/slice side-effects)
- Avoids duplicate/self edges during construction updates
- Aligns the pruning unit test with the same helper used by production to reduce test/production drift

This is intended to be easy to cherry-pick into the [PR branch](#9559).

### Why

The directionality fix is the key part (and @matthewmcneely's [PR](#9559) nails that), but the existing pruning loop still had a couple of structure-level issues (heap initialisation/slice truncation semantics) that could make construction behaviour hard to reason about -- especially under similarity metrics.

### Testing

- `go test ./tok/hnsw -count=1`
- Synthetic recall harness (25k vectors) shows massively improved recall vs. `main` (per my [comment](#9559 (comment))) and stable miss-rate and expected build/query behaviour vs. #9559

**Checklist**

- [x] The PR title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/#summary) syntax, leading with `fix:`, `feat:`, `chore:`, `ci:`, etc.
- [x] Code compiles correctly and linting (via trunk) passes locally
- [x] Tests added for new functionality, or regression tests for bug fixes added as applicable
@joelamming Thanks for reviewing, and for that PR. My benchmarking shows not only equal or improved recall, but also that your rework of the heap has significantly increased insert speed. Good work! Here's the output of the cosine metric for
Other metrics are performing on par (just elided here to reduce noise). A note about search and insert speed for v25.1: the "excellent" performance there is masking the fact that early/incorrect termination was allowing both to run fast -- of course the indexes were crap and search results were usually poor (as evidenced by the recall stats). Also note that the benchmark suite is available at https://github.com/dgraph-io/dgraph-benchmarks/tree/main/vector/beir
minPersistentHeapElement -> persistentHeapElement
Summary
This PR fixes a critical bug in HNSW vector search where cosine similarity and dot product metrics returned incorrect results. The search algorithm was treating all metrics as distance metrics (lower is better), causing similarity metrics (higher is better) to return the worst matches instead of the best.
Problem
The HNSW implementation had two issues with similarity-based metrics:
Search phase: The candidate heap in persistent_hnsw.go::searchPersistentLayer always used a min-heap, which pops the lowest value first. For similarity metrics where higher values are better, this caused the algorithm to explore the worst candidates first and terminate prematurely.
Edge pruning phase: The helper.go::addNeighbors function used a fixed comparison (>) when pruning edges, which is correct for distance metrics but inverted for similarity metrics. This resulted in keeping the worst edges instead of the best.
Root Cause
The original code assumed all metrics behave like distance metrics:
For Euclidean distance, lower values = better matches → min-heap is correct.
For Cosine/DotProduct similarity, higher values = better matches → need max-heap.
Solution
1. Added candidateHeap interface with metric-aware heap selection
2. Added isSimilarityMetric flag to SimilarityType
3. Fixed edge pruning comparison in addNeighbors
Files Changed
Testing
Added new tests covering:
Performance Note
This fix builds on PR #9514 which corrected the early termination condition. Together, these changes ensure HNSW search explores the correct number of candidates and returns properly ordered results.
Users experiencing slower insert/search times compared to v25.1.0 can tune performance by lowering the efConstruction and efSearch parameters when creating their vector indexes.
Lower values trade recall for speed. The default values (efConstruction=128, efSearch=64) prioritize recall.
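As a rough sketch of what such tuning might look like in a DQL schema: the `efConstruction`/`efSearch` option names are taken from this PR's description, and the predicate name and exact option syntax here are assumptions that may differ by Dgraph version:

```
# Hypothetical schema sketch -- lower ef values trade recall for speed
product_embedding: float32vector @index(hnsw(metric: "cosine", efConstruction: "64", efSearch: "32")) .
```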
GenAI Notice
Parts of this implementation and all of the testing were generated using Claude Opus 4.5 (thinking).
Checklist
The PR title follows the Conventional Commits syntax, leading with `fix:`, `feat:`, `chore:`, `ci:`, etc.

Fixes #9558
Benchmarks
Our BEIR SciFact Information Retrieval Benchmarks now show recall rates close to or exceeding the "acceptable" and "excellent" thresholds for all metrics.