fix(retrieval): make proj: tag queries exact (BM25-only + mustContain)#56
1034378361 wants to merge 1 commit into win4r:main
rwmjhb left a comment
Thanks for the PR! The approach is sound — routing tag-style queries to BM25 + mustContain is a clean solution for the false-positive problem described in #55.
I verified locally that LanceDB FTS treats : as a token separator (not as a Tantivy field separator), so the BM25 stage correctly returns candidates and mustContain filters them down. The core logic works.
A few suggestions before merging:
1. Missing test coverage
+81 lines of new logic with no tests. At minimum, a unit test for extractTagTokens() and an integration-level test verifying that a proj:AIF query returns only entries literally containing that tag would be helpful.
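As a starting point, a unit test along these lines might cover the tag extraction. This is only a sketch: the body of extractTagTokens() below is a hypothetical stand-in built from the pattern listed in the PR description, and expectEqual is an illustrative helper, not part of the project.

```typescript
// Hypothetical stand-in for the PR's extractTagTokens(); the real signature
// and behavior in src/retriever.ts may differ.
const TAG_PATTERN = /proj:[A-Za-z0-9._-]+/g;

function extractTagTokens(query: string): string[] {
  // With the g flag, match() returns every match, or null when there is none.
  return query.match(TAG_PATTERN) ?? [];
}

// Minimal assertion helper so the sketch runs without a test framework.
function expectEqual(actual: string[], expected: string[], label: string): void {
  const ok =
    actual.length === expected.length &&
    actual.every((value, i) => value === expected[i]);
  if (!ok) {
    throw new Error(
      `${label}: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`,
    );
  }
}

expectEqual(extractTagTokens("proj:AIF"), ["proj:AIF"], "single tag");
expectEqual(
  extractTagTokens("notes for proj:AIF and proj:web-2.0"),
  ["proj:AIF", "proj:web-2.0"],
  "multiple tags",
);
expectEqual(extractTagTokens("project AIF status"), [], "no tag");
expectEqual(extractTagTokens("proj: AIF"), [], "prefix without a value");
```

The same cases should translate directly to whatever test runner the repository already uses.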
2. as RetrievalResult type safety
    const mapped = literalFiltered.map(
      (result, index) =>
        ({
          ...result,
          sources: {
            bm25: { score: result.score, rank: index + 1 },
            fused: { score: result.score },
          },
        }) as RetrievalResult,
    );

sources.vector is absent here. If any downstream code accesses sources.vector.score, it will throw. Worth checking that no consumer expects it, or add vector: undefined explicitly.
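One possible way to tighten this is to annotate the callback's return type instead of casting, so the compiler checks the object literal. The sketch below assumes a RetrievalResult shape in which sources.vector is optional; the interfaces and the toBm25Results name are hypothetical, and the real types in src/retriever.ts may differ.

```typescript
// Hypothetical shapes for illustration only.
interface SourceScore {
  score: number;
  rank?: number;
}

interface RetrievalResult {
  text: string;
  score: number;
  sources: {
    vector?: SourceScore; // assumed optional so BM25-only results are valid
    bm25?: SourceScore;
    fused: SourceScore;
  };
}

type Candidate = Omit<RetrievalResult, "sources">;

function toBm25Results(literalFiltered: Candidate[]): RetrievalResult[] {
  // The explicit return type makes the compiler verify the literal, whereas
  // `as RetrievalResult` would accept an object missing a required field.
  return literalFiltered.map((result, index): RetrievalResult => ({
    ...result,
    sources: {
      bm25: { score: result.score, rank: index + 1 },
      fused: { score: result.score },
    },
  }));
}

const mapped = toBm25Results([{ text: "proj:AIF kickoff notes", score: 0.9 }]);

// Optional chaining keeps consumers safe when the vector stage was skipped.
const vectorScore = mapped[0]?.sources.vector?.score ?? null;
```

If sources.vector turns out to be required in the real type, this version fails to compile, which surfaces the mismatch instead of hiding it behind the cast.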
3. (Minor) Extensibility of tag patterns
Currently hardcoded to proj: only. Not a blocker, but consider making the tag prefix configurable (e.g., via retrieval config) or at least using an array of patterns, so adding env:, team:, etc. later doesn't require code changes.
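A rough sketch of the array-of-prefixes idea, assuming a hypothetical tagPrefixes setting (no such option exists in the retrieval config today) and reusing the character class from the PR's proj: pattern:

```typescript
// Hypothetical config value; tagPrefixes is not an existing option.
const tagPrefixes: string[] = ["proj", "env", "team"];

// Escape regex metacharacters so a prefix is always matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one pattern that matches any configured prefix followed by a value.
function buildTagPattern(prefixes: string[]): RegExp {
  const alternation = prefixes.map(escapeRegExp).join("|");
  return new RegExp(`(?:${alternation}):[A-Za-z0-9._-]+`, "g");
}

function extractTags(query: string, prefixes: string[]): string[] {
  // An empty prefix list would otherwise yield a pattern matching ":value".
  if (prefixes.length === 0) return [];
  return query.match(buildTagPattern(prefixes)) ?? [];
}

const tags = extractTags("deploy notes env:staging proj:AIF", tagPrefixes);
```

With this shape, supporting another tag family means adding a string to the list rather than editing the matching logic.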
Overall: approve once tests are added. Nice work! 👍
What
When a query contains tag-style tokens like proj:AIF, treat it as an exact filter instead of a semantic query: retrieve with BM25 only and keep just the results that literally contain the proj: tokens (mustContain).
Why
Short tokens are prone to semantic false positives in hybrid retrieval because vector search dominates the candidate pool. For project tags, users usually expect exact matching, not related-content recall.
Scope / Compatibility
Only queries containing proj: tokens are affected.
Implementation
src/retriever.ts:
- tag-token detection (proj:[A-Za-z0-9._-]+)
- bm25OnlyRetrieval() helper with mustContain filter
- retrieve() routes tag queries to the helper
Related issue
Fixes: #55
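For readers skimming the Implementation notes, the mustContain step could look roughly like the sketch below. It is not the PR's code: the Candidate shape, the mustContain function name, and the sample data are hypothetical, and the BM25 stage is replaced by a hard-coded candidate list.

```typescript
// Hypothetical candidate shape; the real type in src/retriever.ts may differ.
interface Candidate {
  id: string;
  text: string;
  score: number;
}

// Keep only candidates whose text literally contains every required token.
// Matching is case-sensitive here; whether the PR folds case is not stated.
function mustContain(candidates: Candidate[], tokens: string[]): Candidate[] {
  return candidates.filter((candidate) =>
    tokens.every((token) => candidate.text.includes(token)),
  );
}

// Stand-in for the BM25 stage, which can return entries that only share
// a sub-token such as "AIF" with the query.
const bm25Candidates: Candidate[] = [
  { id: "a", text: "proj:AIF sprint planning", score: 2.1 },
  { id: "b", text: "AIF overview, no tag here", score: 1.7 },
  { id: "c", text: "proj:OTHER retro notes", score: 1.2 },
];

const exact = mustContain(bm25Candidates, ["proj:AIF"]);
```

Only the first candidate survives the filter, which is the behavior the proj:AIF example in this description calls for.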