
fix(retrieval): make proj: tag queries exact (BM25-only + mustContain)#56

Open
1034378361 wants to merge 1 commit into win4r:main from 1034378361:fix/tag-query-fts-only

@1034378361
Contributor

What

When a query contains tag-style tokens like proj:AIF, treat it as an exact filter instead of a semantic query:

  • Use BM25 (FTS-only) first
  • Hard filter results: mustContain all extracted proj: tokens
  • If no literal matches, fall back to existing hybrid retrieval (to avoid returning nothing)

Why

Short tokens are prone to semantic false positives in hybrid retrieval because vector search dominates the candidate pool. For project tags, users usually expect exact matching, not related-content recall.

Scope / Compatibility

  • Only affects queries containing proj: tokens
  • Normal natural-language queries keep the original hybrid path

Implementation

src/retriever.ts:

  • Added tag token extraction (proj:[A-Za-z0-9._-]+)
  • Added bm25OnlyRetrieval() helper with mustContain filter
  • Hooked in early in retrieve()
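As a rough illustration of the flow listed above, here is a minimal sketch; it is not the code from this PR. The regex is the one given in the description. The `Hit` shape, `mustContainAll`, `retrieveSketch`, and the injected search callbacks are hypothetical stand-ins, and case-sensitive substring matching is an assumption.

```typescript
// Hypothetical minimal hit shape; the real RetrievalResult carries more fields.
interface Hit {
  text: string;
  score: number;
}

// Pattern from the PR description.
const TAG_PATTERN = /proj:[A-Za-z0-9._-]+/g;

// Collect the distinct proj: tokens that appear in the query.
function extractTagTokens(query: string): string[] {
  return Array.from(new Set(query.match(TAG_PATTERN) ?? []));
}

// Keep only hits whose text literally contains every token.
// Assumption: case-sensitive substring match, so "proj:AIF" also matches "proj:AIF2".
function mustContainAll(hits: Hit[], tokens: string[]): Hit[] {
  return hits.filter((hit) => tokens.every((token) => hit.text.includes(token)));
}

type Search = (query: string) => Hit[];

// Tag queries go through BM25 plus the literal filter;
// everything else, and tag queries with no literal match, use the hybrid path.
function retrieveSketch(query: string, bm25Search: Search, hybridSearch: Search): Hit[] {
  const tokens = extractTagTokens(query);
  if (tokens.length === 0) return hybridSearch(query);
  const exact = mustContainAll(bm25Search(query), tokens);
  return exact.length > 0 ? exact : hybridSearch(query);
}
```

Passing the two search functions as parameters keeps the sketch independent of the storage layer; in the PR the equivalent calls live inside retrieve().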

Related issue

Fixes: #55

Collaborator

@rwmjhb rwmjhb left a comment

Thanks for the PR! The approach is sound — routing tag-style queries to BM25 + mustContain is a clean solution for the false-positive problem described in #55.

I verified locally that LanceDB FTS treats : as a token delimiter (not a Tantivy field separator), so the BM25 stage correctly returns candidates and mustContain filters them down. The core logic works.

A few suggestions before merging:

1. Missing test coverage

+81 lines of new logic with no tests. At minimum, a unit test for extractTagTokens() and an integration-level test verifying that a proj:AIF query returns only entries literally containing that tag would be helpful.

2. as RetrievalResult type safety

const mapped = literalFiltered.map(
  (result, index) =>
    ({
      ...result,
      sources: {
        bm25: { score: result.score, rank: index + 1 },
        fused: { score: result.score },
      },
    }) as RetrievalResult,
);

sources.vector is absent here. If any downstream code accesses sources.vector.score, it will throw. Worth checking that no consumer expects it, or add vector: undefined explicitly.
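One possible way to drop the cast, sketched under assumptions: the interfaces below are hypothetical and may not match the real RetrievalResult in src/retriever.ts. If vector is declared optional, the compiler checks the mapped object without `as`, and consumers read the score through optional chaining.

```typescript
// Hypothetical shapes; the real types in src/retriever.ts may differ.
interface SourceScore {
  score: number;
  rank?: number;
}

interface RetrievalSources {
  vector?: SourceScore; // optional: absent on the BM25-only path
  bm25?: SourceScore;
  fused: SourceScore;
}

interface RetrievalResult {
  text: string;
  score: number;
  sources: RetrievalSources;
}

// No `as RetrievalResult` cast: the return type is checked structurally.
function toBm25Only(results: { text: string; score: number }[]): RetrievalResult[] {
  return results.map((result, index) => ({
    ...result,
    sources: {
      bm25: { score: result.score, rank: index + 1 },
      fused: { score: result.score },
    },
  }));
}

// Consumers read the vector score defensively instead of assuming it exists.
function vectorScore(result: RetrievalResult): number | undefined {
  return result.sources.vector?.score;
}
```

With this shape, any consumer that dereferences sources.vector without a check fails at compile time under strict null checks rather than throwing at runtime.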

3. (Minor) Extensibility of tag patterns

Currently hardcoded to proj: only. Not a blocker, but consider making the tag prefix configurable (e.g., via retrieval config) or at least using an array of patterns, so adding env:, team:, etc. later doesn't require code changes.
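A rough sketch of what an array-of-prefixes version could look like; `buildTagPattern`, `extractTags`, and the `prefixes` parameter are hypothetical names, not part of this PR or of any existing retrieval config. The value character class is the one from the PR description.

```typescript
// Escape regex metacharacters so a prefix is always matched literally.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one pattern covering every configured prefix, e.g. (?:proj|env):value.
function buildTagPattern(prefixes: string[]): RegExp {
  const alternation = prefixes.map(escapeRegExp).join("|");
  return new RegExp(`(?:${alternation}):[A-Za-z0-9._-]+`, "g");
}

// Hypothetical replacement for a proj:-only extractor; defaults to the current behavior.
function extractTags(query: string, prefixes: string[] = ["proj"]): string[] {
  if (prefixes.length === 0) return []; // an empty alternation would match bare ":value"
  return Array.from(new Set(query.match(buildTagPattern(prefixes)) ?? []));
}
```

For example, `extractTags("proj:AIF env:prod team:core", ["proj", "env"])` picks up the proj: and env: tokens and ignores team:core, so supporting a new prefix becomes a configuration change.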

Overall: approve once tests are added. Nice work! 👍

Development

Successfully merging this pull request may close these issues.

Reduce false positives for tag-style queries like proj:AIF (BM25-only + mustContain)

2 participants