feat: add lexical baseline models for ranking tasks #36
Addresses #35
Description
This PR adds lexical baseline models to WorkRB for establishing performance bounds on ranking tasks. These baselines complement the existing neural embedding models (BiEncoderModel, JobBERTModel, etc.) by providing lower-bound reference points and enabling future two-stage retrieval pipelines with candidate generation followed by neural re-ranking.
Four models are introduced, all inheriting from ModelInterface and implementing the standard ranking/classification interface. The models accept but ignore ModelInputType parameters, as lexical methods are input-type agnostic. Classification is handled by delegating to ranking, following the same pattern as BiEncoderModel.
The implementations are adapted from the MELO Benchmark repository.
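The interface pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: `InputType`, `RandomRankingSketch`, and the `rank`/`classify` method names and signatures are hypothetical stand-ins for ModelInputType, RandomRankingModel, and the ModelInterface methods, whose real signatures are not shown here. Returning the arg-max label from `classify` is also an assumption.

```python
import random
from enum import Enum


class InputType(Enum):
    """Hypothetical stand-in for WorkRB's ModelInputType."""

    JOB_TITLE = "job_title"
    SKILL = "skill"


class RandomRankingSketch:
    """Illustrative random baseline; not the actual RandomRankingModel."""

    def __init__(self, seed=None):
        # A private RNG keeps results reproducible when a seed is given.
        self._rng = random.Random(seed)

    def rank(self, queries, targets, query_type=None, target_type=None):
        # Input types are accepted for interface compatibility but never used.
        return [[self._rng.random() for _ in targets] for _ in queries]

    def classify(self, texts, labels, text_type=None, label_type=None):
        # Classification delegates to ranking: every label is treated as a
        # ranking target, and the top-scoring label is returned (assumption).
        scores = self.rank(texts, labels, text_type, label_type)
        return [
            labels[max(range(len(labels)), key=row.__getitem__)]
            for row in scores
        ]


model = RandomRankingSketch(seed=42)
score_matrix = model.rank(["nurse", "cook"], ["patient care", "food prep"])
predictions = model.classify(["nurse", "cook"], ["healthcare", "hospitality"])
```

Because the type arguments are ignored, two instances created with the same seed return identical scores whether or not input types are passed.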
Changes:
- BM25Model: BM25 Okapi probabilistic ranking using the rank-bm25 library
- TfIdfModel: TF-IDF with cosine similarity, supporting word-level or character n-gram tokenization
- EditDistanceModel: Levenshtein ratio for string similarity using the rapidfuzz library
- RandomRankingModel: Random score generation for sanity checking, with an optional seed for reproducibility
- Added rank-bm25 and rapidfuzz dependencies to pyproject.toml
- src/workrb/models/__init__.py
Checklist
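As a rough illustration of the scoring behind the BM25 baseline listed under Changes, the sketch below ranks a few candidate strings against a query using only the standard library. It is not the PR's BM25Model and does not call rank-bm25. The function name `bm25_scores`, the whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))` are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` (Okapi BM25 variant)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption, see lead-in).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = ["senior data engineer", "registered nurse", "data analyst"]
scores = bm25_scores("data engineer", corpus)
# Indices of corpus entries, best match first.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Here the document that matches both query terms ranks first, the one matching only the common term "data" ranks second, and the document with no overlap scores zero.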