@federetyk
Contributor

Addresses #35

Description

This PR adds lexical baseline models to WorkRB for establishing performance bounds on ranking tasks. These baselines complement the existing neural embedding models (BiEncoderModel, JobBERTModel, etc.): they provide lower-bound reference points and enable future two-stage retrieval pipelines, where a cheap lexical stage generates candidates that a neural model then re-ranks.

Four models are introduced, all inheriting from ModelInterface and implementing the standard ranking/classification interface. The models accept but ignore ModelInputType parameters, as lexical methods are input-type agnostic. Classification is handled by delegating to ranking, following the same pattern as BiEncoderModel.
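The delegation pattern can be sketched as follows. Note this is only an illustration: the real ModelInterface lives in WorkRB, so the method names and signatures below are assumptions.

```python
# Hypothetical sketch of the ranking/classification interface described
# above. The actual ModelInterface in WorkRB may differ; names here are
# illustrative assumptions, not the PR's code.
from abc import ABC, abstractmethod


class ModelInterface(ABC):
    @abstractmethod
    def rank(self, queries: list[str], corpus: list[str]) -> list[list[float]]:
        """Return one score per (query, corpus item) pair."""

    def classify(self, queries: list[str], labels: list[str]) -> list[int]:
        # Classification delegates to ranking: each query is assigned
        # the index of its top-scoring label.
        scores = self.rank(queries, labels)
        return [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Because classify is derived entirely from rank, each lexical model only has to implement a single scoring method, matching the pattern already used by BiEncoderModel.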

The implementations are adapted from the MELO Benchmark repository.

Changes:

  • Add BM25Model: BM25 Okapi probabilistic ranking using rank-bm25 library
  • Add TfIdfModel: TF-IDF with cosine similarity, supporting word-level or character n-gram tokenization
  • Add EditDistanceModel: Levenshtein ratio for string similarity using rapidfuzz library
  • Add RandomRankingModel: Random score generation for sanity checking, with optional seed for reproducibility
  • Add shared preprocessing: Unicode normalization (NFKD) and configurable lowercasing across all models
  • Add rank-bm25 and rapidfuzz dependencies to pyproject.toml
  • Add unit tests covering initialization and ranking computation
  • Export new models in src/workrb/models/__init__.py
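The shared preprocessing step could look roughly like this; the function name and defaults are illustrative, not necessarily the PR's exact API:

```python
# Illustrative sketch of the shared preprocessing described above:
# NFKD Unicode normalization with configurable lowercasing.
import unicodedata


def preprocess(text: str, lowercase: bool = True) -> str:
    # NFKD decomposes characters such as "é" into a base letter plus
    # a combining accent, giving tokenizers a consistent form.
    text = unicodedata.normalize("NFKD", text)
    if lowercase:
        text = text.lower()
    return text
```

Applying the same normalization to both queries and corpus documents before any of the four models is what keeps lexical matching consistent across accented and differently-cased job titles.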

Checklist

  • Added new tests for new functionality
  • Tested locally with example tasks
  • Code follows project style guidelines
  • Documentation updated
  • No new warnings introduced
