A simple and fast TF-IDF-based text search engine written in Go. It supports tokenization, log-scaled term frequency and inverse document frequency weighting, query vector construction, and cosine similarity ranking.
- Presentation slides: https://docs.google.com/presentation/d/1ZmHTDNNzgtjNR6vbSmzhrVjvs5qJ-yWKpv9TrPyPbcE/edit?usp=sharing
- Tokenizes and indexes a set of short documents
- Computes smoothed log TF-IDF vectors
- Supports vectorized cosine similarity for ranking
- Returns top-k most relevant documents for a query
tfidf/
├── go.mod // Module definition
├── main.go // Entry point
├── model/ // Shared types (Document, Vector, Score)
├── pipeline/ // Tokenizer, TF-IDF logic, search engine
└── data/ // Used to extract documents from an example corpus
git clone https://github.com/JacobMcKenzieSmarty/tfidf.git
cd tfidf
go run main.go

docs := []model.Document{
{0, "apple orange banana"},
{1, "banana apple"},
{2, "computer science and data"},
}
query := "banana apple"

Output:
Rank 1: Doc 1 (score: 0.9765)
Rank 2: Doc 0 (score: 0.6123)
Rank 3: Doc 2 (score: 0.0000)
No external libraries — pure Go!
MIT License — feel free to use, modify, and contribute!
PRs welcome! Open issues or feature requests freely.