- Kaggle Setup and Data Download
  Kaggle credentials are configured, and the competition files (documents, queries, qrels) are downloaded via the Kaggle CLI.
- Data Loading and Inspection
  All downloaded JSONL/JSON files are read into Python lists or dictionaries, and basic stats (document and query counts) are displayed.
- Evaluation Metric (P-Found)
  A function `pfound_score` is defined to measure retrieval performance, demonstrating how to compute a score based on ranked predictions.
- Text Preprocessing
  Titles plus a portion of content are tokenized, stemmed, and filtered for stopwords. A dictionary of document frequencies (`df`) is built, and very low-frequency tokens are removed.
- TF-IDF Construction
  A sparse TF matrix is created for each document, IDF values are computed, and both are combined to form the final TF-IDF matrix.
- Query Processing and Similarity Computation
  Queries undergo the same tokenization and stemming. Their term frequencies are assembled into a sparse matrix, and cosine-like similarity scores are calculated by multiplying document TF-IDF by query term frequencies.
- Submission File Creation
  (Query, document) pairs and their computed scores are gathered into a dataframe and saved as a CSV file for submission.
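
The P-Found step can be illustrated with a minimal sketch. It follows the commonly cited cascade formulation of pFound, in which the user reads results top-down and stops either on finding a relevant document or by giving up with a fixed probability. The signature, the break probability of 0.15, the cutoff of 10, and the assumption that relevance labels are already mapped to probabilities in [0, 1] are all illustrative choices; the repository's own `pfound_score` may differ.

```python
def pfound_score(relevances, p_break=0.15, k=10):
    """Compute pFound for one ranked list (illustrative sketch).

    relevances: relevance probabilities in [0, 1], in ranked order.
    p_break:    assumed probability of abandoning the list at each step.
    k:          assumed cutoff; only the top-k results are scored.
    """
    p_look = 1.0   # probability that the user looks at the current position
    score = 0.0
    for p_rel in relevances[:k]:
        score += p_look * p_rel
        # The user continues only if this document did not satisfy them
        # and they did not give up.
        p_look *= (1.0 - p_rel) * (1.0 - p_break)
    return score
```

Under these assumptions, a relevant document in first place contributes its full relevance, while the same document in second place is discounted by the chance that the user stopped earlier.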
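
The preprocessing step can be sketched with the standard library alone. Stemming is left out to keep the example dependency-free, and the stopword list, the `content_chars` limit, the `min_df` threshold, and the function names are hypothetical stand-ins for whatever the project actually uses.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not the project's list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def preprocess(docs, content_chars=200, min_df=2):
    """Tokenize (title, content) pairs and prune rare tokens.

    Returns the filtered token lists and the document-frequency dict
    restricted to the surviving vocabulary.
    """
    # Use the title plus only the first `content_chars` characters of content.
    tokenized = [tokenize(title + " " + content[:content_chars])
                 for title, content in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each token once per document
    vocab = {t for t, n in df.items() if n >= min_df}
    filtered = [[t for t in tokens if t in vocab] for tokens in tokenized]
    return filtered, {t: df[t] for t in vocab}
```

Pruning tokens that appear in very few documents shrinks the vocabulary, which keeps the later term matrices small.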
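
The TF-IDF and similarity steps can be sketched in the same dependency-free style. Plain dictionaries stand in for the sparse matrices here, the IDF formula `log(N / df)` is an assumption (the project may use a smoothed variant), and `build_tfidf` and `score_query` are hypothetical names.

```python
import math
from collections import Counter

def build_tfidf(doc_tokens):
    """Return per-document {term: tf * idf} dicts and the idf dict."""
    n_docs = len(doc_tokens)
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))  # document frequency: one count per document
    # Assumed IDF variant: unsmoothed log(N / df).
    idf = {t: math.log(n_docs / n) for t, n in df.items()}
    tfidf = [{t: c * idf[t] for t, c in Counter(tokens).items()}
             for tokens in doc_tokens]
    return tfidf, idf

def score_query(query_tokens, tfidf):
    """Dot product of each document's TF-IDF row with the query's raw TF.

    No length normalization is applied, so the score is only cosine-like.
    """
    q_tf = Counter(query_tokens)
    return [sum(w * q_tf[t] for t, w in row.items() if t in q_tf)
            for row in tfidf]

# Usage: rank three toy documents for the query "dog".
docs = [["cat", "sat"], ["cat", "dog"], ["dog", "dog", "bird"]]
tfidf, idf = build_tfidf(docs)
scores = score_query(["dog"], tfidf)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

With real data, the dictionaries would typically be replaced by sparse matrices so that all query-document scores come from a single matrix product.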
polinak1r/Document-Ranking-Information-Retrieval