This is a search engine built on the full Wikipedia corpus (~60 GB). The dataset is available at ftp://10.4.17.131/Datasets/IRE_Monsoon_2017/WikiSearch/
- Queries of fewer than 3 words: results are fetched in under 1 s
- Queries of 3 to 7 words: results are fetched in around 2-3 s
- Search.py - Main file containing all the code for Query Processing
- Driver.py - Main file which runs the code for Indexing
- Preprocess.py - File containing all functions related to XML parsing and text preprocessing.
- MultiwayMerge.py - File with functions related to k-way mergesort algorithm.
- MultiLevelIndexing.py - File containing functions related to making offset files and secondary index.
- Indexing_defaultdict.py - File which performs the actual indexing
- TermHandling.py - File with functions that split the term-term_id map into small files, sort them, and perform an external merge on them. It also builds a secondary index for accessing this map.
- Index - The initial index files are created here
- IndexMerge - The initial index files are merged here
- PrimaryIndex - The merged file is split into smaller files here
- PrimaryOffset - Offset files for these smaller files are created here
- SecondaryIndex - The secondary index file for these offset files is created here
- PageTitleMap - Files containing page_id-title map are made here
- TermIdMap - Small files of the term-term_id map are made here
- TermIdMerge - These small files are merged here and then split into many files
- TermIdMapSecondary - The secondary index for these files is stored here
- stop_words.csv - A CSV file containing all the stop words, located in the same directory as the code
- full_wiki.xml - The XML file containing the full Wikipedia data
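
The merge step that MultiwayMerge.py is described as handling can be illustrated with a small k-way merge. This is only a sketch, not the code from this repository: the function name `merge_runs` and the in-memory `(term, postings)` representation are assumptions made for illustration, and the actual on-disk index format may differ.

```python
import heapq
from itertools import groupby


def merge_runs(runs):
    """Merge several term-sorted runs into one term-sorted stream.

    Each run is an iterable of (term, postings) pairs already sorted by
    term (a hypothetical layout). Postings of a term that occurs in
    several runs are concatenated in run order.
    """
    # heapq.merge keeps only one pending item per run in memory.
    merged = heapq.merge(*runs, key=lambda pair: pair[0])
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, plist in group:
            postings.extend(plist)
        yield term, postings


# Two small sorted runs standing in for partial index files.
runs = [
    [("cricket", ["d1"]), ("sachin", ["d1", "d4"])],
    [("dhoni", ["d7"]), ("sachin", ["d9"])],
]
merged_index = list(merge_runs(runs))
```

In a real external merge each run would be a generator reading one partial index file line by line, so the whole corpus never has to fit in memory.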
Run Search.py - It runs an infinite loop that waits for queries. Three query types are supported:
- Field query - Fields are assumed to be lowercase letters (b, i, c, t, r, e) followed by a colon, with the fields separated by spaces. Example: “b:sachin i:2003 c:sports”
- Boolean query - The boolean operators are assumed to be given in capitals (AND, OR, NOT), with the remaining words separated by spaces. Example: “Sachin AND Dhoni NOT Kohli”
- Normal query - Any sequence of words that doesn’t satisfy the above conditions is treated as a normal query. Example: “Sachin Tendulkar”
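
The three query types above could be told apart roughly as follows. This is a minimal sketch, not the logic in Search.py: the function name `classify_query`, the regular expression, and the order in which the checks are applied are all assumptions.

```python
import re

# A field token is one of the letters b, i, c, t, r, e, a colon, then a term.
FIELD_TOKEN = re.compile(r"^[bictre]:\S+$")
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def classify_query(query):
    """Return 'field', 'boolean', or 'normal' for a raw query string."""
    tokens = query.split()
    # Assumption: a field query consists only of field tokens.
    if tokens and all(FIELD_TOKEN.match(t) for t in tokens):
        return "field"
    # Operators are matched case-sensitively, so "and" is a plain word.
    if any(t in BOOLEAN_OPERATORS for t in tokens):
        return "boolean"
    return "normal"


field_kind = classify_query("b:sachin i:2003 c:sports")
boolean_kind = classify_query("Sachin AND Dhoni NOT Kohli")
normal_kind = classify_query("Sachin Tendulkar")
```

With the example queries from the list, `field_kind`, `boolean_kind`, and `normal_kind` come out as `"field"`, `"boolean"`, and `"normal"` respectively.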