Educational project С++17
Read in other languages: English, Русский
This is a search engine for lost animals. To do this, you need a search query and the documents that are being searched. Let's say someone found a cat on the street and posted an ad about it. So a document appeared in the system "A white fluffy cat with a blue collar was found in the Perovo district in Moscow phone for communication 89222334455". The owner of the cat sent a request "White cat blue collar" and was able to return the pet.In general, the search engine can work with any queries.
To build this project on linux you need:
1)If you don't have Cmake installed, install Cmake
2)If the "Debug" or "Release" folders are not created:
mkdir Debug
mkdir Release
3)Run the command for Debug and Release conf:
cmake -E chdir Debug/ cmake -G "Unix Makefiles" ../ -DCMAKE_BUILD_TYPE:STRING=Debug
cmake -E chdir Release/ cmake -G "Unix Makefiles" ../ -DCMAKE_BUILD_TYPE:STRING=Release
4)Build command:
cmake --build Debug/.
cmake --build Release/.
5)To Run program - go to the debug (cd Debug/) or release (cd Release/) folder and run:
./search_server
ALL in one command(Release):
mkdir Release; cmake -E chdir Release/ cmake -G "Unix Makefiles" ../ -DCMAKE_BUILD_TYPE:STRING=Release && cmake --build Release/.
- Installing and configuring all required components in the development environment to run the application
- The use case is shown in main.cpp and test_example_functions.h
Methods of the SearchServer class:
-
The constructor accepts a string of stop-words, for example:
"in at and"s
The stop-word in the query is not taken into account when searching. -
Adding documents to the search engine.
void AddDocument(int document_id, string_view document,DocumentStatus status, const vector<int> &ratings);
document - string as: "funny pet and nasty -rat"s
where "funny pet nasty" - the words that will be searched for
"and" - the stop word specified in the SearchServer constructor
"-rat" - mius-word
Mnius-words exclude documents containing such words from the search results.
Possible DocumentStatus: ACTUAL, IRRELEVANT, BANNED, REMOVED
ratings - Each document at the input has a set of user ratings.
The first digit is the number of ratings
For example:{4 5 -12 2 1};
-
Document search in the search server and ranking by TF-IDF
There are 6 ways to call the function 3 multithreaded (ExecutionPolicy) and 3 single-threaded
FindTopDocuments (ExecutionPolicy,query)
FindTopDocuments (ExecutionPolicy,query,DocumentStatus)
FindTopDocuments (ExecutionPolicy,query,DocumentPredicate)
FindTopDocuments (query)
FindTopDocuments (query,DocumentStatus)
FindTopDocuments (query,DocumentPredicate)
Returns vector matching by query
The usefulness of words is evaluated by the concept of inverse document frequency or IDF.
This value is a property of a word, not a document.
The more documents have a word in them, the lower the IDF.
Above, place documents where the search word occurs more than once.
Here you need to calculate the term frequency or TF.
For a specific word and a specific document, this is the share that this word occupies among all. -
GetDocumentCount()- returns the number of documents in the search server -
begin и end- They will return iterators. The iterator will give access to the id of all documents stored in the search server. -
tuple<std::vector<std::string_view>, DocumentStatus> MatchDocument(raw_query, document_id)
Returns:The first object is a vector of query words that were found in the document_id document,
and the second object is the status of the document -
Method of obtaining word frequencies by document id:
map<string, double> GetWordFrequencies(document_id) -
Deleting a document from the search server
RemoveDocument(document_id)
RemoveDocument(ExecutionPolicy,document_id)
RemoveDocument(ExecutionPolicy, document_id)
-
Output to the standard input/output stream information about the document
PrintDocument(document)
PrintMatchDocumentResult(document_id,vector<std::string_view> &words, DocumentStatus) -
Класс
LogDuration- allows profiling -
Paginate()- Allows you to split the results into pages -
RemoveDuplicates(SearchServer)- allows you to remove duplicate documents -
As well as other functions that provide input data processing.
- C++17(STL)
- GCC (MinG w64) 11.2.0
- Add UI
- Add the ability to read documents from files
- Add support for JSON requests/responses
-
The project shows knowledge of the basic principles of C++ programming:
a. Numbers, strings, symbols, data input/output in console, conditions, loops.
b. The use of basic algorithms .
c. Using structures, classes, lambda functions, creating tuples
d. Parsign of lines with output to the screen
String parsing with saving in the container vector
Allocation of string parsing in a function for further reuse
(Working with functions , Arguments, Return result
Query Parsing - Links, Constancy, Deep Copying
Stop word processing - container set
Adding documents to the search engine, Document Search,
Calculating the relevance of found documents - containers: pair, map
e. Use of Templates and Specialization of templates
f. Creating and using macros
g. Overloading operators
h. Handling exceptions
i. Application of the optional class
g. Iterators
k. Recursion
l. Stack, Dec
m. Working with standard input/output streams
Working with file streams
n. Static, Automatic and Dynamic placement of objects in memory
o. Parallel work using the bible library
p. Scan algorithms
q. Asynchronous calculations using the library
r. Race state protection: mutex, lock_guard, atomic-types -
Calculation of term frequency and inverse document frequency
-
Unit testing
-
Decomposition and debugging
-
Creating multi-file projects
-
Profiling
-
Evaluation of the complexity of the program
-
MapReduce is a concept in which the algorithm is divided into two stages:
independent filtering and transformation of individual elements (map or transform);
grouping of the results of the previous stage (reduce).
When there are too many requests, it takes time to process them. Requests waiting to be processed can simply "sit" in the queue and wait for the service handler to get to them.
To store only the necessary queries. For example, you want to know what queries users have sent to the search server. But only relevant requests for the last day are important. That is, you are interested in the time of sending. It is not required to store requests older than a day.
One request from the user comes every minute. That is, the maximum number of requests to be stored is the number of minutes per day (1440). The appearance of 1441 requests will mean that the day has passed, the first request of the previous day is no longer interesting to us and can be deleted. To implement such a mechanism, it is convenient to use deque. A new query will be easily inserted at the end, and an outdated query will be deleted from the beginning.
Iterators can be used far from the most obvious cases. Imagine that a search server contains hundreds of thousands or even millions of documents. Thousands of them fit the user's request. In this case, displaying all the requests on the screen at once would not be a good idea. You need to break the results into pages.
Iterators will come in handy. Suppose we have a container. Then one page is a certain range of a certain size from this container. The first element is included in the page, and the last one is not. But the last one will be the first element on the next page. That is, we can get a container with the results, and then create a vector of ranges based on it, where the range will be just a pair of iterators. The first iterator will point to the beginning of the page, and the second one will point to its end.
The Paginator class can be responsible for the separation by pages.
The FindTopDocuments method can accept a more universal document filter, a predicate function, instead of the status. It, in turn, accepts the document id, status and rating and returns bool. Filtering of documents should be done before cutting off the top of five pieces. Calling Find Top Documents without the second argument should search only for current documents.
Search engines face a problem — mirrors. These are copies of the site. Their number on the Internet can reach tens or hundreds. To ensure that the first pages of the search results do not consist of copies of the same site, you need to develop a deduplicator. It deletes copies from the search server. Duplicates are documents that have the same sets of words that occur. Frequency matching is optional. Word order is unimportant, and stop words are ignored. The function uses only the search server methods available at that moment.
If duplicate documents are detected, the function should delete a document with a large id from the search server, and at the same time report the id of the deleted document in accordance with the output format.
The ProcessQueries function parallelizes the processing of multiple queries to the search engine.
The ProcessQueriesJoined function, like the ProcessQueries function, parallelizes the processing of several queries to the search engine, but returns a set of documents in a flat form.
Often it is really enough to parallelize the processing of several requests — and the thing is in the hat. But there is another situation: one request is processed for too long, and its implementation needs to be parallelized.
A multithreaded version of the RemoveDocument method in addition to the single-threaded one.
A multithreaded version of the MatchDocument method in addition to the single-threaded one.
A multithreaded version of the FindTopDocuments method in addition to the single-threaded one.
The constructors of the SearchServer class throw an invalid_argument exception if any of the passed stop words contains invalid characters, that is, characters with codes from 0 to 31.
The addDocument method throws an invalid_argument exception in the following situations:
An attempt to add a document with a negative id;
Attempt to add a document with the id of a previously added document;
The presence of invalid characters (with codes from 0 to 31) in the text of the document being added.
FindDocument methods throws an invalid_argument exception in the following situations:
There are invalid characters with codes from 0 to 31 in the words of the search query;
The presence of more than one minus sign before words that should not be in the required documents, for example,
fluffy --cat. In the middle of the words, cons are resolved, for example: ivan-tea.
The absence of text after the "minus" symbol: in the search query: "fluffy -".
The MatchDocument method throws an invalid_argument exception in the same situations as the FindDocument method.
The GetDocumentId method throws an out_of_range exception if the index of the transmitted document is outside the allowed range (0; number of documents).
The begin and end methods are defined. They will return iterators. The iterator will give access to the id of all documents stored in the search server.