Description: Calculates a score under the assumption that the input data follow a Gaussian (normal) distribution. It converts a raw score (x) into a standardized score within a specified range.
x (float): The raw score to convert.
mean (float): The mean of the distribution.
stddev (float): The standard deviation of the distribution.
min_score (int): The minimum possible score.
max_score (int): The maximum possible score.
reverse (bool, optional): If True, reverses the scoring direction.
Returns: float: A score normalized to the specified range and clipped to ensure it remains within the min_score and max_score bounds.
Standardization: Converts a raw score x into a z-score using the formula (x - mean) / stddev.
Normalization: Maps the z-score to a score within the specified range [min_score, max_score], linearly transforming z-scores between -3 and 3 onto this range.
Reversal Option: If reverse is set to True, the direction of scoring is inverted, making higher raw scores correspond to lower normalized scores.
Clipping: Ensures the final score does not exceed the boundaries set by min_score and max_score.
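The four steps above can be sketched as follows. This is a minimal, hypothetical implementation; the function name and the exact linear mapping of z-scores in [-3, 3] are taken from the description, but the real code may differ in detail.

```python
def general_scorer_gaussian_assumption(x, mean, stddev, min_score, max_score, reverse=False):
    """Map a raw value onto [min_score, max_score] via its z-score."""
    # Standardization: convert the raw value to a z-score.
    z = (x - mean) / stddev
    if reverse:
        # Reversal: higher raw values now yield lower scores.
        z = -z
    # Normalization: map z in [-3, 3] linearly onto [min_score, max_score].
    score = min_score + (z + 3) / 6 * (max_score - min_score)
    # Clipping: keep the result inside the allowed range.
    return max(min_score, min(max_score, score))
```

For example, a raw score equal to the mean lands exactly at the midpoint of the range, and values more than three standard deviations out are clipped to the boundary.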
Text Processing: Uses the spacy library to parse the given text into a document object, which organizes the text into tokens and sentence structures.
Sentence Extraction: Extracts sentences from the document object and counts them.
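A minimal sketch of this step is shown below. Note one assumption: a blank spacy pipeline with the rule-based sentencizer is used here so the snippet runs without downloading a model; the original likely loads a full statistical model (e.g. en_core_web_sm) instead.

```python
import spacy

# Blank English pipeline with the rule-based sentencizer (an assumption:
# the real code probably uses spacy.load() with a trained model).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def count_sentences(text):
    """Parse the text into a spacy Doc and count its sentences."""
    doc = nlp(text)
    return len(list(doc.sents))
```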
Description: This function expands English contractions into their full form, which can be helpful for various natural language processing tasks that benefit from standardized text formats.
- The function uses regular expressions to replace common English contractions.
- Specific contractions handled include replacements for "won't" to "will not" and "can't" to "can not".
- More general patterns cover other common contractions like "n't" (not), "'re" (are), "'s" (is), etc.
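The notes above can be sketched as a small regex-based function. The "won't" / "can't" special cases and the "n't", "'re", "'s" patterns come from the notes; the remaining patterns ("'d", "'ll", "'ve", "'m") are assumed under the "etc." and follow the same substitution style.

```python
import re

def decontracted(phrase):
    """Expand common English contractions into their full forms."""
    # Specific cases first, since they don't follow the general patterns.
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    # General patterns for the remaining common contractions.
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase
```

Ordering matters: the specific "won't" and "can't" rules must run before the general "n't" rule, which would otherwise produce "wo not" and "ca not".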
Description: Evaluates the text based on the number of sentences, using a scoring system based on a Gaussian distribution of known sentence counts.
sentence_counts (list): Historical data of sentence counts.
min_score (int): Minimum score to assign.
max_score (int): Maximum score to assign.
Initial Check: Directly returns min_score if the sentence count is 10 or less.
Data Filtering: Filters out sentence counts from historical data that are 10 or less (non-informative data).
Statistical Analysis: Calculates the mean and standard deviation of the filtered historical sentence counts.
Scoring: Applies the general_scorer_gaussian_assumption to the counted sentences, scoring them based on how they compare statistically to historical data.
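These steps can be sketched as below. The function name score_sentence_count is hypothetical, and a minimal inline version of the Gaussian scorer described earlier is included so the sketch is self-contained.

```python
import statistics

def general_scorer_gaussian_assumption(x, mean, stddev, min_score, max_score, reverse=False):
    # Minimal inline version of the scorer described earlier.
    z = (x - mean) / stddev
    if reverse:
        z = -z
    score = min_score + (z + 3) / 6 * (max_score - min_score)
    return max(min_score, min(max_score, score))

def score_sentence_count(sentence_count, sentence_counts, min_score, max_score):
    """Score a sentence count against historical sentence counts."""
    # Initial check: very short texts get the minimum score outright.
    if sentence_count <= 10:
        return min_score
    # Data filtering: drop non-informative historical counts (<= 10).
    filtered = [c for c in sentence_counts if c > 10]
    # Statistical analysis: mean and stddev of the filtered history.
    mean = statistics.mean(filtered)
    stddev = statistics.stdev(filtered)
    # Scoring: compare the observed count to the historical distribution.
    return general_scorer_gaussian_assumption(
        sentence_count, mean, stddev, min_score, max_score
    )
```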
Description: This function checks the spelling of words in a text and calculates the percentage of misspelled words. It first expands contractions using the decontracted function, then tokenizes the text, lemmatizes each word, and finally checks for spelling errors.
- The text is first decontracted to normalize contractions.
- nltk.word_tokenize is used for tokenizing the string into words.
- Each word is lemmatized using nltk.stem.WordNetLemmatizer to reduce it to its base or dictionary form.
- spellchecker.SpellChecker is utilized to identify words not recognized by its dictionary.
- The function calculates the percentage of words identified as misspelled compared to the total number of words.
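The pipeline above can be sketched as follows. Two substitutions are made so the sketch runs without model downloads: a regex tokenizer stands in for nltk.word_tokenize (which requires the punkt data), and a plain set of known words stands in for WordNetLemmatizer plus spellchecker.SpellChecker; the function and parameter names are hypothetical.

```python
import re

def decontracted(phrase):
    # Minimal decontraction stand-in, per the decontracted notes above.
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    return phrase

def misspelled_percentage(text, known_words):
    """Return the percentage of words not found in known_words."""
    # Normalize contractions first, as in the original pipeline.
    text = decontracted(text)
    # Regex tokenizer stands in for nltk.word_tokenize here.
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return 0.0
    # Membership in known_words stands in for lemmatization plus
    # spellchecker.SpellChecker's dictionary lookup.
    misspelled = [w for w in words if w not in known_words]
    return 100.0 * len(misspelled) / len(words)
```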
Description: Evaluates the text based on the percentage of spelling mistakes, using a scoring system based on a Gaussian distribution of known spelling mistake rates.
mistakes_list (list): Historical data of spelling mistakes percentages.
min_score (int): Minimum score to assign.
max_score (int): Maximum score to assign.
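A compact sketch of this scorer is shown below; the function name is hypothetical, a minimal inline version of the Gaussian scorer described earlier is included for self-containment, and reverse=True is an assumption (a higher mistake rate should presumably map to a lower score).

```python
import statistics

def general_scorer_gaussian_assumption(x, mean, stddev, min_score, max_score, reverse=False):
    # Minimal inline version of the scorer described earlier.
    z = (x - mean) / stddev
    if reverse:
        z = -z
    score = min_score + (z + 3) / 6 * (max_score - min_score)
    return max(min_score, min(max_score, score))

def score_spelling_mistakes(mistake_pct, mistakes_list, min_score, max_score):
    """Score a spelling-mistake percentage against historical rates."""
    mean = statistics.mean(mistakes_list)
    stddev = statistics.stdev(mistakes_list)
    # reverse=True is an assumption: more mistakes -> lower score.
    return general_scorer_gaussian_assumption(
        mistake_pct, mean, stddev, min_score, max_score, reverse=True
    )
```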