mindmeld.tokenizer module¶
This module contains the tokenizer.
-
class mindmeld.tokenizer.Tokenizer(app_path=None, exclude_from_norm=None)[source]¶
Bases: object
The Tokenizer class encapsulates all the functionality for normalizing and tokenizing a given piece of text.
-
static create_tokenizer(app_path=None)[source]¶
Creates the tokenizer for the app.
Parameters: app_path (str, optional) -- MindMeld application path
Returns: a tokenizer
Return type: Tokenizer
-
fold_char_to_ascii(char)[source]¶
Return the ASCII character corresponding to the folding token.
Parameters: char -- ASCII folding token
Returns: an ASCII character
Return type: char
-
fold_str_to_ascii(text)[source]¶
Return the ASCII string corresponding to the folding token string.
Parameters: text (str) -- ASCII folding token string
Returns: an ASCII string
Return type: str
-
get_char_index_map(raw_text, normalized_text)[source]¶
Generates a character index mapping from the normalized query to the raw query. The entity model always operates on the normalized query during NLP processing, but entity output indexes must refer to the raw query.
The mapping is generated by computing the edit distance between the two strings and backtracking to recover the proper alignment.
Parameters: - raw_text (str) -- the raw query text
- normalized_text (str) -- the normalized query text
Returns: A mapping of character indexes from the normalized query to the raw query.
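The edit-distance-and-backtracking approach described above can be sketched as follows. This is a minimal illustration, not the MindMeld implementation: it builds a standard Levenshtein table between the raw and normalized strings, then walks back from the bottom-right corner, recording a raw index for every normalized index consumed on a diagonal step.

```python
def char_index_map_sketch(raw, norm):
    """Map each character index in `norm` to an index in `raw`
    via Levenshtein alignment with backtracking."""
    m, n = len(raw), len(norm)
    # dp[i][j] = edit distance between raw[:i] and norm[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if raw[i - 1].lower() == norm[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete from raw
                           dp[i][j - 1] + 1,          # insert into raw
                           dp[i - 1][j - 1] + cost)   # match / substitute
    # Backtrack to recover the alignment.
    mapping = {}
    i, j = m, n
    while i > 0 and j > 0:
        cost = 0 if raw[i - 1].lower() == norm[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + cost:
            mapping[j - 1] = i - 1   # normalized index -> raw index
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                   # raw character was removed
        else:
            j -= 1
    return mapping

mapping = char_index_map_sketch("Héllo,  World", "hello world")
```

Here the characters dropped during normalization (the comma and the extra space) simply have no image in the mapping, while every surviving normalized character points back at its position in the raw text.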
-
multiple_replace(text, compiled)[source]¶
Takes text and a compiled regex pattern and performs a multi-pattern replacement via lookup.
Parameters: - text (str) -- The text to perform matching on
- compiled -- A compiled regex object that can be used for matching
Returns: The text with replacements specified by self.replace_lookup
Return type: str
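The technique is a common one: combine all replacement keys into a single alternation pattern, then substitute each match through a lookup table in one pass. A hedged sketch (here the lookup table is passed explicitly, whereas the Tokenizer stores it as the self.replace_lookup attribute mentioned above):

```python
import re

def multiple_replace_sketch(text, compiled, lookup):
    """Replace every match of `compiled` with its entry in `lookup`."""
    return compiled.sub(lambda m: lookup[m.group(0)], text)

lookup = {"&": " and ", "%": " percent "}
# Build one alternation pattern from the escaped lookup keys.
pattern = re.compile("|".join(re.escape(k) for k in lookup))
result = multiple_replace_sketch("R&B at 100%", pattern, lookup)
# result == "R and B at 100 percent "
```

Doing all replacements with one compiled pattern avoids re-scanning the text once per key and guarantees each character is replaced at most once.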
-
normalize(text, keep_special_chars=True)[source]¶
Normalize a given text string and return the string with each token normalized.
Parameters: - text (str) -- the text to normalize
- keep_special_chars (bool, optional) -- whether to preserve special characters used in annotations
Returns: the original text string with each token in normalized form
Return type: str
-
tokenize(text, keep_special_chars=True)[source]¶
Tokenizes the input text, normalizes the token text, and returns normalized tokens.
Currently normalization does the following:
1. removes leading special characters, except the dollar sign and ampersand
2. removes trailing special characters, except the ampersand
3. removes special characters, except the ampersand, when the preceding character is a letter and the following character is a number
4. removes special characters, except the ampersand, when the preceding character is a number and the following character is a letter
5. removes special characters, except the ampersand, when both the preceding and following characters are letters
6. removes special characters, except the ampersand, when the following character is '|'
7. removes diacritics and replaces them with equivalent ASCII characters when possible
Note that the tokenizer also excludes from removal a list of special characters used in annotations when the flag keep_special_chars is set to True.
Parameters: - text (str) -- the text to tokenize
- keep_special_chars (bool, optional) -- whether to preserve special characters used in annotations
Returns: A list of normalized tokens
Return type: list
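The rules above can be illustrated with a much simplified sketch that implements only rules 1 and 2 (the real Tokenizer covers all seven cases plus the keep_special_chars handling):

```python
import re

def simple_tokenize(text):
    """Split on whitespace, lowercase, and strip edge punctuation
    per rules 1 and 2 above (a sketch, not the full Tokenizer)."""
    tokens = []
    for raw in text.lower().split():
        # Rule 1: strip leading special characters except '$' and '&'.
        tok = re.sub(r"^[^\w$&]+", "", raw)
        # Rule 2: strip trailing special characters except '&'.
        tok = re.sub(r"[^\w&]+$", "", tok)
        if tok:
            tokens.append(tok)
    return tokens

simple_tokenize("Hello, $5 lattes!!")  # -> ['hello', '$5', 'lattes']
```

Note how '$5' survives intact under rule 1 while the trailing punctuation of 'hello,' and 'lattes!!' is removed under rule 2.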
-
static