mindmeld.tokenizer module

This module contains the tokenizer.

class mindmeld.tokenizer.Tokenizer(app_path=None, exclude_from_norm=None)[source]

Bases: object

The Tokenizer class encapsulates all the functionality for normalizing and tokenizing a given piece of text.

static create_tokenizer(app_path=None)[source]

Creates the tokenizer for the app.

Parameters:app_path (str, optional) -- MindMeld Application Path
Returns:a tokenizer
Return type:Tokenizer
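
A minimal usage sketch; with no app_path, the tokenizer is created with default settings:

    from mindmeld.tokenizer import Tokenizer

    # Create a tokenizer with default settings; passing an app_path
    # would instead pick up any app-specific tokenizer configuration.
    tokenizer = Tokenizer.create_tokenizer()
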
fold_char_to_ascii(char)[source]

Return the ASCII character corresponding to the folding token.

Parameters:char (str) -- The ASCII folding token (a single character)
Returns:an ASCII character
Return type:str
fold_str_to_ascii(text)[source]

Return the ASCII string corresponding to the folding token string.

Parameters:text (str) -- ASCII folding token string
Returns:an ASCII string
Return type:str
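
A sketch of both folding helpers; the exact outputs depend on the loaded folding table, so the values in the comments are indicative only:

    from mindmeld.tokenizer import Tokenizer

    tokenizer = Tokenizer.create_tokenizer()

    # Fold accented characters in a string to ASCII equivalents.
    print(tokenizer.fold_str_to_ascii("café naïve"))  # e.g. "cafe naive"

    # Fold a single accented character.
    print(tokenizer.fold_char_to_ascii("é"))          # e.g. "e"
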
get_char_index_map(raw_text, normalized_text)[source]

Generates a character index mapping from the normalized query to the raw query. The entity model always operates on the normalized query during NLP processing, but entity output must report indexes relative to the raw query.

The mapping is generated by computing the edit distance between the two texts and backtracking through it to recover the proper alignment.

Parameters:
  • raw_text (str) -- Raw query text.
  • normalized_text (str) -- Normalized query text.
Returns:

A mapping of character indexes from normalized query to raw query.

Return type:

dict
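
A usage sketch based on the signature and return type above; the example strings are illustrative:

    from mindmeld.tokenizer import Tokenizer

    tokenizer = Tokenizer.create_tokenizer()

    raw = "Play  Hey Jude!"
    normalized = tokenizer.normalize(raw)

    # Map character indexes in the normalized query back to the raw
    # query, e.g. to report entity spans against the original input.
    index_map = tokenizer.get_char_index_map(raw, normalized)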

static load_ascii_folding_table()[source]

Load the mapping of Unicode code points to their ASCII equivalents.
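
For intuition, the effect of such a table can be approximated with the standard library; this is a generic sketch of diacritic folding, not MindMeld's actual folding table:

    import unicodedata

    def fold_to_ascii(text):
        # Decompose each character (NFKD) into a base letter plus
        # combining marks, then drop everything outside ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    print(fold_to_ascii("Crème brûlée"))  # Creme brulee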

multiple_replace(text, compiled)[source]

Takes text and a compiled regex pattern, and replaces each match via a lookup into self.replace_lookup.

Parameters:
  • text (str) -- The text to perform matching on
  • compiled -- A compiled regex object that can be used for matching
Returns:

The text with each match replaced as specified by self.replace_lookup

Return type:

str
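
The underlying technique is standard: compile one alternation over all the lookup keys and replace each match through a dictionary. The lookup table below is illustrative, standing in for the actual self.replace_lookup:

    import re

    # Hypothetical lookup table standing in for self.replace_lookup.
    replace_lookup = {"&": " and ", "%": " percent"}

    # One pattern matching any key; re.escape guards regex metacharacters.
    compiled = re.compile("|".join(re.escape(key) for key in replace_lookup))

    def multiple_replace(text, compiled):
        # Swap every match for its entry in the lookup table.
        return compiled.sub(lambda match: replace_lookup[match.group(0)], text)

    print(multiple_replace("tom & jerry, 100%", compiled))
    # tom  and  jerry, 100 percent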

normalize(text, keep_special_chars=True)[source]

Normalize a given text string and return the string with each token normalized.

Parameters:
  • text (str) -- The text to normalize
  • keep_special_chars (bool) -- If True, a list of special characters used in annotations is exempted from removal (i.e., preserved in the output)
Returns:

the original text string with each token in normalized form

Return type:

str
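
A usage sketch; the output in the comment reflects typical normalization (lowercasing, punctuation stripping, ASCII folding) and may vary with configuration:

    from mindmeld.tokenizer import Tokenizer

    tokenizer = Tokenizer.create_tokenizer()

    print(tokenizer.normalize("Hello, World!"))  # e.g. "hello world"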

tokenize(text, keep_special_chars=True)[source]

Tokenizes the input text, normalizes the token text, and returns normalized tokens.

Currently it does the following during normalization:

  1. Remove leading special characters, except the dollar sign and ampersand
  2. Remove trailing special characters, except the ampersand
  3. Remove special characters, except the ampersand, when the preceding character is a letter and the following character is a number
  4. Remove special characters, except the ampersand, when the preceding character is a number and the following character is a letter
  5. Remove special characters, except the ampersand, when both the preceding and following characters are letters
  6. Remove special characters, except the ampersand, when the following character is '|'
  7. Remove diacritics and replace them with equivalent ASCII characters when possible

Note that when the keep_special_chars flag is set to True, the tokenizer also exempts a list of special characters used in annotations from these rules.

Parameters:
  • text (str) -- The text to normalize
  • keep_special_chars (bool) -- If True, a list of special characters used in annotations is exempted from removal (i.e., preserved in the output)
Returns:

A list of normalized tokens

Return type:

list
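
A usage sketch of the rules above; the exact token structure isn't specified here, so the result is simply printed:

    from mindmeld.tokenizer import Tokenizer

    tokenizer = Tokenizer.create_tokenizer()

    # A leading '$' and a letter-joining '&' are expected to survive,
    # while other leading/trailing special characters are stripped.
    tokens = tokenizer.tokenize("Pay $5 at A&B cafe!!")
    print(tokens)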

static tokenize_raw(text)[source]

Identify tokens in the text and create tokens that contain the token text and its start index.

Parameters:text (str) -- The text to tokenize
Returns:A list of tokens
Return type:list
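
A minimal sketch; the exact shape of each token is an assumption based on the description above (token text plus start index):

    from mindmeld.tokenizer import Tokenizer

    # tokenize_raw is static, so no instance is required.
    tokens = Tokenizer.tokenize_raw("play it again")
    # Each token is expected to record its text and start index,
    # e.g. something like {'start': 0, 'text': 'play'}.
    print(tokens)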