mindmeld.text_preparation.normalizers module

This module contains Normalizers.

class mindmeld.text_preparation.normalizers.ASCIIFold[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

An ASCII Folding Normalizer.

fold_char_to_ascii(char)[source]

Return the ASCII character corresponding to the folding token.

Parameters:char -- ASCII folding token
Returns:an ASCII character
Return type:char
fold_str_to_ascii(text)[source]

Return the ASCII string corresponding to the folding token string.

Parameters:text (str) -- ASCII folding token string
Returns:an ASCII string
Return type:str
static load_ascii_folding_table()[source]

Load the mapping of ASCII code points to ASCII characters.

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
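
The idea behind ASCII folding can be illustrated with the standard library alone. The sketch below is not MindMeld's implementation (ASCIIFold loads its own folding table); it approximates folding by decomposing each character and dropping combining marks. The function name fold_to_ascii is hypothetical.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding (hypothetical helper, not the MindMeld API)."""
    # Split accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = fold_to_ascii("Café Zürich")
print(folded)  # Cafe Zurich
```

This approach leaves characters without a decomposition (for example "ø" or "ß") unchanged, which is one reason a dedicated folding table is useful.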
class mindmeld.text_preparation.normalizers.Lowercase[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

Lowercase Normalizer Class.

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
class mindmeld.text_preparation.normalizers.NFC[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

Unicode NFC Normalizer Class. (Canonical Decomposition, followed by Canonical Composition)

For more details: https://unicode.org/reports/tr15/#Norm_Forms

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
class mindmeld.text_preparation.normalizers.NFD[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

Unicode NFD Normalizer Class. (Canonical Decomposition)

For more details: https://unicode.org/reports/tr15/#Norm_Forms

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
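
The difference between the two canonical forms, NFC and NFD, can be seen with Python's standard unicodedata module. This is a minimal sketch of the Unicode forms themselves and assumes nothing about how the MindMeld classes are implemented.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# NFC composes the pair into one code point; NFD splits it back apart.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(len(nfc), len(nfd))  # 1 2
```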
class mindmeld.text_preparation.normalizers.NFKC[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

Unicode NFKC Normalizer Class. (Compatibility Decomposition, followed by Canonical Composition)

For more details: https://unicode.org/reports/tr15/#Norm_Forms

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
class mindmeld.text_preparation.normalizers.NFKD[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

Unicode NFKD Normalizer Class. (Compatibility Decomposition)

For more details: https://unicode.org/reports/tr15/#Norm_Forms

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
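
The compatibility forms, NFKC and NFKD, additionally replace compatibility characters such as ligatures with their plain equivalents. The sketch below uses Python's standard unicodedata module to show the Unicode behavior; it does not depend on MindMeld.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# Compatibility forms expand the ligature into the two letters "f" and "i".
nfkc = unicodedata.normalize("NFKC", ligature)
nfkd = unicodedata.normalize("NFKD", ligature)

# Canonical forms leave the ligature untouched.
nfc = unicodedata.normalize("NFC", ligature)

print(nfkc, nfkd, nfc == ligature)  # fi fi True
```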
class mindmeld.text_preparation.normalizers.NoOpNormalizer[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

A no-op Normalizer.

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Returns the original text.
Return type:normalized_text (str)
class mindmeld.text_preparation.normalizers.Normalizer[source]

Bases: abc.ABC

Abstract Normalizer Base Class.

normalize(text)[source]
Parameters:text (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
tojson()[source]

Method defined to obtain a recursive JSON representation of a TextPreparationPipeline.

Parameters:None.
Returns:JSON representation of Preprocessor (dict).
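
The abstract base class pattern described above can be sketched as follows. This is an illustration only: SketchNormalizer and CollapseWhitespace are hypothetical names, and the dictionary returned by tojson() is an assumed shape, not the format MindMeld produces.

```python
from abc import ABC, abstractmethod


class SketchNormalizer(ABC):
    """Hypothetical stand-in for an abstract normalizer base class."""

    @abstractmethod
    def normalize(self, text: str) -> str:
        """Return the normalized form of text."""

    def tojson(self) -> dict:
        # Assumed shape: report only the concrete class name.
        return {"type": type(self).__name__}


class CollapseWhitespace(SketchNormalizer):
    """Hypothetical subclass that collapses runs of whitespace."""

    def normalize(self, text: str) -> str:
        return " ".join(text.split())


normalizer = CollapseWhitespace()
print(normalizer.normalize("  two   large \t pizzas "))  # two large pizzas
```

Because normalize() is abstract, instantiating SketchNormalizer directly raises a TypeError; only subclasses that implement it can be created.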
class mindmeld.text_preparation.normalizers.NormalizerFactory[source]

Bases: object

Normalizer Factory Class.

static get_normalizer(normalizer: str)[source]

A static method to get a Normalizer.

Parameters:normalizer (str) -- Name of the desired Normalizer class
Returns:Normalizer Class
Return type:(Normalizer)
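
A name-based factory of this kind is commonly built on a lookup table. The sketch below shows that general pattern under stated assumptions: the registry, the stand-in classes, and the choice to return an instance and raise ValueError for unknown names are all hypothetical and may differ from what NormalizerFactory does.

```python
class SketchLowercase:
    """Hypothetical stand-in for a lowercase normalizer."""

    def normalize(self, text: str) -> str:
        return text.lower()


class SketchNoOp:
    """Hypothetical stand-in for a no-op normalizer."""

    def normalize(self, text: str) -> str:
        return text


# Hypothetical registry mapping names to classes.
_REGISTRY = {"Lowercase": SketchLowercase, "NoOpNormalizer": SketchNoOp}


def get_sketch_normalizer(name: str):
    """Look up a class by name and return an instance (assumed behavior)."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown normalizer: {name}") from None


print(get_sketch_normalizer("Lowercase").normalize("Hello World"))  # hello world
```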
class mindmeld.text_preparation.normalizers.RegexNormalizerRule(pattern: str, replacement: str)[source]

Bases: mindmeld.text_preparation.normalizers.Normalizer

normalize(s)[source]
Parameters:s (str) -- Input text.
Returns:Normalized Text.
Return type:normalized_text (str)
tojson()[source]

Method defined to obtain a recursive JSON representation of a TextPreparationPipeline.

Parameters:None.
Returns:JSON representation of Preprocessor (dict).
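
A rule that pairs a pattern with a replacement can be sketched with Python's re module. SketchRegexRule is a hypothetical class written for illustration; it mirrors the (pattern, replacement) constructor shown above, but its internals and its tojson() output are assumptions.

```python
import re


class SketchRegexRule:
    """Hypothetical regex rule: substitute every match of pattern."""

    def __init__(self, pattern: str, replacement: str):
        self.pattern = pattern
        self.replacement = replacement
        # Compile once so repeated calls to normalize() reuse the pattern.
        self._compiled = re.compile(pattern)

    def normalize(self, s: str) -> str:
        return self._compiled.sub(self.replacement, s)

    def tojson(self) -> dict:
        # Assumed shape: echo the constructor arguments.
        return {"pattern": self.pattern, "replacement": self.replacement}


rule = SketchRegexRule(r"\s+", " ")
result = rule.normalize("order   two\tpizzas")
print(result)  # order two pizzas
```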
class mindmeld.text_preparation.normalizers.RegexNormalizerRuleFactory[source]

Bases: object

static get_default_regex_normalizer_rule(regex_normalizer: str)[source]

Creates a RegexNormalizerRule object based on the given rule and the current EXCEPTION_CHARS.

Parameters:regex_normalizer (str) -- Name of the desired RegexNormalizerRule
Returns:Default Regex Normalizer Rule
Return type:(RegexNormalizerRule)
static get_regex_normalizers(regex_norm_rules)[source]

A static method to get a list of RegexNormalizerRule objects from regex_norm_rules.

Parameters:regex_norm_rules (List[Dict], optional) -- Regex normalization rules represented as dictionaries. The example rule below removes any text in parentheses.

{
    "pattern": "\\(.+?\\)",
    "replacement": ""
}

Returns:List of RegexNormalizerRule objects created from the regex_norm_rules provided.
Return type:regex_normalizer_rules (List[RegexNormalizerRule])
EXCEPTION_CHARS = "\\@\\[\\]'"
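
The rule dictionaries accepted by get_regex_normalizers can be tried out with plain re.sub before wiring them into MindMeld. The sketch below applies a list of such dictionaries in order; apply_rules is a hypothetical helper, and the second rule is added here only to tidy up the leftover double space.

```python
import re

regex_norm_rules = [
    # Remove any text in parentheses (the parentheses must be escaped).
    {"pattern": r"\(.+?\)", "replacement": ""},
    # Hypothetical extra rule: collapse runs of two or more whitespace characters.
    {"pattern": r"\s{2,}", "replacement": " "},
]


def apply_rules(text: str, rules: list) -> str:
    """Apply each pattern/replacement dictionary in order (hypothetical helper)."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text


cleaned = apply_rules("large pizza (extra cheese) please", regex_norm_rules)
print(cleaned)  # large pizza please
```

An unescaped pattern such as "(.+?)" would be read as a capturing group that matches every character, so the escaping matters for the parentheses example.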