mindmeld.text_preparation.text_preparation_pipeline module

This module contains MindMeld's text preparation pipeline.

exception mindmeld.text_preparation.text_preparation_pipeline.TextPreparationPipelineError[source]

Bases: Exception

class mindmeld.text_preparation.text_preparation_pipeline.TextPreparationPipeline(tokenizer: mindmeld.text_preparation.tokenizers.Tokenizer, stemmer: mindmeld.text_preparation.stemmers.Stemmer = None, preprocessors: List[mindmeld.text_preparation.preprocessors.Preprocessor] = None, normalizers: List[mindmeld.text_preparation.normalizers.Normalizer] = None, language: str = 'en')[source]

Bases: object

Pipeline Class for MindMeld's text processing.

append_normalizer(normalizer: mindmeld.text_preparation.normalizers.Normalizer)[source]

Adds a normalizer to the TextPreparationPipeline.

Parameters:normalizer (Normalizer) -- Normalizer to append to the current normalizers.

append_preprocessor(preprocessor: mindmeld.text_preparation.preprocessors.Preprocessor)[source]

Adds a preprocessor to the TextPreparationPipeline.

Parameters:preprocessor (Preprocessor) -- Preprocessor to append to the current preprocessors.

static calc_unannotated_spans(text)[source]

Calculates the spans of text that exclude mindmeld entity annotations. For example, "{Lucien|person_name}" would return [(1,7)] since "Lucien" is the only text that is not the annotation.

Parameters:text (str) -- Original sentence with markup to modify.
Returns:The list of spans where each span is a section of the original text excluding mindmeld entity annotation class types and markup symbols ("{", "|", "}"). The first element of each tuple is the start index and the second is the ending index + 1.
Return type:unannotated_spans (List[Tuple(int, int)])
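The span calculation can be sketched with a regex over the annotation markup. This is a minimal illustration, not MindMeld's actual implementation; the regex pattern is an assumption based on the "{entity_text|entity_type}" format described above.

```python
import re

# Assumed pattern for mindmeld annotations: {entity_text|entity_type}
ANNOTATION_RE = re.compile(r"\{([^{}|]+)\|[^{}]+\}")

def calc_unannotated_spans(text):
    """Return (start, end) spans covering all non-markup characters."""
    spans = []
    cursor = 0
    for match in ANNOTATION_RE.finditer(text):
        if match.start() > cursor:
            # plain text before the annotation
            spans.append((cursor, match.start()))
        # the entity text inside the braces, excluding "{", "|type", "}"
        spans.append((match.start(1), match.end(1)))
        cursor = match.end()
    if cursor < len(text):
        spans.append((cursor, len(text)))
    return spans
```

For "{Lucien|person_name}" this yields [(1, 7)], matching the example in the docstring.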
static convert_token_idx_unannotated_to_annotated(tokens, unannotated_to_annotated_idx_map)[source]

In-place function that reverts the token start indices to the indices of the corresponding characters in the original text with annotations.

Parameters:
  • unannotated_to_annotated_idx_map (List[Tuple(int, int)]) -- A vector where the value at each index represents the mapping of the position of a single character in the unannotated text to the position in the original text.
  • tokens (List[dict]) -- List of tokens represented as dictionaries. With "start" indices referring to the unannotated text.
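A minimal sketch of this reverse mapping, assuming the token dictionary shape and index-map format described above (the implementation itself is an assumption, not MindMeld's code):

```python
def convert_token_idx_unannotated_to_annotated(tokens, unannotated_to_annotated_idx_map):
    """In place, map each token's "start" index from the unannotated text
    back to the corresponding character position in the annotated text."""
    for token in tokens:
        token["start"] = unannotated_to_annotated_idx_map[token["start"]]

tokens = [{"text": "Lucien", "start": 0}]
# e.g. for "{Lucien|person_name}", unannotated char 0 sits at annotated index 1
idx_map = [1, 2, 3, 4, 5, 6]
convert_token_idx_unannotated_to_annotated(tokens, idx_map)
```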
custom_preprocessors_exist()[source]

Checks if the current TextPreparationPipeline has preprocessors that are not simply the NoOpPreprocessor or None.

Returns:Whether at least one custom preprocessor exists.
Return type:has_custom_preprocessors (bool)
static filter_out_space_text_tokens(tokens: List[Dict])[source]

Filter out any tokens where the text of the token only consists of space characters.

Parameters:tokens (List[Dict]) -- List of tokens represented as dictionaries
Returns:List of filtered tokens.
Return type:filtered_tokens (List[Dict])
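The filtering step amounts to a one-line comprehension; a sketch under the assumption that each token dictionary carries a "text" key:

```python
def filter_out_space_text_tokens(tokens):
    """Drop tokens whose text is empty or consists only of whitespace."""
    return [token for token in tokens if token["text"].strip()]
```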
static find_mindmeld_annotation_re_matches(text)[source]
Parameters:text (str) -- The string to find mindmeld annotation instances (" {entity_text|entity_type} ")
Returns:Regex match objects.
Return type:matches (List[sre.SRE_Match object])
static get_char_index_map(raw_text, normalized_text)[source]

Generates a character index mapping from the normalized query to the raw query. The entity model always operates on the normalized query during NLP processing, but entity output requires indices based on the raw query.

The mapping is generated by calculating edit distance and backtracking to get the proper alignment.

Parameters:
  • raw_text (str) -- Raw query text.
  • normalized_text (str) -- Normalized query text.
Returns:A mapping of character indexes from normalized query to raw query.
Return type:dict
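The edit-distance alignment can be illustrated with a standard dynamic-programming table and a backtrack that prefers matches and substitutions. This is a simplified sketch of the technique named above; MindMeld's actual implementation may differ in tie-breaking and edge cases.

```python
def get_char_index_map(raw_text, normalized_text):
    """Map each normalized-text index to a raw-text index via edit distance."""
    n, m = len(raw_text), len(normalized_text)
    # dp[i][j] = edit distance between raw_text[:i] and normalized_text[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if raw_text[i - 1] == normalized_text[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete from raw
                           dp[i][j - 1] + 1,          # insert into raw
                           dp[i - 1][j - 1] + cost)   # match / substitute
    # Backtrack to align normalized characters with raw characters.
    mapping, i, j = {}, n, m
    while i > 0 and j > 0:
        cost = 0 if raw_text[i - 1] == normalized_text[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + cost:
            mapping[j - 1] = i - 1
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    while j > 0:  # leading normalized chars with no raw counterpart left
        mapping[j - 1] = 0
        j -= 1
    return mapping
```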

get_hashid()[source]

Obtains a hash value of the TextPreparationPipeline.

Returns:256-character hash representation of the current TextPreparationPipeline config (str).
get_normalized_tokens_as_tuples(text)[source]

Gets normalized tokens from input text and returns the result as a tuple.

Parameters:text (str) -- Text to normalize.
Returns:A Tuple of normalized tokens.
Return type:normalized_tokens_as_tuples (Tuple(str))
static modify_around_annotations(text, function)[source]

Applies a function around each mindmeld annotation, leaving the annotation markup intact:

function(pre_entity_text) + { + function(entity_text) + |entity_name} + function(post_entity_text)

Parameters:
  • text (str) -- Original sentence with markup to modify.
  • function (function) -- Function to apply around the annotation.
Returns:Text modified around annotations.
Return type:modified_text (str)
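The pattern above can be sketched with a regex split around the markup. This is an illustrative implementation only, with an assumed annotation regex; MindMeld's code may handle nesting and whitespace differently.

```python
import re

# Assumed pattern for mindmeld annotations: {entity_text|entity_type}
ANNOTATION_RE = re.compile(r"\{([^{}|]+)\|([^{}]+)\}")

def modify_around_annotations(text, function):
    """Apply function to the text around and inside annotations,
    leaving the markup symbols and entity type untouched."""
    result = []
    cursor = 0
    for match in ANNOTATION_RE.finditer(text):
        result.append(function(text[cursor:match.start()]))  # pre-entity text
        result.append("{" + function(match.group(1)) + "|" + match.group(2) + "}")
        cursor = match.end()
    result.append(function(text[cursor:]))  # post-entity text
    return "".join(result)
```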

normalize(text, keep_special_chars=None)[source]

Normalizes text.

Parameters:
  • text (str) -- Text to normalize.
  • keep_special_chars (bool) -- Whether to prevent special characters (such as @, [, ]) from being removed during normalization. No longer supported at the function level; it can be specified in the config.
Returns:Normalized text.
Return type:normalized_text (str)
static offset_token_start_values(tokens: List[Dict], offset: int)[source]
Parameters:
  • tokens (List(Dict)) -- List of tokens represented as dictionaries.
  • offset (int) -- Amount by which to offset the start value of each token.
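The offsetting is a simple in-place shift; a sketch assuming the token dictionary shape used throughout this module:

```python
def offset_token_start_values(tokens, offset):
    """In place, shift the "start" index of every token by offset."""
    for token in tokens:
        token["start"] += offset

tokens = [{"text": "Lucien", "start": 0}, {"text": "sings", "start": 7}]
offset_token_start_values(tokens, 3)
```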
preprocess(text)[source]
Parameters:text (str) -- Input text.
Returns:
  • normalized_text (str) -- Normalized text.
  • forward_map (Dict) -- Mapping from raw text to modified text.
  • backward_map (Dict) -- Reverse mapping from modified text to raw text.
stem_word(word)[source]

Gets the stem of a word. For example, the stem of the word 'fishing' is 'fish'.

Parameters:word (str) -- The word to stem.
Returns:The stem of the word.
Return type:stemmed_word (str)
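For illustration only, stemming can be approximated with toy suffix stripping. This is not MindMeld's stemmer (which is language-specific and considerably more sophisticated); it only shows the word-to-stem mapping described above.

```python
def naive_stem(word):
    """Toy stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```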
tojson()[source]

Obtains a recursive JSON representation of the TextPreparationPipeline.

Returns:JSON representation of the TextPreparationPipeline (dict).
tokenize(text, keep_special_chars=None)[source]
Parameters:
  • text (str) -- Input text.
  • keep_special_chars (bool) -- Whether to prevent special characters (such as @, [, ]) from being removed in the normalization process. No longer supported at the function level, can be specified in the config.
Returns:List of tokens represented as dictionaries.
Return type:tokens (List[dict])

tokenize_and_normalize(text)[source]
Parameters:text (str) -- Text to normalize.
Returns:
Normalized tokens represented as dictionaries. For example:

norm_token = {
    "entity": "order",
    "raw_entity": "order",
    "raw_token_index": 1,
    "raw_start": 1
}

Return type:normalized_tokens (List[Dict])
tokenize_around_mindmeld_annotations(text)[source]

Applies the tokenizer around each mindmeld annotation, leaving the annotation markup intact:

tokenize(pre_entity_text) + { + tokenize(entity_text) + |entity_name} + tokenize(post_entity_text)
Parameters:text (str) -- Original sentence with markup to modify.
Returns:List of tokens represented as dictionaries.
Return type:tokens (List[dict])
tokenize_using_spacy(text)[source]

Wrapper function used before tokenizing with spaCy. Combines all unannotated text spans into a single string to pass to spaCy for tokenization, then applies the correct offset to the resulting tokens to align them with the annotated text. This optimization reduces the overall time needed for tokenization.

Parameters:text (str) -- Input text.
Returns:List of tokens represented as dictionaries.
Return type:tokens (List[dict])
static unannotated_to_annotated_idx_map(unannotated_spans)[source]

Create a vector mapping indexes from the unannotated text to the original text.

Parameters:unannotated_spans (List[Tuple(int, int)]) -- The list of spans where each span is a section of the original text excluding mindmeld entity annotations of class type and markup symbols ("{", "|", "}"). The first element of the tuple is the start index and the second is the ending index + 1.
Returns:A vector where the value at each index represents the mapping of the position of a single character in the unannotated text to the position in the original text.
Return type:unannotated_to_annotated_idx_map (List[int])
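The mapping vector can be built by flattening the spans; a minimal sketch consistent with the span format described above (not necessarily MindMeld's exact code):

```python
def unannotated_to_annotated_idx_map(unannotated_spans):
    """Flatten (start, end) spans into a per-character index map:
    position i in the unannotated text maps to idx_map[i] in the original."""
    idx_map = []
    for start, end in unannotated_spans:
        idx_map.extend(range(start, end))
    return idx_map
```

For example, the spans [(1, 7)] produced for "{Lucien|person_name}" flatten to [1, 2, 3, 4, 5, 6].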
language
normalizers
preprocessors
stemmer
tokenizer
class mindmeld.text_preparation.text_preparation_pipeline.TextPreparationPipelineFactory[source]

Bases: object

Creates a TextPreparationPipeline object.

static create_default_text_preparation_pipeline()[source]

Creates the default text_preparation_pipeline used internally across MindMeld.

static create_from_app_config(app_path)[source]

Static method to create a TextPreparationPipeline based on the specifications in the config.

Parameters:app_path (str) -- The application path.
Returns:A TextPreparationPipeline instance.
Return type:TextPreparationPipeline
static create_from_app_path(app_path)[source]

Static method to create a TextPreparationPipeline instance from an app_path. If a custom text_preparation_pipeline is passed into the Application object in the app_path/__init__.py file then it will be used. Otherwise, a text_preparation_pipeline will be created based on the specifications in the config.

Parameters:app_path (str) -- The application path.
Returns:A TextPreparationPipeline instance.
Return type:TextPreparationPipeline
static create_text_preparation_pipeline(language: str = 'en', preprocessors: Tuple[Union[str, mindmeld.text_preparation.preprocessors.Preprocessor]] = None, regex_norm_rules: List[Dict] = None, keep_special_chars: str = None, normalizers: Tuple[Union[str, mindmeld.text_preparation.normalizers.Normalizer]] = None, tokenizer: Union[str, mindmeld.text_preparation.tokenizers.Tokenizer] = None, stemmer: Union[str, mindmeld.text_preparation.stemmers.Stemmer] = None)[source]

Static method to create a TextPreparationPipeline instance.

Parameters:
  • language (str, optional) -- Language as specified using an ISO 639-1/2 code.
  • preprocessors (Tuple[Union[str, Preprocessor]]) -- List of preprocessor class names or objects.
  • regex_norm_rules (List[Dict]) -- List of regex normalization rules represented as dictionaries. ({"pattern":<pattern>, "replacement":<replacement>})
  • normalizers (Tuple[Union[str, Normalizer]]) -- List of normalizer class names or objects.
  • tokenizer (Union[str, Tokenizer]) -- Class name of Tokenizer to use or Tokenizer object.
  • stemmer (Union[str, Stemmer]) -- Class name of Stemmer to use or Stemmer object.
Returns:A TextPreparationPipeline instance.
Return type:TextPreparationPipeline
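The regex_norm_rules format above can be illustrated with plain re.sub. The rule dictionaries follow the {"pattern": ..., "replacement": ...} shape documented for create_text_preparation_pipeline; how MindMeld applies them internally is not shown here, so this helper is a sketch.

```python
import re

# Each rule is a dict with "pattern" and "replacement" keys.
regex_norm_rules = [
    {"pattern": r"\s+", "replacement": " "},    # collapse runs of whitespace
    {"pattern": r"[!?]+$", "replacement": ""},  # strip trailing punctuation
]

def apply_regex_norm_rules(text, rules):
    """Apply each regex normalization rule to the text in order."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text
```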

class mindmeld.text_preparation.text_preparation_pipeline.TextPreparationPipelineJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

Custom encoder class defined to obtain a recursive JSON representation of a TextPreparationPipeline.

Returns:Custom JSON encoder class (json.JSONEncoder).
default(o)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)