Working with the Text Preparation Pipeline

The Text Preparation Pipeline

  • is MindMeld's text processing module which handles the preprocessing, tokenization, normalization, and stemming of raw queries
  • has extensible components that can be configured based on language-specific requirements
  • offers components that integrate functionality from Spacy, NLTK and Regex

The TextPreparationPipeline processes text in the following order:

../_images/text_preparation_pipeline.png

The diagram below is a visual example of a TextPreparationPipeline that uses Spacy for tokenization, Regex for normalization and NLTK for stemming.

../_images/text_preparation_pipeline_sample_breakdown.png

TextPreparationPipeline Configuration

The DEFAULT_TEXT_PREPARATION_CONFIG is shown below. Observe that various normalization classes have been pre-selected by default. To modify the selected components (or to use a subset of the normalization steps), duplicate the default config and rename it to TEXT_PREPARATION_CONFIG. Place this custom config in the config.py file for your application. If a custom configuration is not defined, a default is used. The config below is an example of a default config specifically for English. The normalizers component includes 12 default MindMeld regex normalization rules in addition to Lowercase and ASCIIFold.

DEFAULT_EN_TEXT_PREPARATION_CONFIG = {
    "preprocessors": [],
    "tokenizer": "SpacyTokenizer",
    "normalizers": [
        'RemoveAposAtEndOfPossesiveForm',
        'RemoveAdjacentAposAndSpace',
        'RemoveBeginningSpace',
        'RemoveTrailingSpace',
        'ReplaceSpacesWithSpace',
        'ReplaceUnderscoreWithSpace',
        'SeparateAposS',
        'ReplacePunctuationAtWordStartWithSpace',
        'ReplacePunctuationAtWordEndWithSpace',
        'ReplaceSpecialCharsBetweenLettersAndDigitsWithSpace',
        'ReplaceSpecialCharsBetweenDigitsAndLettersWithSpace',
        'ReplaceSpecialCharsBetweenLettersWithSpace',
        'Lowercase',
        'ASCIIFold'
    ],
    "regex_norm_rules": [],
    "stemmer": "EnglishNLTKStemmer",
    "keep_special_chars": r"\@\[\]'"
}

In general, the Tokenizer and Stemmer components are dynamically selected based on the language of the application, if they are not explicitly defined in the config. The table below explains these defaults:

Config Element Default Condition
Preprocessors None Always
Normalizers

12 Default MindMeld Regex Rules

Lowercase

ASCII Fold

Always
Regex Norm Rules None Always
Tokenizer SpacyTokenizer If the language is supported by Spacy: English (en), Spanish (es), French (fr), German (de), Danish (da), Greek (el), Portuguese (pt), Lithuanian (lt), Norwegian Bokmal (nb), Romanian (ro), Polish (pl), Italian (it), Japanese (ja), Chinese (zh), or Dutch (nl).
  WhiteSpaceTokenizer If the language is not supported by Spacy.
Stemmer EnglishNLTKStemmer If the language is English.
  SnowballStemmer If the language is supported by NLTK's SnowballStemmer: Danish (da), Dutch (nl), Finnish (fi), French (fr), German (de), Hungarian (hu), Italian (it), Norwegian (nb), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es) and Swedish (sv).
  NoOpStemmer If the language is not English and is not supported by NLTK's SnowballStemmer.
keep_special_chars @, [, ], ' Always

Let's define the parameters in the TextPreparationPipeline config:

'preprocessors' (List[str]): The preprocessor class to use. (Mindmeld does not currently offer default preprocessors.)

'tokenizer' (str): The tokenization method to split raw queries.

'normalizers' (List[str]): List of normalization classes. The text will be normalized sequentially given the order of the normalizers specified.

'keep_special_chars' (str): String containing characters to be skipped when normalizing/filtering special characters. This only applies for a subset of default MindMeld normalization rules.

'regex_norm_rules' (List[Dict]): Regex normalization rules represented as dictionaries. Each rule should have the key "pattern" and "replacement" which map to a regex pattern (str) and replacement string, respectively. For example: { "pattern": "_", "replacement": " " }.

'stemmer (str): The stemmer class to reduce words to their word stem.

Note

If regex_norm_rules are specified in the config they will be applied before other normalization rules. This includes the default normalization rules if normalization rules are not explicitly defined in the config.

Preprocessing

By preprocessing text, we can make modifications to raw text before it is processed through the TextPreparationPipeline. Examples of some common preprocessing tasks include spelling correction, punctuation removal, handling special characters, and other kinds of application-specific text normalization. Currently, MindMeld does not offer pre-built processors, however, the pipeline does support custom preprocessors.

Creating a Custom Preprocessor

This section includes boilerplate code to build a custom preprocessor class. Let's build a simple ASRPreprocessor class that corrects mistranscriptions which is a common problem with Automatic Speech Recognition systems. A custom preprocessor must extend from MindMeld's abstract Preprocessor class:

from abc import ABC, abstractmethod

class Preprocessor(ABC):
    """
    Base class for Preprocessor object
    """
    @abstractmethod
    def process(self, text):
        """
        Args:
            text (str)

        Returns:
            (str)
        """
        pass

Now that we have a better understanding of the base class. Let's create a simple ASRPreprocessor class which implements the abstract preprocess method and replaces any substring of "croissant ready" with the intended name, "prasanth reddy".

from mindmeld.text_preparation.preprocessors import Preprocessor

class ASRPreprocessor(Preprocessor):
    """ Sample Preprocessor Class """

    def process(self, text):
        return text.replace("croissant ready", "Prasanth Reddy")

This would transform the transcript "Let's start the meeting with croissant ready." to "Let's start the meeting with Prasanth Reddy." The steps to use a custom Preprocessor in your application are explained here.

Tokenization

Tokenization is the process of splitting the text of a query into smaller chunks. MindMeld offers a number of ready-made tokenizers that you can use for your application. MindMeld supports the development of custom tokenizers as well.

White Space Tokenizer

The WhiteSpaceTokenizer splits up a sentence by whitespace characters. For example, we can run:

from mindmeld.text_preparation.tokenizers import WhiteSpaceTokenizer

sentence = "MindMeld is a Conversational AI Platform."
white_space_tokenizer = WhiteSpaceTokenizer()
tokens = white_space_tokenizer.tokenize(sentence)
print([t['text'] for t in tokens])

We find that the resulting tokens are split by whitespace as expected.

['MindMeld', 'is', 'a', 'Conversational', 'AI', 'Platform.']

Character Tokenizer

The CharacterTokenizer splits up a sentence by the individual characters. This can be helpful for languages such as Japanese. Let's break apart the Japanese translation for the phrase "The tall man":

from mindmeld.text_preparation.tokenizers import CharacterTokenizer

sentence_ja = "背の高い男性"
character_tokenizer = CharacterTokenizer()
tokens = character_tokenizer.tokenize(sentence_ja)
print([t['text'] for t in tokens])

We see that the original text is split at the character level.

['背', 'の', '高', 'い', '男', '性']

Letter Tokenizer

The LetterTokenizer splits text into a separate token if the character proceeds a space, is a non-latin character, or is a different unicode category than the previous character.

This can be helpful to keep characters of the same type together. Let's look at an example with numbers in a Japanese sentence, "1年は365日". This sentence translates to "One year has 365 days".

from mindmeld.text_preparation.tokenizers import LetterTokenizer

sentence_ja = "1年は365日"
letter_tokenizer = LetterTokenizer()
tokens = letter_tokenizer.tokenize(sentence_ja)
print([t['text'] for t in tokens])

We see that the original text is split at the character level for non-latin characters but the number "365" remains as an unsegmented token.

['1', '年', 'は', '365', '日']

Spacy Tokenizer

The SpacyTokenizer splits up a sentence using Spacy's language models. Supported languages include English (en), Spanish (es), French (fr), German (de), Danish (da), Greek (el), Portuguese (pt), Lithuanian (lt), Norwegian Bokmal (nb), Romanian (ro), Polish (pl), Italian (it), Japanese (ja), Chinese (zh), Dutch (nl). If the required Spacy model is not already present it will automatically downloaded during runtime. Let's use the SpacyTokenizer to tokenize the Japanese translation of "The gentleman is gone, no one knows why it happened!":

from mindmeld.text_preparation.tokenizers import SpacyTokenizer

sentence_ja = "背の高い男性"
spacy_tokenizer_ja = SpacyTokenizer(language="ja", spacy_model_size="sm")
tokens = spacy_tokenizer_ja.tokenize(sentence_ja)

We see that the original text is split semantically and not simply by whitespace.

['背', 'の', '高い', '男性']

Creating a Custom Tokenizer

This section includes boilerplate code to build a custom tokenizer class. Let's rebuild a CharacterTokenizer class that creates a token for each character in a string as long as the character is not a space. A custom tokenizer must extend from MindMeld's abstract Tokenizer class:

from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Abstract Tokenizer Base Class."""

    @abstractmethod
    def tokenize(self, text):
        """
        Args:
            text (str): The text to tokenize.
        Returns:
            tokens (List[Dict]): List of tokenized tokens which a represented as dictionaries.
                Keys include "start" (token starting index), and "text" (token text).
                For example: [{"start": 0, "text":"hello"}]
        """
        raise NotImplementedError("Subclasses must implement this method")

Note that any MindMeld tokenizer must return the final tokens as a list of dictionaries. Where each dictionary represents a single token and contains the "start" index of the token and the "text" of the token. Here is an example of the expected output for the tokens generated when tokenizing the phrase "Hi Andy": [{"start": 0, "text":"Hi"}, {"start": 3, "text":"Andy"}]. The starting indices here refer to the starting indices in the processed text. With this in mind, let's recreate MindMeld's CharacterTokenizer class which converts every individual character in a string into a separate token while skipping spaces.

from mindmeld.text_preparation.tokenizers import Tokenizer

class CharacterTokenizer(Tokenizer):
    """A Tokenizer that splits text at the character level."""

    def tokenize(self, text):
        tokens = []
        for idx, char in enumerate(text):
            if not char.isspace():
                tokens.append({"start": idx, "text": char})
        return tokens

This tokenizes the phrase "Hi Andy" in the following manner:

[
    {'start': 0, 'text': 'H'},
    {'start': 1, 'text': 'i'},
    {'start': 3, 'text': 'A'},
    {'start': 4, 'text': 'n'},
    {'start': 5, 'text': 'd'},
    {'start': 6, 'text': 'y'}
]

The steps to use a custom Tokenizer in your application are explained here.

Normalization

Normalization is the process of transforming text into a standardized form. MindMeld supports the use of multiple normalizers to be applied to the original raw query in a sequential manner. MindMeld offers a number of pre-built normalizers that can be specified in the config.py file. MindMeld also supports the development of custom normalizers to meet application-specific requirements.

Note

Normalization and Tokenization are conducted around MindMeld's entity annotations. For example, let's look at the query, "Where is {Andy Neff|person_name} located?". Let's assume our normalization method is to use the Uppercase value of each character. The TextPreparationPipeline will normalize the query to become the following: "WHERE IS {ANDY NEFF|person_name} LOCATED?". Notice that the entity name in the entity annotation is not modified. A similar process happens during tokenization. Another way to think of this, is that the entity annotations are "temporarily removed" before normalization and then added back in.

Default Regex Normalization

By default, MindMeld uses 12 Regex-based normalization rules when normalizing texts (in addition to Lowercase and ASCIIFold). Descriptions for these 12 rules can be found in the table below.

Regex Normalization Rule Description Example Input Example Output
RemoveAposAtEndOfPossesiveForm Removes any apostrophe following an 's' at the end of a word. "dennis' truck" "dennis truck"
RemoveAdjacentAposAndSpace Removes apostrophes followed by a space character and apostrphes that precede a space character. "havana' " "havana"
RemoveBeginningSpace Removes extra spaces at the start of a word. " MindMeld" "MindMeld"
RemoveTrailingSpace Removes extra spaces at the end of a word. "MindMeld " "MindMeld"
ReplaceSpacesWithSpace Replaces multiple consecutive spaces with a single space. "How are you?" "How are you?"
ReplaceUnderscoreWithSpace Replaces underscore with a single space. "How_are_you?" "How are you?"
SeparateAposS Adds a space before 's. "mindmeld's code" "mindmeld 's code"
ReplacePunctuationAtWordStartWithSpace Replaces special characters infront of words with a space. "HI %#++=-=SPERO" "HI SPERO"
ReplacePunctuationAtWordEndWithSpace Replaces special characters following words with a space. "How%+=* are++- you^^%" "How are you"
ReplaceSpecialCharsBetweenLettersAndDigitsWithSpace Replaces special characters between letters and digits with a space. "Coding^^!#%24 hours#%7 days" "Coding 24 hours 7 days"
ReplaceSpecialCharsBetweenDigitsAndLettersWithSpace Replaces special characters between digits and letters with a space. "Coding 24^^!#%%hours 7##%days" "Coding 24 hours 7 days"
ReplaceSpecialCharsBetweenLettersWithSpace Replaces special characters between letters and letters with a space. "Coding all^^!#%%hours seven##%days" "Coding all hours seven days"

The last 5 rules (ReplacePunctuationAtWordStartWithSpace, ReplacePunctuationAtWordEndWithSpace, ReplaceSpecialCharsBetweenLettersAndDigitsWithSpace, ReplaceSpecialCharsBetweenDigitsAndLettersWithSpace, ReplaceSpecialCharsBetweenLettersWithSpace) above remove special characters in different contexts. These special characters can be specified in the config using the key, keep_special_chars. By default, keep_special_chars includes @, [, ] and ' represented as a single string. A custom set of special characters can be specified in config.py.

Lowercase Normalization

The Lowercase normalizer converts every character in a string to its lowercase equivalent. For example:

from mindmeld.text_preparation.normalizers import Lowercase

sentence = "I Like to Run!"
lowercase_normalizer = Lowercase()
normalized_text = lowercase_normalizer.normalize(sentence)
print(normalized_text)

As expected, this would display the following normalized text:

'i like to run!'

ASCII Fold Normalization

The ASCIIFold normalizer converts numeric, symbolic and alphabetic characters which are not in the first 127 ASCII characters (Basic Latin Unicode block) into an ASCII equivalent (if possible).

For example, we can normalize the following Spanish sentence with several accented characters:

from mindmeld.text_preparation.normalizers import ASCIIFold

sentence_es = "Ha pasado un caballero, ¡quién sabe por qué pasó!"
ascii_fold_normalizer = ASCIIFold()
normalized_text = ascii_fold_normalizer.normalize(sentence_es)
print(normalized_text)

The accents are removed and the accented characters have been replaced with compatible ASCII equivalents.

'Ha pasado un caballero, ¡quien sabe por que paso!'

Unicode Character Normalization

Unicode Character Normalization includes techniques such as NFD, NFC, NFKD, NFKC. These methods break down characters into their canonical or compatible character equivalents as defined by unicode. Let's take a look at an example. Say we are trying to normalize the word quién using NFKD.

from mindmeld.text_preparation.normalizers import NFKD

nfd_normalizer = NFKD()
text = "quién"
normalized_text = nfd_normalizer.normalize(text)

Interestingly, we find that the normalized text looks identical with the original text, it is not quite the same.

>>> print(text, normalized_text)
>>> quién quién
>>> print(text == normalized_text)
>>> False

We can print the character values for each of the texts and observe the the normalization has actually changed the representaation for é.

>>> print([ord(c) for c in text])
>>> [113, 117, 105, 233, 110]
>>> print([ord(c) for c in normalized_text])
>>> [113, 117, 105, 101, 769, 110]

Creating a Custom Normalizer

This section includes boilerplate code to build a custom normalizer class. Let's recreate the Lowercase normalizer class. A custom tokenizer must extend from MindMeld's abstract Normalizer class:

from abc import ABC, abstractmethod

class Normalizer(ABC):
    """Abstract Normalizer Base Class."""

    @abstractmethod
    def normalize(self, text):
        """
        Args:
            text (str): Text to normalize.
        Returns:
            normalized_text (str): Normalized Text.
        """
        raise NotImplementedError("Subclasses must implement this method")

With this in mind, let's recreate MindMeld's Lowercase normalizer class.

from mindmeld.text_preparation.normalizers import Normalizer

class Lowercase(Normalizer):

    def normalize(self, text):
        return text.lower()

This normalizer would transform the text "I Like to Run!" to "i like to run!". The steps to use a custom Normalizer in your application are explained here.

Note

MindMeld normalizes queries on a per-token basis. Custom normalizers should be designed to normalize individual tokens and not sentences as a whole.

Stemming

Stemming is the process of reducing a word to its stem or root. If a stemmer is not specified in the TEXT_PREPARATION_CONFIG, then MindMeld will automatically select a stemmer based on the language of the application.

EnglishNLTKStemmer

The EnglishNLTKStemmer stemmer uses a modified version of the PorterStemmer from the nltk library. The Porter stemmer implements a series of rules that removes common suffixes, and this version of it removes inflectional suffixes but leaves (most) derivational suffixes in place. This includes removing the final letters "s"/"es" from plural words or "ing" from gerunds, but leaving more meaningful suffixes like "tion" and "ment" alone.

Let's take a look at a few examples of the EnglishNLTKStemmer. First we'll make an instance of the stemmer:

from mindmeld.text_preparation.stemmers import EnglishNLTKStemmer
english_nltk_stemmer = EnglishNLTKStemmer()

Now let's stem the words "running" and "governments".

>>> print(english_nltk_stemmer.stem_word("running"))
>>> run
>>> print(english_nltk_stemmer.stem_word("governments"))
>>> government

As expected, the stemmer removes "ing" from "running" and the "s" from "governments" to create stemmed words.

SnowballNLTKStemmer

The SnowballNLTKStemmer stemmer works in a similar manner to the EnglishNLTKStemmer, however, it removes more suffixes and offers support for a larger set of languages. Namely, the SnowballNLTKStemmer supports Danish (da), Dutch (nl), Finnish (fi), French (fr), German (de), Hungarian (hu), Italian (it), Norwegian (nb), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es) and Swedish (sv).

To create an instance of the SnowballNLTKStemmer, we can use MindMeld's StemmerFactory.

from mindmeld.text_preparation.stemmers import SnowballNLTKStemmer
es_snowball_stemmer = SnowballNLTKStemmer("spanish")

Now let's stem the words "corriendo" ("running") and "gobiernos" ("governments").

>>> print(es_snowball_stemmer.stem_word("corriendo"))
>>> corr
>>> print(es_snowball_stemmer.stem_word("gobiernos"))
>>> gobi

As expected, the stemmer removes "iendo" from "corriendo" and the "ernos" from "gobiernos" to create stemmed words.

Creating a Custom Stemmer

This section includes boilerplate code to build a custom stemmer class. A custom stemmer must extend from MindMeld's abstract Stemmer class:

from abc import ABC, abstractmethod

class Stemmer(ABC):

    @abstractmethod
    def stem_word(self, word):
        """
        Gets the stem of a word. For example, the stem of the word 'fishing' is 'fish'.

        Args:
            word (str): The word to stem

        Returns:
            stemmed_word (str): A stemmed version of the word
        """
        raise NotImplementedError

Let's create a stemmer that uses Spacy's lemmatization functionality to use lemmatized tokens. We'll call it the SpacyLemmatizer.

from mindmeld.text_preparation.stemmers import Stemmer

class SpacyLemmatizer(Stemmer):

    def __init__(self):
        self.nlp = spacy.load('en_core_web_sm')

    def stem_word(self, word):
        """
        Args:
            word (str): The word to stem

        Returns:
            stemmed_word (str): A lemmatized version of the word
        """

        doc = self.nlp(word)
        return " ".join([token.lemma_ for token in doc])

The SpacyLemmatizer would transform "ran" to "run". The steps to use a custom Stemmer in your application are explained in the section below.

Using a Custom TextPreparationPipeline for your Application

As a recap, every MindMeld project is also a Python package and has an __init.py__ file at the root level. This package also contains an application container -- a container for all of the logic and functionality for your application. This application container enumerates all of the dialogue states and their associated handlers, and should be defined as app in the application's Python package. To use a TextPreparationPipeline with custom components, we must pass in a custom object into the application container in __init.py__. Let's first take a look at an example of an __init.py__ file before a custom TextPreparationPipeline used.

root/__init__.py (Without a Custom Pipeline)
from mindmeld import Application

app = Application(__name__)

@app.handle(intent='greet')
def welcome(request, responder):
    responder.reply('Hello')

Now let's look at this __init.py__ file after a custom TextPreparationPipeline is used. To isolate the logic and functionality of our custom TextPreparationPipeline let's create the object in a separate file at the root level, we'll call it text_preparation_pipeline.py. text_preparation_pipeline.py will contain a function get_text_preparation_pipeline() which we can use to pass the custom pipeline into the application container.

root/__init__.py (With a Custom Pipeline)
from mindmeld import Application
from .text_preparation_pipeline import get_text_preparation_pipeline

app = Application(__name__, text_preparation_pipeline=get_text_preparation_pipeline())

@app.handle(intent='greet')
def welcome(request, responder):
    responder.reply('Hello')

In the text_preparation_pipeline.py file we'll implement the get_text_preparation_pipeline() method which returns a custom TextPreparationPipeline object. Let's piece together multiple custom components into a single TextPreparationPipeline. We will define and use an ASRPreprocessor, GerundSuffixStemmer and RemoveExclamation normalizer. In the code below, we have created each of our components by implementing the respective MindMeld abstract classes. In get_text_preparation_pipeline() we first create a default TextPreparationPipeline using the TextPreparationPipelineFactory. This factory class uses the specifications in the config for the application which is identified by the current path. A series of setter methods are used to update components. Finally, the modified pipeline is returned.

root/text_preparation_pipeline.py
from mindmeld.text_preparation.text_preparation_pipeline import TextPreparationPipelineFactory
from mindmeld.text_preparation.preprocessors import Preprocessor
from mindmeld.text_preparation.stemmers import Stemmer
from mindmeld.text_preparation.normalizers import Normalizer

class ASRPreprocessor(Preprocessor):
    def process(self, text):
        return text.replace("croissant ready", "Prasanth Reddy")

class GerundSuffixStemmer(Stemmer):
    def stem_word(self, word):
        if word.endswith("ing"):
            return word[:-len("ing")]
        return word

class RemoveExclamation(Normalizer):
    def normalize(self, text):
        return text.lower()

def get_text_preparation_pipeline():
    text_preparation_pipeline = TextPreparationPipelineFactory.create_from_app_config("./")
    text_preparation_pipeline.set_preprocessors([ASRPreprocessor()])
    text_preparation_pipeline.normalizers.append(RemoveExclamation())
    text_preparation_pipeline.set_stemmer(GerundSuffixStemmer())
    return text_preparation_pipeline