Working with the Tokenizer

MindMeld's tokenizer handles both the tokenization and normalization of raw text in your application. These components are configurable based on language-specific needs.

Tokenizer Configuration

The DEFAULT_TOKENIZER_CONFIG shown below is the default configuration for the Tokenizer. To use a custom configuration, duplicate the default config in config.py and rename it to TOKENIZER_CONFIG. If no custom configuration is defined, the default is used.

DEFAULT_TOKENIZER_CONFIG = {
    "allowed_patterns": default_allowed_patterns,
    "tokenizer": "WhiteSpaceTokenizer",
    "normalizer": "ASCIIFold",
}

Let's define the parameters in the Tokenizer config:

'allowed_patterns' (list): A list of custom regular expression patterns (or combinations of patterns) that define which characters to keep. (If allowed_patterns is not provided, the default values are used.) MindMeld combines and compiles this list internally, and the resulting pattern is applied to filter characters out of the user input queries. For example,

TOKENIZER_CONFIG = {
    "allowed_patterns": [r'\w+'],
}

will allow the system to capture alphanumeric strings and

TOKENIZER_CONFIG = {
    "allowed_patterns": [r'(\w+\.)$', r'(\w+\?)$'],
}

allows the system to capture only tokens that end with either a question mark or a period.
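The exact combination logic is internal to MindMeld, but the effect is easy to approximate with Python's re module. The sketch below is purely illustrative (it is not MindMeld's implementation): it joins the allowed patterns with alternation and keeps only the whitespace-separated tokens that match.

import re

# Illustrative only: join the allowed patterns with '|' and keep the
# whitespace-separated tokens that match the combined pattern.
allowed_patterns = [r'(\w+\.)$', r'(\w+\?)$']
combined = re.compile('|'.join(allowed_patterns))

query = "Is MindMeld open source? Yes it is."
tokens = [t for t in query.split() if combined.search(t)]
print(tokens)  # ['source?', 'is.']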

'tokenizer' (str): The tokenization method used to split raw queries. Options include WhiteSpaceTokenizer, CharacterTokenizer, LetterTokenizer, and SpacyTokenizer.

'normalizer' (str): The method used to normalize raw queries. Options include ASCIIFold and Unicode character normalization methods such as NFD, NFC, NFKD, and NFKC. For more information on Unicode character normalization, visit the Unicode Documentation. Currently, only one normalizer can be selected at a time.
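Putting the parameters together, a custom config in config.py might look like the following. This is a minimal sketch; the pattern list, tokenizer, and normalizer names are simply examples of the options described above.

# config.py of your MindMeld application
TOKENIZER_CONFIG = {
    "allowed_patterns": [r'\w+'],       # keep alphanumeric tokens
    "tokenizer": "CharacterTokenizer",  # one of the tokenizer options above
    "normalizer": "NFKC",               # one of the normalizer options above
}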

Tokenizer Methods

White Space Tokenizer

The WhiteSpaceTokenizer splits up a sentence by whitespace characters. For example, we can run:

from mindmeld.text_preparation.tokenizers import WhiteSpaceTokenizer

sentence = "MindMeld is a Conversational AI Platform."
white_space_tokenizer = WhiteSpaceTokenizer()
tokens = white_space_tokenizer.tokenize(sentence)
print([t['text'] for t in tokens])

We find that the resulting tokens are split by whitespace as expected.

['MindMeld', 'is', 'a', 'Conversational', 'AI', 'Platform.']

Character Tokenizer

The CharacterTokenizer splits up a sentence into its individual characters. This can be helpful for languages such as Japanese. Let's break apart the Japanese translation of the phrase "The tall man":

from mindmeld.text_preparation.tokenizers import CharacterTokenizer

sentence_ja = "背の高い男性"
character_tokenizer = CharacterTokenizer()
tokens = character_tokenizer.tokenize(sentence_ja)
print([t['text'] for t in tokens])

We see that the original text is split at the character level.

['背', 'の', '高', 'い', '男', '性']

Letter Tokenizer

The LetterTokenizer splits text into a separate token if the character follows a space, is a non-Latin character, or belongs to a different Unicode category than the previous character.

This can be helpful to keep characters of the same type together. Let's look at an example with numbers in a Japanese sentence, "1年は365日". This sentence translates to "One year has 365 days".

from mindmeld.text_preparation.tokenizers import LetterTokenizer

sentence_ja = "1年は365日"
letter_tokenizer = LetterTokenizer()
tokens = letter_tokenizer.tokenize(sentence_ja)
print([t['text'] for t in tokens])

We see that the original text is split at the character level for non-Latin characters, but the number "365" remains a single, unsegmented token.

['1', '年', 'は', '365', '日']

Spacy Tokenizer

The SpacyTokenizer splits up a sentence using Spacy's language models. Supported languages include English (en), Spanish (es), French (fr), German (de), Danish (da), Greek (el), Portuguese (pt), Lithuanian (lt), Norwegian Bokmal (nb), Romanian (ro), Polish (pl), Italian (it), Japanese (ja), Chinese (zh), and Dutch (nl). If the required Spacy model is not already present, it will be downloaded automatically at runtime. Let's use the SpacyTokenizer to tokenize the Japanese translation of "The gentleman is gone, no one knows why it happened!":

from mindmeld.text_preparation.tokenizers import SpacyTokenizer

sentence_ja = "紳士が過ぎ去った、 なぜそれが起こったのか誰にも分かりません!"
spacy_tokenizer_ja = SpacyTokenizer(language="ja", spacy_model_size="lg")
tokens = spacy_tokenizer_ja.tokenize(sentence_ja)
print([t['text'] for t in tokens])

We see that the original text is split semantically and not simply by whitespace.

['紳士', 'が', '過ぎ', '去っ', 'た', '、', 'なぜ', 'それ', 'が', '起こっ', 'た', 'の', 'か', '誰', 'に', 'も', '分かり', 'ませ', 'ん', '!']

Normalization Methods

Default MindMeld Normalization

By default, MindMeld's Tokenizer retains the following special characters in addition to alphanumeric characters and spaces:

  1. All currency symbols in UNICODE.
  2. Entity annotation symbols {, }, |.
  3. Decimal point in numeric values (e.g. 124.45).
  4. Apostrophes within tokens, such as O'Reilly. Apostrophes at the beginning or end of tokens are removed, for example, Dennis' or 'Tis.

Setting the argument keep_special_chars=False in the Tokenizer removes all special characters.

ASCII Fold Normalization

The ASCIIFold normalizer converts numeric, symbolic, and alphabetic characters that are not in the first 127 ASCII characters (the Basic Latin Unicode block) into an ASCII equivalent, where possible.

For example, we can normalize the following Spanish sentence with several accented characters:

from mindmeld.text_preparation.normalizers import ASCIIFold

sentence_es = "Ha pasado un caballero, ¡quién sabe por qué pasó!"
ascii_fold_normalizer = ASCIIFold()
normalized_text = ascii_fold_normalizer.normalize(sentence_es)
print(normalized_text)

The accents are removed: the accented characters have been replaced with compatible ASCII equivalents.

'Ha pasado un caballero, ¡quien sabe por que paso!'

Unicode Character Normalization

Unicode Character Normalization includes techniques such as NFD, NFC, NFKD, and NFKC. These methods break down characters into their canonical or compatible character equivalents as defined by Unicode. Let's take a look at an example. Say we are trying to normalize the word quién using NFKD.

from mindmeld.text_preparation.normalizers import NFKD

nfkd_normalizer = NFKD()
text = "quién"
normalized_text = nfkd_normalizer.normalize(text)

Interestingly, while the normalized text looks identical to the original text, it is not quite the same.

>>> print(text, normalized_text)
quién quién
>>> print(text == normalized_text)
False

We can print the character values for each text and observe that the normalization has actually changed the representation of é.

>>> print([ord(c) for c in text])
[113, 117, 105, 233, 110]
>>> print([ord(c) for c in normalized_text])
[113, 117, 105, 101, 769, 110]
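The same decomposition can be reproduced with Python's standard unicodedata module, which is handy for sanity-checking what a given normalization form does. This snippet is an illustrative aside and does not use MindMeld.

import unicodedata

text = "quién"
nfkd_text = unicodedata.normalize("NFKD", text)
nfc_text = unicodedata.normalize("NFC", nfkd_text)

# NFKD splits 'é' into 'e' plus a combining acute accent (U+0301);
# NFC recomposes it back into the single code point U+00E9.
print([hex(ord(c)) for c in nfkd_text])  # ['0x71', '0x75', '0x69', '0x65', '0x301', '0x6e']
print(nfc_text == text)                  # True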