mindmeld.text_preparation.tokenizers module

This module contains Tokenizers.

class mindmeld.text_preparation.tokenizers.CharacterTokenizer[source]

Bases: mindmeld.text_preparation.tokenizers.Tokenizer

A Tokenizer that splits text at the character level.

tokenize(text)[source]

Split characters into separate tokens while skipping spaces.

Parameters:text (str) -- The text to tokenize.
Returns:
List of tokens represented as dictionaries.
Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type:tokens (List[Dict])
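
Example (a doctest-style sketch; the output shown is inferred from the documented behavior, not verified):

>>> from mindmeld.text_preparation.tokenizers import CharacterTokenizer
>>> CharacterTokenizer().tokenize("Hi yo")
[{'start': 0, 'text': 'H'}, {'start': 1, 'text': 'i'}, {'start': 3, 'text': 'y'}, {'start': 4, 'text': 'o'}]
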
class mindmeld.text_preparation.tokenizers.LetterTokenizer[source]

Bases: mindmeld.text_preparation.tokenizers.Tokenizer

A Tokenizer that starts a new token when a character follows a space, is a non-Latin character, or belongs to a different Unicode category than the previous character.

static create_tokens(text, token_num_by_char)[source]

Generate token dictionaries from the original text and the per-character token numbers.

Parameters:
  • text (str) -- The text to tokenize.
  • token_num_by_char (List[int]) -- Token number that each character belongs to. Spaces are represented as None. For example: [1, 2, 2, 3, None, 4, None, 5, 5, 5]
Returns:
List of tokens represented as dictionaries.
Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type:tokens (List[Dict])
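
Example (an illustrative sketch; the inputs and output follow the format described above):

>>> from mindmeld.text_preparation.tokenizers import LetterTokenizer
>>> LetterTokenizer.create_tokens("app123 x", [1, 1, 1, 2, 2, 2, None, 3])
[{'start': 0, 'text': 'app'}, {'start': 3, 'text': '123'}, {'start': 7, 'text': 'x'}]
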
static get_token_num_by_char(text)[source]

Determine the token number for each character.

More details about Unicode categories can be found here: http://www.unicode.org/reports/tr44/#General_Category_Values.

Parameters:text (str) -- The text to process.

Returns:
Token number that each character belongs to.
Spaces are represented as None. For example: [1, 2, 2, 3, None, 4, None, 5, 5, 5]
Return type:token_num_by_char (List[int])
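
Example (an illustrative sketch; token numbers follow the category-change rule described above):

>>> from mindmeld.text_preparation.tokenizers import LetterTokenizer
>>> LetterTokenizer.get_token_num_by_char("app123 x")
[1, 1, 1, 2, 2, 2, None, 3]
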
tokenize(text)[source]

Identify tokens in the text and create normalized token dictionaries containing the token text and start index.

Parameters:text (str) -- The text to tokenize.

Returns:
List of tokens represented as dictionaries.
Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type:tokens (List[Dict])
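
Example (a sketch of end-to-end use; output inferred from the description above, not verified):

>>> from mindmeld.text_preparation.tokenizers import LetterTokenizer
>>> LetterTokenizer().tokenize("price 99")
[{'start': 0, 'text': 'price'}, {'start': 6, 'text': '99'}]
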
class mindmeld.text_preparation.tokenizers.NoOpTokenizer[source]

Bases: mindmeld.text_preparation.tokenizers.Tokenizer

A no-op tokenizer.

tokenize(text)[source]

Returns the original text as a list.

Parameters:text (str) -- Input text.

Returns:List of tokens.
Return type:tokens (List[str])
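
Example (a minimal sketch based on the documented List[str] return type; the exact output shape is an assumption):

>>> from mindmeld.text_preparation.tokenizers import NoOpTokenizer
>>> tokens = NoOpTokenizer().tokenize("hello world")  # the input comes back as a one-element list
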
class mindmeld.text_preparation.tokenizers.SpacyTokenizer(language, spacy_model_size='sm')[source]

Bases: mindmeld.text_preparation.tokenizers.Tokenizer

A Tokenizer that uses spaCy to split text into tokens.

tokenize(text)[source]
Parameters:text (str) -- Input text.
Returns:
List of tokens represented as dictionaries.
Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type:tokens (List[Dict])
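
Example (a sketch; this requires the corresponding spaCy model to be available locally, and the exact tokens shown are illustrative since they depend on the model):

>>> from mindmeld.text_preparation.tokenizers import SpacyTokenizer
>>> tokenizer = SpacyTokenizer("en", spacy_model_size="sm")
>>> tokenizer.tokenize("Hello, world!")
[{'start': 0, 'text': 'Hello'}, {'start': 5, 'text': ','}, {'start': 7, 'text': 'world'}, {'start': 12, 'text': '!'}]
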
class mindmeld.text_preparation.tokenizers.Tokenizer[source]

Bases: abc.ABC

Abstract Tokenizer Base Class.

tojson()[source]

Method defined to obtain a recursive JSON representation of a TextPreparationPipeline.

Parameters:None
Returns:JSON representation of the TextPreparationPipeline (dict).
tokenize(text)[source]
Parameters:text (str) -- Input text.
Returns:List of tokens.
Return type:tokens (List[str])
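
Example (a hypothetical subclass, assuming tokenize is the only method a subclass must implement; CommaTokenizer is not part of MindMeld):

>>> from mindmeld.text_preparation.tokenizers import Tokenizer
>>> class CommaTokenizer(Tokenizer):
...     """Hypothetical tokenizer that splits text at commas."""
...     def tokenize(self, text):
...         tokens, start = [], 0
...         for piece in text.split(","):
...             offset = len(piece) - len(piece.lstrip())  # skip leading whitespace
...             if piece.strip():
...                 tokens.append({"start": start + offset, "text": piece.strip()})
...             start += len(piece) + 1  # +1 for the consumed comma
...         return tokens
...
>>> CommaTokenizer().tokenize("a, b,c")
[{'start': 0, 'text': 'a'}, {'start': 3, 'text': 'b'}, {'start': 5, 'text': 'c'}]
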
class mindmeld.text_preparation.tokenizers.TokenizerFactory[source]

Bases: object

Tokenizer Factory Class.

static get_default_tokenizer()[source]

Creates the default tokenizer (WhiteSpaceTokenizer) irrespective of the language of the current application.

Returns:The default tokenizer (WhiteSpaceTokenizer).
Return type:(Tokenizer)
static get_tokenizer(tokenizer: str, language='en', spacy_model_size='sm')[source]

A static method to get a tokenizer.

Parameters:
  • tokenizer (str) -- Name of the desired tokenizer class
  • language (str, optional) -- Language as specified using a 639-1/2 code.
  • spacy_model_size (str, optional) -- Size of the spaCy model to use ("sm", "md", or "lg").
Returns:Tokenizer Class
Return type:(Tokenizer)
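
Example (a sketch; the accepted name strings are assumed here to match the class names in this module):

>>> from mindmeld.text_preparation.tokenizers import TokenizerFactory
>>> tokenizer = TokenizerFactory.get_tokenizer("WhiteSpaceTokenizer", language="en")
>>> tokenizer.tokenize("split on spaces")
[{'start': 0, 'text': 'split'}, {'start': 6, 'text': 'on'}, {'start': 9, 'text': 'spaces'}]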

class mindmeld.text_preparation.tokenizers.WhiteSpaceTokenizer[source]

Bases: mindmeld.text_preparation.tokenizers.Tokenizer

A Tokenizer that splits text at spaces.

tokenize(text)[source]

Identify tokens in the text and create token dictionaries containing the token text and start index.

Parameters:text (str) -- The text to tokenize.

Returns:
List of tokens represented as dictionaries.
Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type:tokens (List[Dict])
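
Example (a doctest-style sketch; output inferred from the documented format):

>>> from mindmeld.text_preparation.tokenizers import WhiteSpaceTokenizer
>>> WhiteSpaceTokenizer().tokenize("hello world")
[{'start': 0, 'text': 'hello'}, {'start': 6, 'text': 'world'}]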