mindmeld.models.nn_utils.input_encoders module

This module consists of encoders whose outputs serve as inputs to PyTorch modules.

class mindmeld.models.nn_utils.input_encoders.AbstractEncoder(**kwargs)[source]

Bases: abc.ABC

Defines a stateful tokenizer. Unlike the tokenizers in the text_preparation_pipeline, tokenizers derived from this abstract class have a state, such as a vocabulary or a trained/pretrained model, that is used for encoding an input text string into a sequence of ids or a sequence of embeddings. These outputs are used by the initial layers of neural nets.

batch_encode(examples: List[str], padding_length: int = None, add_terminals: bool = False, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]

Method that encodes a list of texts into a list of sequences of ids

Parameters:
  • examples – List of text strings that will be encoded as a batch
  • padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are trimmed. If not specified, the maximum length among the tokenized examples is used as padding_length.
  • add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns:

A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.

Return type:

BatchData

Special note on add_terminals when used for sequence classification:
In general, this flag can be True or False. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to include terminal tokens along with pad tokens. Hence, add_terminals defaults to False for encoders built on top of AbstractVocabLookupEncoder and to True for Huggingface-based ones. For sequence classification, encoders based on AbstractVocabLookupEncoder work with either value.
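
For illustration, a minimal sketch of the expected call pattern, using the concrete WhitespaceEncoder documented below and assuming it can be constructed without arguments; the exact contents of the returned BatchData depend on the encoder and the keyword arguments.

    from mindmeld.models.nn_utils.input_encoders import WhitespaceEncoder

    # Fit the encoder's state (its vocabulary) before encoding.
    encoder = WhitespaceEncoder()
    encoder.prepare(examples=["set an alarm", "cancel my alarm"])

    # Encode a batch; shorter sequences are padded up to padding_length.
    batch = encoder.batch_encode(
        examples=["set an alarm", "cancel it"],
        padding_length=5,
        add_terminals=False,
    )
    print(type(batch))  # mindmeld.models.nn_utils.helpers.BatchData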
dump(path: str)[source]

Method that dumps the state (if any) of the tokenizer

Parameters: path – The folder where the state has to be dumped
get_pad_token_idx() → Union[None, int][source]

If there exists a padding token's index in the vocab, it is returned; this is useful while initializing an embedding layer. Otherwise, returns None.

get_vocab() → Dict[source]

Returns a dictionary of vocab tokens as keys and their ids as values

load(path: str)[source]

Method that loads the state (if any) of the tokenizer

Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump with the same file names, hence we use a folder name rather than a filename.
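
A hedged sketch of the dump/load round trip, continuing the previous sketch and assuming load can be called on a freshly constructed instance; the folder name is hypothetical.

    import os

    # Persist the fitted state to a folder (not a single file).
    dump_folder = "encoder_state"  # hypothetical folder name
    os.makedirs(dump_folder, exist_ok=True)
    encoder.dump(dump_folder)

    # Restore the same state into a new encoder instance.
    restored = WhitespaceEncoder()
    restored.load(dump_folder)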
prepare(examples: List[str])[source]

Method that fits the tokenizer and creates a state that can be dumped or used for encoding

Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
number_of_terminal_tokens

Returns the (maximum) number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.

class mindmeld.models.nn_utils.input_encoders.AbstractHuggingfaceTrainableEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractEncoder

Abstract class wrapped around AbstractEncoder that uses Huggingface's tokenizers library to create the encoder's state (a trained tokenizer model).

reference: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html

batch_encode(examples: List[str], padding_length: int = None, add_terminals: bool = True, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]

    output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
    print(output[1].tokens)
    # ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]

Passing the argument padding_length to set the max length for batch encoding is not yet supported for Huggingface tokenizers.

dump(path: str)[source]

Method that dumps the state (if any) of the tokenizer

Parameters: path – The folder where the state has to be dumped
get_pad_token_idx() → int[source]

If there exists a padding token's index in the vocab, it is returned; this is useful while initializing an embedding layer. Otherwise, returns None.

get_vocab() → Dict[source]

Returns a dictionary of vocab tokens as keys and their ids as values

load(path: str)[source]

Method that loads the state (if any) of the tokenizer

Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump with the same file names, hence we use a folder name rather than a filename.
prepare(examples: List[str])[source]

References:
  • Huggingface: tutorials/python/training_from_memory.html @ https://tinyurl.com/6hxrtspa
  • https://huggingface.co/docs/tokenizers/python/latest/index.html

SPECIAL_TOKENS = ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']
number_of_terminal_tokens

Returns the (maximum) number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.

class mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractEncoder

Abstract class wrapped around AbstractEncoder that has a vocabulary lookup as the state.

batch_encode(examples: List[str], padding_length: int = None, add_terminals: bool = False, _return_tokenized_examples: bool = False, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]

Method that encodes a list of texts into a list of sequences of ids

Parameters:
  • examples – List of text strings that will be encoded as a batch
  • padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are trimmed. If not specified, the maximum length among the tokenized examples is used as padding_length.
  • add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns:

A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.

Return type:

BatchData

Special note on add_terminals when used for sequence classification:
In general, this flag can be True or False. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to include terminal tokens along with pad tokens. Hence, add_terminals defaults to False for encoders built on top of AbstractVocabLookupEncoder and to True for Huggingface-based ones. For sequence classification, encoders based on AbstractVocabLookupEncoder work with either value.
dump(path: str)[source]

Method that dumps the state (if any) of the tokenizer

Parameters: path – The folder where the state has to be dumped
get_vocab() → Dict[source]

Returns a dictionary of vocab tokens as keys and their ids as values

load(path: str)[source]

Method that loads the state (if any) of the tokenizer

Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump with the same file names, hence we use a folder name rather than a filename.
prepare(examples: List[str])[source]

Method that fits the tokenizer and creates a state that can be dumped or used for encoding

Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
SPECIAL_TOKENS_DICT = {'end_token': '<END>', 'pad_token': '<PAD>', 'start_token': '<START>', 'unk_token': '<UNK>'}
id2token
number_of_terminal_tokens

Returns the (maximum) number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
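
As a sketch of how the vocabulary-lookup state can be inspected, using the concrete WhitespaceEncoder and assuming that the special tokens from SPECIAL_TOKENS_DICT are added to the fitted vocab and that id2token is the reverse id-to-token mapping:

    from mindmeld.models.nn_utils.input_encoders import WhitespaceEncoder

    encoder = WhitespaceEncoder()
    encoder.prepare(examples=["play some jazz", "play the news"])

    vocab = encoder.get_vocab()             # token -> id mapping
    print("<PAD>" in vocab)                 # special tokens assumed to be part of the vocab
    print(encoder.id2token[vocab["play"]])  # assumed reverse lookup; expected to print 'play'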

class mindmeld.models.nn_utils.input_encoders.BytePairEncodingEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractHuggingfaceTrainableEncoder

Encoder that fits a BPE model based on the input examples
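
A minimal training-and-encoding sketch; the example texts are arbitrary and the subword splits produced by the trained BPE model are not guaranteed.

    from mindmeld.models.nn_utils.input_encoders import BytePairEncodingEncoder

    # prepare() trains a BPE model on the raw texts; batch_encode() then uses it.
    encoder = BytePairEncodingEncoder()
    encoder.prepare(examples=["transfer money to savings", "check my balance"])
    batch = encoder.batch_encode(["transfer my balance"], add_terminals=True)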

class mindmeld.models.nn_utils.input_encoders.CharEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder

A simple tokenizer that tokenizes at the character level.
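
A small sketch, assuming the fitted vocabulary consists of individual characters plus the special tokens:

    from mindmeld.models.nn_utils.input_encoders import CharEncoder

    encoder = CharEncoder()
    encoder.prepare(examples=["hello world"])
    # Expected to contain single-character tokens such as 'h', 'e', 'l', ...
    print(sorted(encoder.get_vocab().keys()))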

class mindmeld.models.nn_utils.input_encoders.HuggingfacePretrainedEncoder(pretrained_model_name_or_path=None, **kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractEncoder

batch_encode(examples: List[str], padding_length: int = None, add_terminals: bool = True, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]

Method that encodes a list of texts into a list of sequences of ids

Parameters:
  • examples – List of text strings that will be encoded as a batch
  • padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are trimmed. If not specified, the maximum length among the tokenized examples is used as padding_length.
  • add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns:

A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.

Return type:

BatchData

Special note on add_terminals when used for sequence classification:
In general, this flag can be True or False. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to include terminal tokens along with pad tokens. Hence, add_terminals defaults to False for encoders built on top of AbstractVocabLookupEncoder and to True for Huggingface-based ones. For sequence classification, encoders based on AbstractVocabLookupEncoder work with either value.
dump(path: str)[source]

Method that dumps the state (if any) of the tokenizer

Parameters: path – The folder where the state has to be dumped
get_pad_token_idx() → int[source]

If there exists a padding token's index in the vocab, it is returned; this is useful while initializing an embedding layer. Otherwise, returns None.

get_vocab() → Dict[source]

Returns a dictionary of vocab tokens as keys and their ids as values

load(path: str)[source]

Method that loads the state (if any) of the tokenizer

Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump with the same file names, hence we use a folder name rather than a filename.
prepare(examples: List[str])[source]

Method that fits the tokenizer and creates a state that can be dumped or used for encoding

Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
number_of_terminal_tokens

Overrides the parent class' definition of the number of terminal tokens.
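
A usage sketch; 'bert-base-uncased' is just an illustrative checkpoint name, and it is assumed here that prepare() only sets up the pretrained tokenizer rather than fitting a vocabulary.

    from mindmeld.models.nn_utils.input_encoders import HuggingfacePretrainedEncoder

    encoder = HuggingfacePretrainedEncoder(
        pretrained_model_name_or_path="bert-base-uncased"  # illustrative checkpoint
    )
    # Assumed to load/configure the pretrained tokenizer rather than fit a vocab.
    encoder.prepare(examples=["book a table for two"])
    batch = encoder.batch_encode(["book a table for two"], add_terminals=True)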

class mindmeld.models.nn_utils.input_encoders.InputEncoderFactory[source]

Bases: object

classmethod get_encoder_cls(tokenizer_type: str)[source]
TOKENIZER_NAME_TO_CLASS = {
    TokenizerType.WHITESPACE_TOKENIZER: WhitespaceEncoder,                           # 'whitespace-tokenizer'
    TokenizerType.CHAR_TOKENIZER: CharEncoder,                                       # 'char-tokenizer'
    TokenizerType.WHITESPACE_AND_CHAR_DUAL_TOKENIZER: WhitespaceAndCharDualEncoder,  # 'whitespace_and_char-tokenizer'
    TokenizerType.BPE_TOKENIZER: BytePairEncodingEncoder,                            # 'bpe-tokenizer'
    TokenizerType.WORDPIECE_TOKENIZER: WordPieceEncoder,                             # 'wordpiece-tokenizer'
    TokenizerType.HUGGINGFACE_PRETRAINED_TOKENIZER: HuggingfacePretrainedEncoder,    # 'huggingface_pretrained-tokenizer'
}
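
A lookup sketch, assuming get_encoder_cls accepts the string value of the corresponding TokenizerType (e.g. 'whitespace-tokenizer' from the mapping above):

    from mindmeld.models.nn_utils.input_encoders import (
        InputEncoderFactory,
        WhitespaceEncoder,
    )

    encoder_cls = InputEncoderFactory.get_encoder_cls(tokenizer_type="whitespace-tokenizer")
    assert encoder_cls is WhitespaceEncoder  # per TOKENIZER_NAME_TO_CLASS
    encoder = encoder_cls()
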
class mindmeld.models.nn_utils.input_encoders.WhitespaceAndCharDualEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder

batch_encode(examples: List[str], char_padding_length: int = None, char_add_terminals: bool = True, add_terminals: bool = False, _return_tokenized_examples: bool = False, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]

Method that encodes a list of texts into a list of sequences of ids

Parameters:
  • examples – List of text strings that will be encoded as a batch
  • padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are trimmed. If not specified, the maximum length among the tokenized examples is used as padding_length.
  • add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns:

A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.

Return type:

BatchData

Special note on add_terminals when used for sequence classification:
In general, this flag can be True or False. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to include terminal tokens along with pad tokens. Hence, add_terminals defaults to False for encoders built on top of AbstractVocabLookupEncoder and to True for Huggingface-based ones. For sequence classification, encoders based on AbstractVocabLookupEncoder work with either value.
dump(path: str)[source]

Method that dumps the state (if any) of the tokenizer

Parameters: path – The folder where the state has to be dumped
get_char_pad_token_idx() → Union[None, int][source]

If there exists a char padding token's index in the vocab, it is returned; this is useful while initializing an embedding layer. Otherwise, returns None.

get_char_vocab() → Dict[source]
load(path: str)[source]

Method that loads the state (if any) of the tokenizer

Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump with the same file names, hence we use a folder name rather than a filename.
prepare(examples: List[str])[source]

Method that fits the tokenizer and creates a state that can be dumped or used for encoding

Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
SPECIAL_CHAR_TOKENS_DICT = {'char_end_token': '<CHAR_END>', 'char_pad_token': '<CHAR_PAD>', 'char_start_token': '<CHAR_START>', 'char_unk_token': '<CHAR_UNK>'}
char_id2token
number_of_char_terminal_tokens

Returns the number of char terminal tokens used by the encoder during batch encoding when add_terminals is set to True.

number_of_terminal_tokens

Returns the number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
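
A dual-encoding sketch using the char-specific keyword arguments from the batch_encode signature above; the layout of the word-level versus char-level tensors inside the returned BatchData is not shown here.

    from mindmeld.models.nn_utils.input_encoders import WhitespaceAndCharDualEncoder

    encoder = WhitespaceAndCharDualEncoder()
    encoder.prepare(examples=["turn on the lights"])
    batch = encoder.batch_encode(
        ["turn off the lights"],
        char_padding_length=8,    # assumed to cap the character sequence length per token
        char_add_terminals=True,  # add char-level terminal tokens
        add_terminals=False,      # word-level terminals off (the default)
    )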

class mindmeld.models.nn_utils.input_encoders.WhitespaceEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder

Encoder that tokenizes on whitespace. Not useful for languages that do not delimit words with whitespace, such as Chinese.

class mindmeld.models.nn_utils.input_encoders.WordPieceEncoder(**kwargs)[source]

Bases: mindmeld.models.nn_utils.input_encoders.AbstractHuggingfaceTrainableEncoder

Encoder that fits a WordPiece model based on the input examples