mindmeld.models.nn_utils.helpers module

Default params used by various sequence and token classification classes

class mindmeld.models.nn_utils.helpers.BatchData(**kwargs)[source]

Bases: mindmeld.core.Bunch

A dictionary-like object that exposes its keys as attributes and holds the various inputs and outputs of neural models for a batch of data, such as tensor encodings, input lengths, etc.

Following is a description of the different keys that serve as inputs to neural models (a usage sketch follows the output keys below):
  • seq_lengths: The number of tokens in each example before padding tokens are added. This
    count includes terminal tokens if they are added before padding. For encoders that split words into sub-words, seq_lengths still counts words (not sub-words) plus any added terminal tokens; this is useful for token classifiers, which require token-level (i.e. word-level) outputs, as well as for sequence classifier models such as LSTMs.
  • split_lengths: The length of each sub-group (i.e. group of sub-words) in each example.
    By definition, these counts exclude terminal tokens. For encoders with sub-word tokenization, this is fine-grained information complementing the seq_lengths values. It is again useful for token classifiers, which can flexibly choose between the first sub-word's representation or a mean/max pool over the sub-words' representations to obtain word-level representations. For lookup-table based encoders where words are not broken into sub-words, split_lengths is simply a sequence of ones whose sum is the number of words, excluding terminal and padding tokens.
  • seq_ids (only for non-pretrained models that require training an embedding layer): The
    encoded ids used for embedding lookup, including terminal special tokens if requested, and with padding.
  • attention_masks (only for Huggingface trainable encoders): Boolean flags corresponding
    to each id in seq_ids, set to 0 for padding tokens and 1 otherwise.
  • hgf_encodings (only for Huggingface pretrained encoders): A dict of outputs from a
    pretrained language model encoder from Huggingface (abbreviated as hgf).
  • char_seq_ids (only for dual tokenizers): Similar to seq_ids, but produced by the char
    tokenizer in case of dual tokenization.
  • char_seq_lengths (only for dual tokenizers): Similar to seq_lengths, but produced by the
    char tokenizer in case of dual tokenization. Like seq_lengths, these counts include terminal special tokens from the char vocab whenever they are added.

Following is a description of the different keys that are output by neural models:
  • seq_embs: The embeddings produced before the final classification (dense) layers by
    sequence-classification classes (generally of shape [batch_size, emb_dim]).
  • token_embs: The embeddings produced before the final classification (dense) layers by
    token-classification classes (generally of shape [batch_size, seq_length, emb_dim]).
  • logits: Classification scores (before SoftMax).
  • loss: Classification loss object.
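The following is a minimal, illustrative sketch of a BatchData for a lookup-table based (whitespace-tokenized) encoder; the exact keys, dtypes, and shapes depend on the encoder in use, and the id values shown here are made up.

    import torch
    from mindmeld.models.nn_utils.helpers import BatchData

    # Illustrative batch of two examples, e.g. "book a table" and "hi",
    # padded to a maximum length of 3 (padding id assumed to be 0).
    batch = BatchData(
        seq_ids=torch.tensor([[4, 7, 9], [5, 0, 0]]),
        seq_lengths=torch.tensor([3, 1]),
        # No sub-word splitting here, so each word maps to a single sub-group.
        split_lengths=[torch.tensor([1, 1, 1]), torch.tensor([1])],
    )

    # Keys are accessible as attributes because BatchData is a Bunch.
    assert batch.seq_lengths[0].item() == 3
    assert batch.seq_ids.shape == (2, 3)
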
class mindmeld.models.nn_utils.helpers.ClassificationType[source]

Bases: enum.Enum

An enumeration.

TAGGER = 'tagger'
TEXT = 'text'
class mindmeld.models.nn_utils.helpers.EmbedderType[source]

Bases: enum.Enum

An enumeration.

BERT = 'bert'
GLOVE = 'glove'
NONE = None
class mindmeld.models.nn_utils.helpers.SequenceClassificationType[source]

Bases: enum.Enum

An enumeration.

CNN = 'cnn'
EMBEDDER = 'embedder'
LSTM = 'lstm'
class mindmeld.models.nn_utils.helpers.TokenClassificationType[source]

Bases: enum.Enum

An enumeration.

CNN_LSTM = 'cnn-lstm'
EMBEDDER = 'embedder'
LSTM = 'lstm-pytorch'
LSTM_LSTM = 'lstm-lstm'
class mindmeld.models.nn_utils.helpers.TokenizerType[source]

Bases: enum.Enum

An enumeration.

BPE_TOKENIZER = 'bpe-tokenizer'
CHAR_TOKENIZER = 'char-tokenizer'
HUGGINGFACE_PRETRAINED_TOKENIZER = 'huggingface_pretrained-tokenizer'
WHITESPACE_AND_CHAR_DUAL_TOKENIZER = 'whitespace_and_char-tokenizer'
WHITESPACE_TOKENIZER = 'whitespace-tokenizer'
WORDPIECE_TOKENIZER = 'wordpiece-tokenizer'
class mindmeld.models.nn_utils.helpers.ValidationMetricType[source]

Bases: enum.Enum

An enumeration.

ACCURACY = 'accuracy'
F1 = 'f1'
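
The string values of these enums are presumably what appear in model parameter dictionaries; standard Python Enum behavior lets you map between the strings and the members, as in this small sketch:

    from mindmeld.models.nn_utils.helpers import TokenizerType, ValidationMetricType

    # Look up an enum member from its string value, and read the value back.
    assert TokenizerType("whitespace-tokenizer") is TokenizerType.WHITESPACE_TOKENIZER
    assert ValidationMetricType.F1.value == "f1"
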
mindmeld.models.nn_utils.helpers.get_default_params(class_name: str)[source]

Returns all the default params for the given class name

Parameters: class_name (str) -- A (child) class name from sequence_classification.py or token_classification.py
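
A usage sketch, assuming the function returns a plain dict of defaults; 'LstmSequenceClassification' is a hypothetical name, so substitute an actual (child) class name from sequence_classification.py or token_classification.py.

    from mindmeld.models.nn_utils.helpers import get_default_params

    # Hypothetical class name used only for illustration.
    defaults = get_default_params("LstmSequenceClassification")

    # Typical pattern: start from the defaults and override selected entries
    # (assuming the returned object behaves like a dict; "patience" is a hypothetical key).
    params = {**defaults, "patience": 5}
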
mindmeld.models.nn_utils.helpers.get_disk_space_of_model(pytorch_module)[source]

Returns the disk space occupied by a pytorch module, in MB. This includes all weights (trainable and non-trainable) of the module.

Parameters: pytorch_module -- a pytorch neural network module derived from torch.nn.Module
Returns: The size of the model when dumped to disk
Return type: size (float)
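
A sketch of how such a size estimate can be approximated by summing the byte sizes of all parameters and buffers; the library's own implementation may differ (for example, by serializing the module to a temporary file and measuring it).

    import torch

    def approx_model_size_mb(pytorch_module: torch.nn.Module) -> float:
        """Approximate on-disk size in MB from parameter and buffer byte counts."""
        param_bytes = sum(p.nelement() * p.element_size() for p in pytorch_module.parameters())
        buffer_bytes = sum(b.nelement() * b.element_size() for b in pytorch_module.buffers())
        return (param_bytes + buffer_bytes) / (1024 ** 2)

    # Example: a small linear layer (16*8 weights + 8 biases, float32).
    print(approx_model_size_mb(torch.nn.Linear(16, 8)))
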
mindmeld.models.nn_utils.helpers.get_num_weights_of_model(pytorch_module)[source]

Returns the number of trainable parameters and the total number of parameters in a pytorch module. Returning both helps to sanity-check whether layers that are meant to be frozen are actually excluded from training.

Parameters: pytorch_module -- a pytorch neural network module derived from torch.nn.Module
Returns: A tuple of the number of trainable params and the total number of params of the pytorch module
Return type: number_of_params (tuple)
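
A sketch of how the two counts can be computed with plain PyTorch, together with a check that a frozen layer drops out of the trainable count; the function's actual internals may differ.

    import torch

    def count_params(pytorch_module: torch.nn.Module):
        """Return (trainable_params, total_params) for a module."""
        trainable = sum(p.numel() for p in pytorch_module.parameters() if p.requires_grad)
        total = sum(p.numel() for p in pytorch_module.parameters())
        return trainable, total

    # Example: freeze the first layer and verify it is excluded from the trainable count.
    model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.Linear(8, 2))
    for p in model[0].parameters():
        p.requires_grad = False
    trainable, total = count_params(model)
    assert trainable < total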