mindmeld.models.taggers.pytorch_crf module

class mindmeld.models.taggers.pytorch_crf.CRFModel[source]

Bases: torch.nn.modules.module.Module

PyTorch Model Class for Conditional Random Fields

build_params(num_features, num_classes)[source]

Sets the parameters for the layers in the PyTorch CRF model. Naming convention is kept consistent with the CRFSuite implementation.

Parameters:
  • num_features (int) – Number of features to use in a FeatureHasher feature extractor.
  • num_classes (int) – Number of classes in the tagging model.
compute_marginal_probabilities(inputs, mask)[source]

Calculates the marginal probability of each tag for each token. Implementation is borrowed from https://github.com/kmkurn/pytorch-crf/pull/37.

Parameters:
  • inputs (torch.Tensor) – Batch of padded input tensors.
  • mask (torch.Tensor) – Batch of mask tensors to account for padded inputs.
Returns:

Marginal probabilities of every tag for each token in every sequence.
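
Marginal tag probabilities in a linear-chain CRF are typically obtained with the forward-backward algorithm. The following is a minimal pure-Python sketch of that idea for a single unpadded sequence, not the code used by this module. The function name `crf_marginals` and the plain nested-list inputs are assumptions for illustration; start/end transition scores, batching and masking are omitted.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_marginals(emissions, transitions):
    """Hypothetical helper: per-token tag marginals for one sequence.

    emissions[t][k]   -- score of tag k at position t
    transitions[i][j] -- score of moving from tag i to tag j
    """
    num_tokens, num_tags = len(emissions), len(emissions[0])

    # Forward pass: alpha[t][j] is the log-score of all prefixes ending in tag j.
    alpha = [list(emissions[0])]
    for t in range(1, num_tokens):
        alpha.append([
            emissions[t][j]
            + logsumexp([alpha[t - 1][i] + transitions[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ])

    # Backward pass: beta[t][i] is the log-score of all suffixes following tag i.
    beta = [[0.0] * num_tags for _ in range(num_tokens)]
    for t in range(num_tokens - 2, -1, -1):
        beta[t] = [
            logsumexp([
                transitions[i][j] + emissions[t + 1][j] + beta[t + 1][j]
                for j in range(num_tags)
            ])
            for i in range(num_tags)
        ]

    # Log partition function, then normalise alpha * beta at every position.
    log_z = logsumexp(alpha[-1])
    return [
        [math.exp(alpha[t][k] + beta[t][k] - log_z) for k in range(num_tags)]
        for t in range(num_tokens)
    ]


# Three tokens, two tags; each row of the result sums to 1.
marginals = crf_marginals(
    emissions=[[0.5, 1.0], [0.2, -0.3], [1.5, 0.1]],
    transitions=[[0.3, -0.2], [0.1, 0.4]],
)
```

Working in log space avoids overflow when scores are large, which is why the sketch uses `logsumexp` rather than summing exponentials directly.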

compute_regularized_loss(l1)[source]
fit(X, y)[source]

Trains the entire PyTorch CRF model.

Parameters:
  • X (list of list of dicts) – Generally a list of feature vectors, one for each training example
  • y (list of lists) – A list of classification labels (encoded by the label_encoder, NOT MindMeld entity objects)
forward(inputs, targets, mask, drop_input=0.0)[source]

The forward pass of the PyTorch CRF model. Returns the predictions or loss depending on whether labels are passed or not.

Parameters:
  • inputs (torch.Tensor) – Batch of input tensors to pass through the model.
  • targets (torch.Tensor or None) – Batch of label tensors.
  • mask (torch.Tensor) – Batch of mask tensors to account for padded inputs.
  • drop_input (float) – Percentage of features to drop from the input.
Returns:

Loss from training or predictions for input sequence.

Return type:

loss (torch.Tensor or list)

get_dataloader(X, y, is_train)[source]

Creates and returns the PyTorch dataloader instance for the training/test data.

Parameters:
  • X (list of list of dicts) – Generally a list of feature vectors, one for each training example
  • y (list of lists or None) – A list of classification labels (encoded by the label_encoder, NOT MindMeld entity objects)
  • is_train (bool) – Whether the dataloader returned is going to be used for training.
Returns:

PyTorch dataloader object that can be used to iterate over the data.

Return type:

torch_dataloader (torch.utils.data.dataloader.DataLoader)

get_encoder()[source]
get_params()[source]

Get the parameters for the PyTorch CRF model.

load_best_weights_path(path)[source]

Loads the best weights of the model from a path in the .generated folder.

Parameters:path (str) – Path to load the best model weights from.
predict(X)[source]

Gets predicted labels for the data.

Parameters:X (list of list of dicts) – Feature vectors for data to predict labels on.
Returns:Predictions for each token in each sequence.
Return type:preds (list of lists)
predict_marginals(X)[source]

Gets marginal probabilities for each tag per token for each sequence.

Parameters:X (list of list of dicts) – Feature vectors for data to predict marginal probabilities on.
Returns:Returns the probability of every tag for each token in a sequence.
Return type:marginals_dict (list of list of dicts)
run_predictions(dataloader, calc_f1=False)[source]

Gets predictions for the data by running an inference pass of the model.

Parameters:
  • dataloader (torch.utils.data.dataloader.DataLoader) – Dataloader for test/validation data.
  • calc_f1 (bool) – If True, return the dev F1 score; otherwise return predictions for each token.
Returns:Dev F1 score or predictions for each token in a sequence.
save_best_weights_path(path)[source]

Saves the best weights of the model to a path in the .generated folder.

Parameters:path (str) – Path to save the best model weights.
set_encoder(encoder)[source]
set_params(feat_type='hash', feat_num=50000, stratify_train_val_split=True, drop_input=0.2, batch_size=8, number_of_epochs=100, patience=3, dev_split_ratio=0.2, optimizer='sgd', l1_weight=0, l2_weight=0, random_state=None, **kwargs)[source]

Sets the parameters for the PyTorch CRF model and validates them.

Parameters:
  • feat_type (str) – The type of feature extractor. Supported options are ‘dict’ and ‘hash’.
  • feat_num (int) – The number of features to be used by the FeatureHasher. Not supported with the DictVectorizer.
  • stratify_train_val_split (bool) – Flag to check whether inputs should be stratified during train-dev split.
  • drop_input (float) – The percentage at which to apply a dropout to the input features.
  • batch_size (int) – Training batch size for the model.
  • number_of_epochs (int) – The number of epochs (passes over the training data) to train the model for.
  • patience (int) – Number of epochs to wait for before stopping training if dev score does not improve.
  • dev_split_ratio (float) – Percentage of training data to be used for validation.
  • optimizer (str) – Type of optimizer used for the model. Supported options are ‘sgd’ and ‘adam’.
  • random_state (int) – Integer value to set random seeds for deterministic output.
  • l1_weight (float) – Regularization weight for L1-penalty
  • l2_weight (float) – Regularization weight for L2-penalty
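
The kind of check that parameter validation implies can be sketched as follows. This is an illustration only: the helper name `validate_crf_params` is hypothetical, the allowed values for `feat_type` and `optimizer` come from the list above, and the numeric ranges are assumptions rather than the rules MindMeld enforces.

```python
def validate_crf_params(params):
    """Hypothetical validator; raises ValueError listing every invalid value."""
    errors = []

    # Supported options as documented for set_params.
    if params.get("feat_type", "hash") not in {"dict", "hash"}:
        errors.append("feat_type must be 'dict' or 'hash'")
    if params.get("optimizer", "sgd") not in {"sgd", "adam"}:
        errors.append("optimizer must be 'sgd' or 'adam'")

    # Assumed ranges (not confirmed by the documentation above).
    if not 0.0 <= params.get("drop_input", 0.2) < 1.0:
        errors.append("drop_input must be in [0, 1)")
    if not 0.0 < params.get("dev_split_ratio", 0.2) < 1.0:
        errors.append("dev_split_ratio must be in (0, 1)")
    for name in ("l1_weight", "l2_weight"):
        if params.get(name, 0) < 0:
            errors.append(f"{name} must be non-negative")

    if errors:
        raise ValueError("; ".join(errors))


# The documented defaults pass without raising.
validate_crf_params({"feat_type": "hash", "optimizer": "sgd", "drop_input": 0.2})
```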
set_random_states()[source]

Sets the random seeds across all libraries used for deterministic output.

train_one_epoch(train_dataloader)[source]

Contains the training code for one epoch.

Parameters:train_dataloader (torch.utils.data.dataloader.DataLoader) – Dataloader for training data
training_loop(train_dataloader, dev_dataloader, tmp_save_path)[source]

Contains the training loop, in which the model is trained for the specified number of epochs.

Parameters:
  • train_dataloader (torch.utils.data.dataloader.DataLoader) – Dataloader for training data
  • dev_dataloader (torch.utils.data.dataloader.DataLoader) – Dataloader for validation data
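
The patience setting described under set_params suggests patience-based early stopping on the dev score. A minimal sketch of that logic, assuming that only a strict improvement resets the counter, is shown below. The function name `early_stopping_epoch` is hypothetical, and the list of scores stands in for per-epoch dev evaluations.

```python
def early_stopping_epoch(dev_scores, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch dev scores.

    Training stops once `patience` consecutive epochs fail to beat the best
    dev score seen so far (assumption: ties do not count as improvement).
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            # New best: this is where the best weights would be saved.
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Ran through every epoch without exhausting patience.
    return best_epoch, len(dev_scores) - 1


best_epoch, stop_epoch = early_stopping_epoch(
    [0.50, 0.60, 0.58, 0.59, 0.57, 0.70], patience=3
)
# best_epoch == 1, stop_epoch == 4: the later 0.70 is never reached.
```

The example also shows the usual trade-off: a small patience saves training time but can stop before a later improvement.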
validate_params(kwargs)[source]

Validate the argument values saved into the CRF model.

class mindmeld.models.taggers.pytorch_crf.Encoder(feature_extractor='hash', num_feats=50000)[source]

Bases: object

Encoder class that is responsible for the feature extraction and label encoding for the PyTorch model.

encode_padded_input(current_seq_len, max_seq_len, x)[source]

Pads the input sequence feature vectors to the max sequence length and returns the sparse torch tensor representation.

Parameters:
  • current_seq_len (int) – Number of tokens in the current example sequence.
  • max_seq_len (int) – Max number of tokens in an example sequence in the current dataset.
  • x (list of dicts) – List of feature vectors, one for each token in the example sequence.
Returns:

Sparse COO tensor representation of padded input sequence

Return type:

sparse_feat_tensor (torch.Tensor)

encode_padded_label(current_seq_len, max_seq_len, y)[source]

Pads the label sequences to the max sequence length and returns the torch tensor representation.

Parameters:
  • current_seq_len (int) – Number of tokens in the current example sequence.
  • max_seq_len (int) – Max number of tokens in an example sequence in the current dataset.
  • y (list of dicts) – List of labels, one for each token in the example sequence.
Returns:

PyTorch tensor representation of padded label sequence

Return type:

label_tensor (torch.Tensor)
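
Padding and masking go together: labels are padded to a common length, and a mask records which positions hold real tokens. The sketch below shows the idea with plain Python lists instead of tensors. The function name `pad_labels` and the padding value `0` are assumptions for illustration.

```python
def pad_labels(label_seqs, pad_value=0):
    """Pad every label sequence to the longest length and build matching masks.

    Returns (padded, masks), where masks[i][t] is True for real tokens and
    False for padding positions.
    """
    max_seq_len = max(len(seq) for seq in label_seqs)
    padded, masks = [], []
    for seq in label_seqs:
        num_pad = max_seq_len - len(seq)
        padded.append(list(seq) + [pad_value] * num_pad)
        masks.append([True] * len(seq) + [False] * num_pad)
    return padded, masks


padded, masks = pad_labels([[1, 2, 3], [4]])
# padded -> [[1, 2, 3], [4, 0, 0]]
# masks  -> [[True, True, True], [True, False, False]]
```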

get_feats_and_classes()[source]
get_padded_transformed_tensors(inputs_or_labels, seq_lens, is_label)[source]

Returns the encoded and padded sparse tensor representations of the inputs/labels.

Parameters:
  • inputs_or_labels (list of list of dicts) – Generally a list of feature vectors, one for each training example
  • seq_lens (list) – A list of number of tokens in each sequence
  • is_label (bool) – Flag to indicate whether we are encoding input features or labels.
Returns:

PyTorch tensor representation of padded input sequence/labels.

Return type:

encoded_tensors (list of torch.Tensor)

get_tensor_data(feat_dicts, labels=None, fit=False)[source]

Gets the feature dicts and labels transformed into padded PyTorch sparse tensor data.

Parameters:
  • feat_dicts (list of list of dicts) – Generally a list of feature vectors, one for each training example
  • labels (list of lists) – A list of classification labels
  • fit (bool) – Flag indicating whether to fit the feature extractor and label encoder.
Returns:

  • encoded_tensor_inputs (list of torch.Tensor) – List of sparse COO tensor representations of the encoded, padded input sequences.
  • seq_lens (list of ints) – List of the actual length of each sequence.
  • encoded_tensor_labels (list of torch.Tensor) – List of tensor representations of the encoded, padded label sequences.

class mindmeld.models.taggers.pytorch_crf.TaggerDataset(inputs, seq_lens, labels=None)[source]

Bases: torch.utils.data.dataset.Dataset

PyTorch Dataset class used to handle tagger inputs, labels and mask

mindmeld.models.taggers.pytorch_crf.collate_tensors_and_masks(sequence)[source]

Custom collate function that ensures proper batching of sparse tensors, labels and masks.

Parameters:sequence (list of tuples) – Each tuple contains one input tensor, one mask tensor and one label tensor.
Returns:Batched representation of input, label and mask sequences.
mindmeld.models.taggers.pytorch_crf.compute_l1_params(w)[source]
mindmeld.models.taggers.pytorch_crf.compute_l2_params(w)[source]
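
These two helpers are undocumented. Assuming they compute the standard L1 and L2 penalty terms that l1_weight and l2_weight scale, the arithmetic can be sketched in plain Python as below. The function names here are hypothetical, and the flat list of floats stands in for a weight tensor.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights (standard L1 penalty)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights (standard L2 penalty)."""
    return sum(w * w for w in weights)


def regularized_loss(loss, weights, l1_weight=0.0, l2_weight=0.0):
    """Add the weighted penalties to a base loss value."""
    return loss + l1_weight * l1_penalty(weights) + l2_weight * l2_penalty(weights)


total = regularized_loss(1.0, [1.0, -2.0, 3.0], l1_weight=0.5, l2_weight=0.25)
# 1.0 + 0.5 * 6.0 + 0.25 * 14.0 = 7.5
```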
mindmeld.models.taggers.pytorch_crf.diag_concat_coo_tensors(tensors)[source]

Concatenates sparse PyTorch COO tensors diagonally so that they can be processed in batches.

Parameters:tensors (tuple of torch.Tensor) – Tuple of sparse COO tensors to diagonally concatenate.
Returns:A single sparse COO tensor that acts as a single batch.
Return type:stacked_tensor (torch.Tensor)
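
Diagonal concatenation can be pictured as placing each sparse matrix in its own block along the diagonal of a larger one, which only requires shifting the coordinates. The sketch below illustrates this on raw COO coordinates in plain Python. It assumes block-diagonal placement with both row and column offsets, and the function name `diag_concat_coo` and the tuple layout are hypothetical.

```python
def diag_concat_coo(blocks):
    """Block-diagonally concatenate sparse matrices given in COO form.

    Each block is (rows, cols, values, (n_rows, n_cols)). The result uses the
    same layout, with every block shifted past the blocks before it.
    """
    rows, cols, values = [], [], []
    row_offset = col_offset = 0
    for b_rows, b_cols, b_values, (n_rows, n_cols) in blocks:
        # Shift this block's coordinates into its own diagonal slot.
        rows.extend(r + row_offset for r in b_rows)
        cols.extend(c + col_offset for c in b_cols)
        values.extend(b_values)
        row_offset += n_rows
        col_offset += n_cols
    return rows, cols, values, (row_offset, col_offset)


block_a = ([0, 1], [1, 0], [1.0, 2.0], (2, 2))
block_b = ([0], [2], [3.0], (1, 3))
stacked = diag_concat_coo([block_a, block_b])
# stacked -> ([0, 1, 2], [1, 0, 4], [1.0, 2.0, 3.0], (3, 5))
```

Because only the indices change, no dense intermediate is needed. The resulting coordinates, values and shape are the inputs PyTorch's `torch.sparse_coo_tensor` expects.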
mindmeld.models.taggers.pytorch_crf.stratify_input(X, y)[source]

Gets the input and labels ready for stratification into train and dev data. Stratification is done based on the presence of unique labels for each sequence. It also duplicates the unique samples across input and labels to ensure that it doesn’t fail with scikit-learn’s train_test_split.

Parameters:
  • X (list) – Generally a list of feature vectors, one for each training example
  • y (list) – A list of classification labels (encoded by the label_encoder, NOT MindMeld entity objects)
Returns:

  • str_X (list) – List of feature vectors, ready for stratification.
  • str_y (list) – List of labels, ready for stratification.
  • stratify_tuples (list) – Unique label for each example, which will be the value used for stratification.
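
scikit-learn's train_test_split raises an error when stratifying on a class that has only one member, which is why singleton examples are duplicated. The sketch below shows one way this preparation could work. The stratification key (the sorted tuple of unique labels in a sequence) and the function name `stratify_input_sketch` are assumptions for illustration, not the module's exact logic.

```python
from collections import Counter


def stratify_input_sketch(X, y):
    """Build a stratification key per sequence and duplicate singleton examples."""
    # Key each sequence by the set of labels it contains (assumed key).
    keys = [tuple(sorted(set(labels))) for labels in y]
    counts = Counter(keys)

    str_X, str_y, stratify_keys = [], [], []
    for features, labels, key in zip(X, y, keys):
        # A key seen only once is emitted twice so stratified splitting works.
        copies = 2 if counts[key] == 1 else 1
        for _ in range(copies):
            str_X.append(features)
            str_y.append(labels)
            stratify_keys.append(key)
    return str_X, str_y, stratify_keys


str_X, str_y, stratify_keys = stratify_input_sketch(
    X=["seq_a", "seq_b", "seq_c"],
    y=[[0, 1], [1, 0], [2]],
)
# str_X         -> ["seq_a", "seq_b", "seq_c", "seq_c"]
# stratify_keys -> [(0, 1), (0, 1), (2,), (2,)]
```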