Using LSTM for Entity Recognition¶

Entity recognition is the one task within the NLP pipeline where deep learning models are among the available classification models. In particular, MindMeld provides a Bi-Directional Long Short-Term Memory (LSTM) Network, which has been shown to perform well on sequence labeling tasks such as entity recognition. The model is implemented in TensorFlow.

Note

Please make sure to install the Tensorflow requirement by running in the shell: pip install mindmeld[tensorflow].

LSTM network overview¶

The MindMeld Bi-Directional LSTM network

encodes words as pre-trained word embeddings using Stanford's GloVe representation

encodes characters using a convolutional network trained on the training data

concatenates the word and character embeddings together and feeds them into the bi-directional LSTM

couples the forget and input gates of the LSTM using a peephole connection, to improve overall accuracies on downstream NLP tasks

feeds the output of the LSTM into a linear chain Conditional Random Field (CRF) or Softmax layer which labels the target word as a particular entity

The diagram below describes the architecture of a typical Bi-Directional LSTM network.

Courtesy: Guillaume Genthial

This design has these possible advantages:

Deep neural networks (DNNs) outperform traditional machine learning models on training sets with about 1,000 or more queries, according to many research papers.
DNNs require less feature engineering work than traditional machine learning models, because they use only two input features (word embeddings and gazetteers) compared to several hundred (n-grams, system entities, and so on).
On GPU-enabled devices, the network can achieve training time comparable to some of the traditional models in MindMeld.

The possible disadvantages are:

Performance may be no better than traditional machine learning models for training sets of about 1,000 queries or fewer.
Training time on CPU-only machines is a lot slower than for traditional machine learning models.
No automated hyperparameter tuning methods like sklearn.model_selection.GridSearchCV are available for LSTMs.

LSTM parameter settings¶

Parameter tuning for an LSTM is more complex than for traditional machine learning models. A good starting point for understanding this subject is Andrej Karpathy's course notes from the Convolutional Neural Networks for Visual Recognition course at Stanford University.

'params' (dict): A dictionary of values to be used for model hyperparameters during training.

Parameter name	Description
`padding_length`	The sequence model treats this as the maximum number of words in a query. If a query has more words than `padding_length`, the surplus words are discarded. Typically set to the maximum word length of query expected both at train and predict time. Default: `20` Example: `{'padding_length': 20}` a query can have a maximum of twenty words
`batch_size`	Size of each batch of training data to feed into the network (which uses mini-batch learning). Default: `20` Example: `{'batch_size': 20}` feed twenty training queries to the network for each learning step
`display_epoch`	The network displays training accuracy statistics at this interval, measured in epochs. Default: `5` Example: `{'display_epoch': 5}` display accuracy statistics every five epochs
`number_of_epochs`	Total number of complete iterations of the training data to feed into the network. In each iteration, the data is shuffled to break any prior sequence patterns. Default: `20` Example: `{'number_of_epochs': 20}` iterate through the training data twenty times
`optimizer`	Optimizer to use to minimize the network's stochastic objective function. Default: `'adam'` Example: `{'optimizer': 'adam'}` use the Adam optimizer to minimize the objective function
`learning_rate`	Parameter to control the size of weight and bias changes of the training algorithm as it learns. This article explains Learning Rate in technical terms. Default: `0.005` Example: `{'learning_rate': 0.005}` set learning rate to 0.005
`dense_keep_prob`	In the context of the ''dropout'' technique (a regularization method to prevent overfitting), keep probability specifies the proportion of nodes to "keep"—that is, to exempt from dropout during the network's learning phase. The `dense_keep_prob` parameter sets the keep probability of the nodes in the dense network layer that connects the output of the LSTM layer to the nodes that predict the named entities. Default: `0.5` Example: `{'dense_keep_prob': 0.5}` 50% of the nodes in the dense layer will not be turned off by dropout
`lstm_input_keep_prob`	Keep probability for the nodes that constitute the inputs to the LSTM cell. Default: `0.5` Example: `{'lstm_input_keep_prob': 0.5}` 50% of the nodes that are inputs to the LSTM cell will not be turned off by dropout
`lstm_output_keep_prob`	Keep probability for the nodes that constitute the outputs of the LSTM cell. Default: `0.5` Example: `{'lstm_output_keep_prob': 0.5}` 50% of the nodes that are outputs of the LSTM cell will not be turned off by dropout
`token_lstm_hidden_state_dimension`	Number of states per LSTM cell. Default: `300` Example: `{'token_lstm_hidden_state_dimension': 300}` an LSTM cell will have 300 states
`token_embedding_dimension`	Number of dimensions for word embeddings. Allowed values: [50, 100, 200, 300]. Default: `300` Example: `{'token_embedding_dimension': 300}` each word embedding will have 300 dimensions
`gaz_encoding_dimension`	Number of nodes to connect to the gazetteer encodings in a fully-connected network. Default: `100` Example: `{'gaz_encoding_dimension': 100}` 100 nodes will be connected to the gazetteer encodings in a fully-connected network
`max_char_per_word`	The sequence model treats this as the maximum number of characters in a word. If a word has more characters than `max_char_per_word`, the surplus characters are discarded. Usually set to the size of the longest word in the training and test sets. Default: `20` Example: `{'max_char_per_word': 20}` a word can have a maximum of twenty characters
`use_crf_layer`	If set to `True`, use a linear chain Conditional Random Field layer for the final layer, which predicts sequence tags. If set to `False`, use a softmax layer to predict sequence tags. Default: `False` Example: `{'use_crf_layer': True}` use the CRF layer
`use_character_embeddings`	If set to `True`, use the character embedding trained on the training data using a convolutional network. If set to `False`, do not use character embeddings. Note: Using character embedding significantly increases training time compared to vanilla word embeddings only. Default: `False` Example: `{'use_character_embeddings': True}` use character embeddings
`char_window_sizes`	List of window sizes for convolutions that the network should use to build the character embeddings. Usually in decreasing numerical order. Note: This parameter is needed only if `use_character_embeddings` is set to `True`. Default: `[5]` Example: `{'char_window_sizes': [5, 3]}` first, use a convolution of size 5 next, feed the output of that convolution through a convolution of size 3
`character_embedding_dimension`	Initial dimension of each character before it is fed into the convolutional network. Note: This parameter is needed only if `use_character_embeddings` is set to `True`. Default: `10` Example: `{'character_embedding_dimension': 10}` initialize the dimension of each character to ten
`word_level_character_embedding_size`	The final dimension of each character after it is transformed by the convolutional network. Usually greater than `character_embedding_dimension` since it encodes more information about orthography and semantics. Note: This parameter is needed only if `use_character_embeddings` is set to `True`. Default: `40` Example: `{'word_level_character_embedding_size': 40}` each character should have dimension of forty, after convolutional network training