Using LSTM for Entity Recognition
=================================

Entity recognition is the only task in the MindMeld NLP pipeline for which deep learning models are among the available classification models. In particular, MindMeld provides a `Bi-Directional Long Short-Term Memory (LSTM) Network `_, which has been shown to perform well on sequence labeling tasks such as entity recognition. The model is implemented in `TensorFlow `_.

.. note::

   Please make sure to install the TensorFlow requirement by running in the shell: :code:`pip install mindmeld[tensorflow]`.

LSTM network overview
^^^^^^^^^^^^^^^^^^^^^

The MindMeld Bi-Directional LSTM network

- encodes words as pre-trained word embeddings using Stanford's `GloVe representation `_
- encodes characters using a convolutional network trained on the training data
- concatenates the word and character embeddings together and feeds them into the bi-directional LSTM
- couples the forget and input gates of the LSTM using a peephole connection, to improve overall accuracies on downstream NLP tasks
- feeds the output of the LSTM into a `linear chain Conditional Random Field `_ (CRF) or `Softmax layer `_ which labels the target word as a particular entity

The diagram below describes the architecture of a typical Bi-Directional LSTM network.

.. figure:: /images/lstm_architecture_fix.png
   :scale: 50 %
   :align: center
   :alt: LSTM architecture diagram

   Courtesy: Guillaume Genthial

This design has these possible advantages:

- Deep neural networks (DNNs) outperform traditional machine learning models on training sets with about 1,000 or more queries, according to many research papers.
- DNNs require less feature engineering work than traditional machine learning models, because they use only two input features (word embeddings and gazetteers) compared to several hundred (n-grams, system entities, and so on).
- On GPU-enabled devices, the network can achieve training time comparable to some of the traditional models in MindMeld.

The possible disadvantages are:

- Performance may be no better than traditional machine learning models for training sets of about 1,000 queries or fewer.
- Training time on CPU-only machines is a lot slower than for traditional machine learning models.
- No automated hyperparameter tuning methods like :sk_api:`sklearn.model_selection.GridSearchCV ` are available for LSTMs.

LSTM parameter settings
^^^^^^^^^^^^^^^^^^^^^^^

Parameter tuning for an LSTM is more complex than for traditional machine learning models. A good starting point for understanding this subject is Andrej Karpathy's `course notes `_ from the Convolutional Neural Networks for Visual Recognition course at Stanford University.

``'params'`` (:class:`dict`)
   A dictionary of values to be used for model hyperparameters during training. The supported parameters are described in the table below.
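For example, the snippet below is a minimal sketch of how you might select the LSTM model and pass such a ``params`` dictionary when fitting an entity recognizer programmatically. The app path, domain, and intent names (``home_assistant``, ``times_and_dates``, ``change_alarm``) are placeholders; substitute the ones from your own application, and treat the parameter values as illustrative rather than recommended settings.

.. code-block:: python

   from mindmeld.components.nlp import NaturalLanguageProcessor

   # Build the NLP pipeline for a hypothetical app and fetch one entity recognizer
   nlp = NaturalLanguageProcessor(app_path='home_assistant')
   nlp.build()
   er = nlp.domains['times_and_dates'].intents['change_alarm'].entity_recognizer

   # Re-fit the recognizer using the LSTM model with custom hyperparameters
   er.fit(
       model_settings={'classifier_type': 'lstm'},
       params={
           'padding_length': 20,
           'batch_size': 20,
           'number_of_epochs': 20,
           'learning_rate': 0.005,
           'use_crf_layer': True,
       },
   )

Since no automated hyperparameter search is available for the LSTM, values like these are typically tuned by hand over a few fit-and-evaluate cycles.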
.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Parameter name
     - Description
   * - ``padding_length``
     - The sequence model treats this as the maximum number of words in a query.
       If a query has more words than ``padding_length``, the surplus words are discarded.

       Typically set to the maximum query length (in words) expected at both train and predict time.

       Default: ``20``

       Example: ``{'padding_length': 20}`` (a query can have a maximum of twenty words)
   * - ``batch_size``
     - Size of each batch of training data to feed into the network (which uses mini-batch learning).

       Default: ``20``

       Example: ``{'batch_size': 20}`` (feed twenty training queries to the network for each learning step)
   * - ``display_epoch``
     - The network displays training accuracy statistics at this interval, measured in epochs.

       Default: ``5``

       Example: ``{'display_epoch': 5}`` (display accuracy statistics every five epochs)
   * - ``number_of_epochs``
     - Total number of complete iterations of the training data to feed into the network.
       In each iteration, the data is shuffled to break any prior sequence patterns.

       Default: ``20``

       Example: ``{'number_of_epochs': 20}`` (iterate through the training data twenty times)
   * - ``optimizer``
     - Optimizer to use to minimize the network's stochastic objective function.

       Default: ``'adam'``

       Example: ``{'optimizer': 'adam'}`` (use the Adam optimizer to minimize the objective function)
   * - ``learning_rate``
     - Parameter to control the size of weight and bias changes of the training algorithm as it learns.
       `This `_ article explains learning rate in technical terms.

       Default: ``0.005``

       Example: ``{'learning_rate': 0.005}`` (set the learning rate to 0.005)
   * - ``dense_keep_prob``
     - In the context of the *dropout* technique (a regularization method to prevent overfitting),
       keep probability specifies the proportion of nodes to "keep", that is, to exempt from dropout
       during the network's learning phase.

       The ``dense_keep_prob`` parameter sets the keep probability of the nodes in the dense network layer
       that connects the output of the LSTM layer to the nodes that predict the named entities.

       Default: ``0.5``

       Example: ``{'dense_keep_prob': 0.5}`` (50% of the nodes in the dense layer will not be turned off by dropout)
   * - ``lstm_input_keep_prob``
     - Keep probability for the nodes that constitute the inputs to the LSTM cell.

       Default: ``0.5``

       Example: ``{'lstm_input_keep_prob': 0.5}`` (50% of the nodes that are inputs to the LSTM cell will not be turned off by dropout)
   * - ``lstm_output_keep_prob``
     - Keep probability for the nodes that constitute the outputs of the LSTM cell.

       Default: ``0.5``

       Example: ``{'lstm_output_keep_prob': 0.5}`` (50% of the nodes that are outputs of the LSTM cell will not be turned off by dropout)
   * - ``token_lstm_hidden_state_dimension``
     - Number of states per LSTM cell.

       Default: ``300``

       Example: ``{'token_lstm_hidden_state_dimension': 300}`` (an LSTM cell will have 300 states)
   * - ``token_embedding_dimension``
     - Number of dimensions for word embeddings.

       Allowed values: ``[50, 100, 200, 300]``.

       Default: ``300``

       Example: ``{'token_embedding_dimension': 300}`` (each word embedding will have 300 dimensions)
   * - ``gaz_encoding_dimension``
     - Number of nodes to connect to the gazetteer encodings in a fully-connected network.

       Default: ``100``

       Example: ``{'gaz_encoding_dimension': 100}`` (100 nodes will be connected to the gazetteer encodings in a fully-connected network)
   * - ``max_char_per_word``
     - The sequence model treats this as the maximum number of characters in a word.
       If a word has more characters than ``max_char_per_word``, the surplus characters are discarded.

       Usually set to the length of the longest word in the training and test sets.

       Default: ``20``

       Example: ``{'max_char_per_word': 20}`` (a word can have a maximum of twenty characters)
   * - ``use_crf_layer``
     - If set to ``True``, use a linear chain Conditional Random Field layer as the final layer, which predicts sequence tags.

       If set to ``False``, use a softmax layer to predict sequence tags.

       Default: ``False``

       Example: ``{'use_crf_layer': True}`` (use the CRF layer)
   * - ``use_character_embeddings``
     - If set to ``True``, use character embeddings trained on the training data using a convolutional network.

       If set to ``False``, do not use character embeddings.

       Note: Using character embeddings significantly increases training time compared to using word embeddings alone.

       Default: ``False``

       Example: ``{'use_character_embeddings': True}`` (use character embeddings)
   * - ``char_window_sizes``
     - List of window sizes for convolutions that the network should use to build the character embeddings.
       Usually in decreasing numerical order.

       Note: This parameter is needed only if ``use_character_embeddings`` is set to ``True``.

       Default: ``[5]``

       Example: ``{'char_window_sizes': [5, 3]}``

       - first, use a convolution of size 5
       - next, feed the output of that convolution through a convolution of size 3
   * - ``character_embedding_dimension``
     - Initial dimension of each character before it is fed into the convolutional network.

       Note: This parameter is needed only if ``use_character_embeddings`` is set to ``True``.

       Default: ``10``

       Example: ``{'character_embedding_dimension': 10}`` (initialize the dimension of each character to ten)
   * - ``word_level_character_embedding_size``
     - The final dimension of each character after it is transformed by the convolutional network.

       Usually greater than ``character_embedding_dimension``, since it encodes more information about orthography and semantics.

       Note: This parameter is needed only if ``use_character_embeddings`` is set to ``True``.

       Default: ``40``

       Example: ``{'word_level_character_embedding_size': 40}`` (each character will have a dimension of forty after convolutional network training)
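As an alternative to passing these values programmatically at fit time, they can also be captured in the application's ``config.py``. The dictionary below is a minimal sketch of such a configuration; the ``ENTITY_RECOGNIZER_CONFIG`` name and the ``'model_type': 'tagger'`` setting follow MindMeld's standard configuration conventions, and the parameter values are illustrative rather than recommended settings.

.. code-block:: python

   # config.py of the MindMeld application (illustrative values, not tuned recommendations)
   ENTITY_RECOGNIZER_CONFIG = {
       'model_type': 'tagger',
       'model_settings': {
           'classifier_type': 'lstm',
       },
       'params': {
           'padding_length': 20,
           'batch_size': 20,
           'number_of_epochs': 20,
           'optimizer': 'adam',
           'learning_rate': 0.005,
           'use_crf_layer': True,
           'use_character_embeddings': True,
           'char_window_sizes': [5, 3],
           'character_embedding_dimension': 10,
           'word_level_character_embedding_size': 40,
       },
   }

With a configuration like this in place, the entity recognizer should pick up the LSTM settings the next time the application's NLP models are built, for example via ``nlp.build()``.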