Working with the Entity Recognizer

The Entity Recognizer

  • is run as the third step in the natural language processing pipeline
  • is a sequence labeling or tagging model that detects all the relevant entities in a given query
  • is trained per intent, using all the labeled queries for a given intent, with labels derived from the entity types annotated within the training queries

Every MindMeld app has one entity recognizer for every intent that requires entity detection.

Note

  • This is an in-depth tutorial to work through from start to finish. Before you begin, read the Step-by-Step Guide, paying special attention to the Entity Recognition section.
  • This section requires the Home Assistant blueprint application. To get the app, open a terminal and run mindmeld blueprint home_assistant.

System entities and custom entities

Entities in MindMeld are categorized into two types:

System Entities
Generic entities that are application-agnostic and are automatically detected by MindMeld. Examples include numbers, time expressions, email addresses, URLs and measured quantities like distance, volume, currency and temperature. See More about system entities below.
Custom Entities
Application-specific entities that can only be detected by an entity recognizer that uses statistical models trained with deep domain knowledge. These are generally named entities, like ‘San Bernardino,’ a proper name that could be a location entity. Custom entities that are not based on proper nouns (and therefore are not named entities) are also possible.

This chapter focuses on training entity recognition models for detecting all the custom entities used by your app.

Access the entity recognizer

Working with any natural language processor component falls into two broad phases:

  • First, generate the training data for your app. App performance largely depends on having sufficient quantity and quality of training data. See Step 6.
  • Then, conduct experimentation in the Python shell.

When you are ready to begin experimenting, import the NaturalLanguageProcessor class from the MindMeld nlp module and instantiate an object with the path to your MindMeld project.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor(app_path='home_assistant')
nlp
<NaturalLanguageProcessor 'home_assistant' ready: False, dirty: False>

Verify that the NLP has correctly identified all the domains and intents for your app.

nlp.domains
{
 'greeting': <DomainProcessor 'greeting' ready: False, dirty: False>,
 'smart_home': <DomainProcessor 'smart_home' ready: False, dirty: False>,
 'times_and_dates': <DomainProcessor 'times_and_dates' ready: False, dirty: False>,
 'unknown': <DomainProcessor 'unknown' ready: False, dirty: False>,
 'weather': <DomainProcessor 'weather' ready: False, dirty: False>
}
nlp.domains['times_and_dates'].intents
{
 'change_alarm': <IntentProcessor 'change_alarm' ready: True, dirty: True>,
 'check_alarm': <IntentProcessor 'check_alarm' ready: False, dirty: False>,
 'remove_alarm': <IntentProcessor 'remove_alarm' ready: False, dirty: False>,
 'set_alarm': <IntentProcessor 'set_alarm' ready: True, dirty: True>,
 'start_timer': <IntentProcessor 'start_timer' ready: True, dirty: True>,
 'stop_timer': <IntentProcessor 'stop_timer' ready: False, dirty: False>
}
nlp.domains['weather'].intents
{
 'check_weather': <IntentProcessor 'check_weather' ready: False, dirty: False>
}

Access the EntityRecognizer for an intent of your choice, using the entity_recognizer attribute of the desired intent.

# Entity recognizer for the 'change_alarm' intent in the 'times_and_dates' domain:
er = nlp.domains['times_and_dates'].intents['change_alarm'].entity_recognizer
er
<EntityRecognizer ready: False, dirty: False>
# Entity recognizer for the 'check_weather' intent in the 'weather' domain:
er = nlp.domains['weather'].intents['check_weather'].entity_recognizer
er
<EntityRecognizer ready: False, dirty: False>

Train an entity recognizer

Use the EntityRecognizer.fit() method to train an entity recognition model. Depending on the size of the training data and the selected model, this can take anywhere from a few seconds to several minutes. With logging level set to INFO or below, you should see the build progress in the console along with cross-validation accuracy of the trained model.

from mindmeld import configure_logs; configure_logs()
er = nlp.domains['weather'].intents['check_weather'].entity_recognizer
er.fit()
Fitting entity recognizer: domain='weather', intent='check_weather'
Loading raw queries from file home_assistant/domains/weather/check_weather/train.txt
Loading queries from file home_assistant/domains/weather/check_weather/train.txt
Selecting hyperparameters using k-fold cross validation with 5 splits
Best accuracy: 99.14%, params: {'C': 10000, 'penalty': 'l2'}

The fit() method loads all necessary training queries and trains an entity recognition model. When called with no arguments (as in the example above), the method uses the settings from config.py, the app’s configuration file. If config.py is not defined, the method uses the MindMeld preset classifier configuration.

Using default settings is the recommended (and quickest) way to get started with any of the NLP classifiers. The resulting baseline classifier should provide a reasonable starting point from which to bootstrap your machine learning experimentation. You can then try alternate settings as you seek to identify the optimal classifier configuration for your app.

Classifier configuration

Use the config attribute of a trained classifier to view the configuration that the classifier is using. Here’s an example where we view the configuration of an entity recognizer trained using default settings:

er.config.to_dict()
{
  'features': {
    'bag-of-words-seq': {
      'ngram_lengths_to_start_positions': {
         1: [-2, -1, 0, 1, 2],
         2: [-2, -1, 0, 1]
      }
    },
    'in-gaz-span-seq': {},
    'sys-candidates-seq': {
      'start_positions': [-1, 0, 1]
    }
  },
  'model_settings': {
    'classifier_type': 'memm',
    'feature_scaler': 'max-abs',
    'tag_scheme': 'IOB'
  },
  'model_type': 'tagger',
  'param_selection': {
    'grid': {
      'C': [0.01, 1, 100, 10000, 1000000, 100000000],
      'penalty': ['l1', 'l2']
    },
   'k': 5,
   'scoring': 'accuracy',
   'type': 'k-fold'
  },
  'params': None,
  'train_label_set': 'train.*\.txt',
  'test_label_set': 'test.*\.txt'
}

Let’s take a look at the allowed values for each setting in an entity recognizer configuration.

  1. Model Settings
'model_type' (str)

Always 'tagger', since the entity recognizer is a tagger model. Tagging, sequence tagging, or sequence labeling are common terms used in NLP literature for models that generate a tag for each token in a sequence. Taggers are most commonly used for part-of-speech tagging or named entity recognition.

'model_settings' (dict)

A dictionary containing model-specific machine learning settings. The key 'classifier_type', whose value specifies the machine learning model to use, is required. Allowed values are shown in the table below.

Value     Classifier                      Reference for configurable hyperparameters
'memm'    Maximum Entropy Markov Model    sklearn.linear_model.LogisticRegression
'crf'     Conditional Random Field        sklearn-crfsuite
'lstm'    Long Short-Term Memory          lstm API

Tagger models allow you to specify the additional model settings shown below.

Key Value
'feature_scaler'

The methodology for scaling raw feature values. Applicable to the MEMM model only.

Allowed values are:

  • 'none': No scaling, i.e., use raw feature values.
  • 'std-dev': Standardize features by removing the mean and scaling to unit variance. See StandardScaler.
  • 'max-abs': Scale each feature by its maximum absolute value. See MaxAbsScaler.
'tag_scheme'

The tagging scheme for generating per-token labels.

Allowed values are:

  • 'IOB': The Inside-Outside-Beginning tagging format.
  • 'IOBES': An extension to IOB where 'E' represents the ending token in an entity span, and 'S' represents a single-token entity.
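
To make the difference between the two tag schemes concrete, here is a minimal sketch that turns annotated token spans into per-token tags. The helper name spans_to_tags and the example query are illustrative assumptions, not part of the MindMeld API; the tag format ('B|city', 'O|') follows the evaluation output shown later in this chapter.

```python
def spans_to_tags(num_tokens, spans, scheme="IOB"):
    """Convert (start, end, entity_type) token spans into per-token tags.

    `end` is inclusive. Hypothetical helper, for illustration only.
    """
    tags = ["O|"] * num_tokens          # tokens outside any entity
    for start, end, entity_type in spans:
        for i in range(start, end + 1):
            prefix = "B" if i == start else "I"
            if scheme == "IOBES":
                if start == end:
                    prefix = "S"        # single-token entity
                elif i == end:
                    prefix = "E"        # last token of a multi-token entity
            tags[i] = f"{prefix}|{entity_type}"
    return tags


tokens = "weather in San Francisco today".split()
spans = [(2, 3, "city"), (4, 4, "sys_time")]

iob_tags = spans_to_tags(len(tokens), spans, scheme="IOB")
iobes_tags = spans_to_tags(len(tokens), spans, scheme="IOBES")
print(list(zip(tokens, iob_tags)))
print(list(zip(tokens, iobes_tags)))
```

The two outputs differ only on the tokens that end an entity or form a single-token entity, which is the extra boundary information IOBES gives the model.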

  2. Feature Extraction Settings
'features' (dict)

A dictionary whose keys are names of feature groups to extract. The corresponding values are dictionaries representing the feature extraction settings for each group. The table below enumerates the features that can be used for entity recognition.

Group Name Description
'bag-of-words-seq'

Generates n-grams of specified lengths from the query text surrounding the current token.

Settings:

A dictionary with n-gram lengths as keys and a list of starting positions as values. Each starting position is a token index, relative to the current token.

Examples:

'ngram_lengths_to_start_positions': {1: [0], 2: [0]}
  • extracts all words (unigrams) and bigrams starting with the current token
'ngram_lengths_to_start_positions': {1: [-1, 0, 1], 2: [-1, 0, 1]}
  • additionally includes unigrams and bigrams starting from the words before and after the current token

Given the query “weather in {San Francisco|location} {next week|sys_time}” and a classifier extracting features for the token “Francisco”:

{1: [-1, 0, 1]}
  • extracts “San”, “Francisco”, and “next”
{2: [-1, 0, 1]}
  • extracts “San Francisco”, “Francisco next”, and “next week”

You can also limit the n-grams considered while extracting the feature by setting a threshold on their frequency. These frequencies are computed over the entire training set. This prevents infrequent n-grams from being used as features. By default, the threshold is set to 0.

Example:

{
  'ngram_lengths_to_start_positions': {2: [-1, 0], 3: [0]},
  'thresholds': [5]
}
  • extracts all bigrams starting with current token and previous token whose frequency in the training set is 5 or greater. It also extracts all trigrams starting with the current token.
'enable-stemming'

Stemming is the process of reducing inflected words to their word stem or base form. For example, the word stem of “eating” is “eat”, and the word stem of “backwards” is “backward”. MindMeld extracts word stems using a variant of the Porter stemming algorithm that only removes inflectional suffixes.

If this flag is set to True, the stemmed versions of the n-grams are extracted from the query in addition to regular n-grams when using the 'bag-of-words-seq' feature described above.

Example:

'features': {
     'bag-of-words-seq': {
         'ngram_lengths_to_start_positions': {
             1: [-1, 0, 1],
         }
     },
     'enable-stemming': True
}

Given the query “{two|sys_number} orders of {breadsticks|dish}” and a classifier extracting features for the token “of”, the above config would extract [“orders”, “of”, “breadsticks”, “order”, “breadstick”].

'char-ngrams-seq'

Generates character n-grams of specified lengths from the query text surrounding the current token.

Settings:

A dictionary with character n-gram lengths as keys and a list of starting positions as values. Each starting position is a token index, relative to the current token.

Examples:

'ngram_lengths_to_start_positions': {1: [0], 2: [0]}
  • extracts all characters (unigrams) and character bigrams starting with the current token
'ngram_lengths_to_start_positions': {1: [-1, 0, 1], 2: [-1, 0, 1]}
  • additionally includes character unigrams and bigrams starting from the words before and after the current token

Given the query “weather in {Utah|location}” and a classifier extracting features for the token “in”:

{1: [0]}
  • extracts ‘i’ and ‘n’
{2: [-1, 0, 1]}
  • extracts ‘we’, ‘ea’, ‘at’, ‘th’, ‘he’, ‘er’, ‘in’, ‘Ut’, ‘ta’, and ‘ah’

You can also limit the character n-grams considered while extracting the feature by setting a threshold on their frequency. These frequencies are computed over the entire training set. This prevents infrequent n-grams from being used as features. By default, the threshold is set to 0.

Example:

{
  'ngram_lengths_to_start_positions': {2: [-1, 0], 3: [0]},
  'thresholds': [5]
}
  • extracts all character bigrams in current token and previous token whose frequency in the training set is 5 or greater. It also extracts all character trigrams in the current token.

'in-gaz-span-seq'

Generates a set of features indicating the presence of the current token in different entity gazetteers, along with popularity information (as defined in the gazetteer).

'sys-candidates-seq'

Generates a set of features indicating the presence of system entities in the query text surrounding the current token.

Settings:

A dictionary with a single key named 'start_positions' and a list of different starting positions as its value. As in the 'bag-of-words-seq' feature, each starting position is a token index, relative to the current token.

Example:

'start_positions': [-1, 0, 1]
  • extracts features indicating whether the current token or its immediate neighbors are system entities
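
As a rough illustration of two ideas from the table above, the sketch below extracts character n-grams from the tokens at given offsets around the current token, and then filters n-grams by how often they occur in a training set. The function names and the tiny two-query training set are assumptions made for this example; the sketch does not reproduce MindMeld's internal feature extractors, and it simply skips offsets that fall outside the query.

```python
from collections import Counter


def char_ngrams(tokens, index, lengths_to_starts):
    """Character n-grams of the tokens at the given offsets around `index`."""
    grams = []
    for length, starts in lengths_to_starts.items():
        for offset in starts:
            pos = index + offset
            if 0 <= pos < len(tokens):      # skip offsets outside the query
                word = tokens[pos]
                grams.extend(word[i:i + length]
                             for i in range(len(word) - length + 1))
    return grams


def frequent_char_ngrams(training_queries, length, threshold):
    """Character n-grams whose training-set frequency is at least `threshold`."""
    counts = Counter()
    for query in training_queries:
        for word in query.split():
            counts.update(word[i:i + length]
                          for i in range(len(word) - length + 1))
    return {gram for gram, count in counts.items() if count >= threshold}


tokens = "weather in Utah".split()
char_unigrams = char_ngrams(tokens, 1, {1: [0]})           # current token "in"
char_bigrams = char_ngrams(tokens, 1, {2: [-1, 0, 1]})     # "weather", "in", "Utah"

# Hypothetical training set, used only to demonstrate the frequency threshold
kept = frequent_char_ngrams(["weather in Utah", "weather in Ohio"], 2, 2)
print(char_unigrams, char_bigrams, sorted(kept))
```

With a threshold of 2, the bigrams that occur only once in the toy training set (those from “Utah” and “Ohio”) are dropped, while those from “weather” and “in” are kept.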

Note

The LSTM model only supports the ‘in-gaz-span-seq’ feature since, for entity recognition tasks, it requires a minimal set of input features to achieve accuracies comparable to traditional models.

  3. Hyperparameter Settings
'params' (dict)

A dictionary of values to be used for model hyperparameters during training. Examples include the norm used in penalization as 'penalty' for MEMM, the coefficients for L1 and L2 regularization 'c1' and 'c2' for CRF, and so on. The list of allowable hyperparameters depends on the model selected. See the reference links above for parameter lists.

'param_selection' (dict)

A dictionary of settings for hyperparameter selection. Provides an alternative to the 'params' dictionary above if the ideal hyperparameters for the model are not already known and need to be estimated.

To estimate parameters, MindMeld needs two pieces of information from the developer:

  1. The parameter space to search, as the value for the 'grid' key
  2. The strategy for splitting the labeled data into training and validation sets, as the value for the 'type' key

Depending on the splitting scheme selected, the param_selection dictionary can contain other keys that define additional settings. The table below enumerates the allowable keys.

Key Value
'grid'

A dictionary which maps each hyperparameter to a list of potential values to search. Here is an example for a logistic regression model:

{
  'penalty': ['l1', 'l2'],
  'C': [10, 100, 1000, 10000, 100000],
  'fit_intercept': [True, False]
}

See the reference links above for details on the hyperparameters available for each model.

'type'

The cross-validation methodology to use, for example 'k-fold' or 'shuffle', the two strategies used in this chapter.

'k'

Number of folds (splits)

'scoring'

The metric to use for evaluating model performance. One of:

  • 'accuracy': Accuracy score at a tag level
  • 'seq_accuracy': Accuracy score at a full sequence level (not available for MEMM)

To identify the parameters that give the highest accuracy, the fit() method does an exhaustive grid search over the parameter space, evaluating candidate models using the specified cross-validation strategy. Subsequent calls to fit() can use these optimal parameters and skip the parameter selection process.
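
To get a feel for the cost of an exhaustive search, the following sketch enumerates every candidate in the default grid shown earlier and pairs it with a simple k-fold split. It is a simplified stand-in written for this explanation, not MindMeld's implementation; the helper names and the figure of 100 labeled queries are assumptions.

```python
from itertools import product

grid = {
    'C': [0.01, 1, 100, 10000, 1000000, 100000000],
    'penalty': ['l1', 'l2'],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


def k_fold_indices(num_examples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [num_examples // k + (1 if i < num_examples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(num_examples)
                 if not start <= i < start + size]
        yield train, validation
        start += size


candidates = expand_grid(grid)
folds = list(k_fold_indices(num_examples=100, k=5))   # 100 is a made-up size
total_fits = len(candidates) * len(folds)
print(len(candidates), "candidates x", len(folds), "folds =", total_fits, "fits")
```

Every value added to the grid multiplies the number of model fits, which is why narrowing the grid is a common first step when training time becomes a concern.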

Note

The LSTM model does not support automatic hyperparameter tuning. The user needs to manually tune the hyperparameters for the individual datasets.

  4. Custom Train/Test Settings
'train_label_set' (str)

A string representing a regex pattern that selects all training files for entity model training with filenames that match the pattern. The default regex when this key is not specified is 'train.*\.txt'.

'test_label_set' (str)

A string representing a regex pattern that selects all evaluation files for entity model testing with filenames that match the pattern. The default regex when this key is not specified is 'test.*\.txt'.
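
The sketch below shows how patterns like the two defaults select files. It assumes the pattern is matched from the start of each filename (Python's re.match); whether MindMeld anchors the match in exactly this way is not confirmed here, and the filenames are invented for the example.

```python
import re


def select_files(filenames, pattern):
    """Return the filenames that match `pattern` from the start of the name."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.match(name)]


# Hypothetical contents of an intent folder
files = ['train.txt', 'train_v2.txt', 'test.txt', 'test_heldout.txt', 'notes.md']

train_files = select_files(files, r'train.*\.txt')
test_files = select_files(files, r'test.*\.txt')
print(train_files)
print(test_files)
```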

Training with custom configurations

To override MindMeld’s default entity recognizer configuration with custom settings, either edit the app configuration file or call the fit() method with appropriate arguments.

1. Application configuration file

When you define custom classifier settings in config.py, the EntityRecognizer.fit() and NaturalLanguageProcessor.build() methods use those settings instead of MindMeld’s defaults. To do this, define a dictionary of your custom settings, named ENTITY_RECOGNIZER_CONFIG.

Here’s an example of a config.py file where custom settings optimized for the app override the preset configuration for the entity recognizer.

ENTITY_RECOGNIZER_CONFIG = {
    'model_type': 'tagger',
    'model_settings': {
        'classifier_type': 'memm',
        'tag_scheme': 'IOBES',
        'feature_scaler': 'max-abs'
    },
    'param_selection': {
        'type': 'k-fold',
        'k': 5,
        'scoring': 'accuracy',
        'grid': {
            'penalty': ['l1', 'l2'],
            'C': [0.01, 1, 100, 10000]
        },
    },
    'features': {
        'bag-of-words-seq': {
            'ngram_lengths_to_start_positions': {
                1: [-2, -1, 0, 1, 2],
                2: [-1, 0, 1]
            }
        },
        'in-gaz-span-seq': {},
        'sys-candidates-seq': {
          'start_positions': [-1, 0, 1]
        }
    }
}

Settings defined in ENTITY_RECOGNIZER_CONFIG apply to entity recognizers across all domains and intents in your application. For finer-grained control, you can implement the get_entity_recognizer_config() function in config.py to specify suitable configurations for each intent. This gives you the flexibility to modify models and features based on the domain and intent.

import copy

def get_entity_recognizer_config(domain, intent):
    SPECIAL_CONFIG = copy.deepcopy(ENTITY_RECOGNIZER_CONFIG)
    if domain == 'smart_home' and intent == 'specify_location':
        param_grid = {
            'c1': [0, 0.1, 0.5, 1],
            'c2': [1, 10, 100]
            }
        SPECIAL_CONFIG['model_settings']['classifier_type'] = 'crf'
        SPECIAL_CONFIG['param_selection']['grid'] = param_grid
    return SPECIAL_CONFIG
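
A quick way to sanity-check a per-intent configuration function is to call it directly and confirm that the shared base configuration is left untouched. The sketch below does this with a trimmed-down stand-in for ENTITY_RECOGNIZER_CONFIG; the shortened dictionary and the variable names special, default, and base_type are assumptions made to keep the example self-contained.

```python
import copy

# Trimmed-down stand-in for the full configuration shown above
ENTITY_RECOGNIZER_CONFIG = {
    'model_type': 'tagger',
    'model_settings': {'classifier_type': 'memm', 'tag_scheme': 'IOBES'},
    'param_selection': {'type': 'k-fold', 'k': 5, 'grid': {'C': [1, 100]}},
}


def get_entity_recognizer_config(domain, intent):
    # Deep-copy so that per-intent changes never leak into the shared config
    config = copy.deepcopy(ENTITY_RECOGNIZER_CONFIG)
    if domain == 'smart_home' and intent == 'specify_location':
        config['model_settings']['classifier_type'] = 'crf'
        config['param_selection']['grid'] = {
            'c1': [0, 0.1, 0.5, 1],
            'c2': [1, 10, 100],
        }
    return config


special = get_entity_recognizer_config('smart_home', 'specify_location')
default = get_entity_recognizer_config('weather', 'check_weather')
base_type = ENTITY_RECOGNIZER_CONFIG['model_settings']['classifier_type']
print(special['model_settings']['classifier_type'],
      default['model_settings']['classifier_type'],
      base_type)
```

Without the deep copy, the nested 'model_settings' dictionary would be shared, and the CRF override for one intent would silently apply to every intent configured afterwards.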

Using config.py is recommended for storing your optimal classifier settings once you have identified them through experimentation. Then the classifier training methods will use the optimized configuration to rebuild the models. A common use case is retraining models on newly-acquired training data, without retuning the underlying model settings.

Since this method requires updating a file each time you modify a setting, it’s less suitable for rapid prototyping than the method described next.

2. Arguments to the fit() method

For experimenting with an entity recognizer, the recommended method is to use arguments to the fit() method. The main areas for exploration are feature extraction, hyperparameter tuning, and model selection.

Feature extraction

Let’s start with the baseline classifier that was trained above. Here’s how you get the default feature set used by the classifier.

my_features = er.config.features
my_features
{
  'bag-of-words-seq': {
    'ngram_lengths_to_start_positions': {
      1: [-2, -1, 0, 1, 2],
      2: [-2, -1, 0, 1]
    }
  },
  'in-gaz-span-seq': {},
  'sys-candidates-seq': {
    'start_positions': [-1, 0, 1]
  }
}

Notice that the 'ngram_lengths_to_start_positions' settings tell the classifier to extract n-grams within a context window of two tokens or fewer around the token of interest, that is, just words in the immediate vicinity.

Let’s have the classifier look at a larger context window — extract n-grams starting from tokens that are further away. We’ll see whether that provides better information than the smaller default window. To do so, change the 'ngram_lengths_to_start_positions' settings to extract all the unigrams and bigrams in a window of three tokens around the current token, as shown below.

my_features['bag-of-words-seq']['ngram_lengths_to_start_positions'] = {
    1: [-3, -2, -1, 0, 1, 2, 3],
    2: [-3, -2, -1, 0, 1, 2]
}
my_features
{
  'bag-of-words-seq': {
    'ngram_lengths_to_start_positions': {
      1: [-3, -2, -1, 0, 1, 2, 3],
      2: [-3, -2, -1, 0, 1, 2]
    }
  },
  'in-gaz-span-seq': {},
  'sys-candidates-seq': {
    'start_positions': [-1, 0, 1]
  }
}

Suppose w_i represents the word at the i-th index in the query, where the index is calculated relative to the current token. Then, the above feature configuration should extract the following n-grams (w_0 being the current token).

  • Unigrams: { w_{-3}, w_{-2}, w_{-1}, w_0, w_1, w_2, w_3 }
  • Bigrams: { w_{-3} w_{-2}, w_{-2} w_{-1}, w_{-1} w_0, w_0 w_1, w_1 w_2, w_2 w_3 }
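
The mapping from start positions to n-grams can be sketched in a few lines of plain Python. This is an illustrative helper (window_ngrams is a made-up name), not MindMeld's feature extractor; as a simplification it skips n-grams that would run past either end of the query, and the example query is invented.

```python
def window_ngrams(tokens, index, lengths_to_starts):
    """Word n-grams that start at the given offsets relative to `index`."""
    ngrams = []
    for length, starts in lengths_to_starts.items():
        for offset in starts:
            begin = index + offset
            end = begin + length
            if begin >= 0 and end <= len(tokens):   # skip out-of-range n-grams
                ngrams.append(' '.join(tokens[begin:end]))
    return ngrams


tokens = "what is the weather in San Francisco next week".split()
settings = {
    1: [-3, -2, -1, 0, 1, 2, 3],
    2: [-3, -2, -1, 0, 1, 2],
}

# Features for the current token "weather" (index 3)
ngrams = window_ngrams(tokens, 3, settings)
print(ngrams)
```

For the token “weather”, this yields seven unigrams (from “what” through “Francisco”) and six bigrams (from “what is” through “San Francisco”), matching the two sets listed above.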

To retrain the classifier with the updated feature set, pass in the my_features dictionary as an argument to the features parameter of the fit() method. This trains the entity recognition model using our new feature extraction settings, while continuing to use MindMeld defaults for model type (MEMM) and hyperparameter selection.

er.fit(features=my_features)
Fitting entity recognizer: domain='weather', intent='check_weather'
Selecting hyperparameters using k-fold cross-validation with 5 splits
Best accuracy: 99.04%, params: {'C': 10000, 'penalty': 'l2'}

The exact accuracy number and the selected params might be different each time we run hyperparameter tuning, which we will explore in detail in the next section.

Hyperparameter tuning

View the model’s hyperparameter settings, keeping in mind the hyperparameters available for the MEMM model in MindMeld. These include 'C', the inverse of regularization strength, and 'fit_intercept', which determines whether to add an intercept term to the decision function. The 'fit_intercept' parameter is not shown in the response but defaults to True.

my_param_settings = er.config.param_selection
my_param_settings
{
  'grid': {
    'C': [0.01, 1, 100, 10000, 1000000, 100000000],
    'penalty': ['l1', 'l2']
  },
 'k': 5,
 'scoring': 'accuracy',
 'type': 'k-fold'
}

Let’s reduce the range of values to search for 'C', and allow the hyperparameter estimation process to choose whether to add an intercept term to the decision function.

Pass the updated settings to fit() as an argument to the param_selection parameter. The fit() method then searches over the updated parameter grid, and prints the hyperparameter values for the model whose cross-validation accuracy is highest.

my_param_settings['grid']['C'] = [0.01, 1, 100, 10000]
my_param_settings['grid']['fit_intercept'] = [True, False]
my_param_settings
{
  'grid': {
    'C': [0.01, 1, 100, 10000],
    'fit_intercept': [True, False],
    'penalty': ['l1', 'l2']
  },
 'k': 5,
 'scoring': 'accuracy',
 'type': 'k-fold'
}
er.fit(param_selection=my_param_settings)
Fitting entity recognizer: domain='weather', intent='check_weather'
No app configuration file found. Using default entity model configuration
Selecting hyperparameters using k-fold cross-validation with 5 splits
Best accuracy: 99.09%, params: {'C': 100, 'fit_intercept': False, 'penalty': 'l1'}

Finally, we’ll try a new cross-validation strategy of randomized folds, replacing the default of k-fold. We’ll keep the default of five folds. To do this, we modify the value of the 'type' key in my_param_settings:

my_param_settings['type'] = 'shuffle'
my_param_settings
{
  'grid': {
    'C': [0.01, 1, 100, 10000],
    'fit_intercept': [True, False],
    'penalty': ['l1', 'l2']
  },
 'k': 5,
 'scoring': 'accuracy',
 'type': 'shuffle'
}
er.fit(param_selection=my_param_settings)
Fitting entity recognizer: domain='weather', intent='check_weather'
No app configuration file found. Using default entity model configuration
Selecting hyperparameters using shuffle cross-validation with 5 splits
Best accuracy: 99.39%, params: {'C': 100, 'fit_intercept': False, 'penalty': 'l1'}

For a list of configurable hyperparameters for each model, along with available cross-validation methods, see hyperparameter settings.

Model settings

To vary the model training settings, start by inspecting the current settings:

my_model_settings = er.config.model_settings
my_model_settings
{'feature_scaler': 'max-abs', 'tag_scheme': 'IOB'}

For an example experiment, we’ll turn off feature scaling and change the tagging scheme to IOBES, while leaving defaults in place for feature extraction and hyperparameter selection.

Retrain the entity recognition model with our updated settings:

my_model_settings['feature_scaler'] = None
my_model_settings['tag_scheme'] = 'IOBES'
er.fit(model_settings=my_model_settings)
Fitting entity recognizer: domain='weather', intent='check_weather'
No app configuration file found. Using default entity model configuration
Selecting hyperparameters using k-fold cross-validation with 5 splits
Best accuracy: 98.78%, params: {'C': 10000, 'penalty': 'l2'}

Run the entity recognizer

Entity recognition takes place in two steps:

  1. The trained sequence labeling model predicts the output tag (in IOB or IOBES format) with the highest probability for each token in the input query.
  2. The predicted tags are then processed to extract the span and type of each entity in the query.
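
The second step can be sketched as a small decoder that groups consecutive tags into entity spans. This is an illustrative simplification rather than MindMeld's own decoder: it handles the IOB scheme only, and it leniently treats an 'I' tag with no matching 'B' as the start of a new entity, which is an assumption of this example.

```python
def tags_to_entities(tokens, tags):
    """Group IOB tags such as 'B|city', 'I|city', 'O|' into entity spans.

    Returns (text, entity_type, start_token, end_token) tuples.
    """
    entities = []
    current = None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition('|')
        continues = (prefix == 'I' and current is not None
                     and current['type'] == entity_type)
        if continues:
            current['end'] = i              # extend the open entity
            continue
        if current is not None:             # close the open entity, if any
            entities.append(current)
            current = None
        if prefix in ('B', 'I'):            # open a new entity
            current = {'type': entity_type, 'start': i, 'end': i}
    if current is not None:
        entities.append(current)
    return [(' '.join(tokens[e['start']:e['end'] + 1]),
             e['type'], e['start'], e['end']) for e in entities]


tokens = "Weather in San Francisco next week".split()
tags = ['O|', 'O|', 'B|city', 'I|city', 'B|sys_time', 'I|sys_time']
entities = tags_to_entities(tokens, tags)
print(entities)
```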

Run the trained entity recognizer on a test query using the EntityRecognizer.predict() method, which returns a list of detected entities in the query.

er.predict('Weather in San Francisco next week')
(<QueryEntity 'San Francisco' ('city') char: [11-23], tok: [2-3]>,
 <QueryEntity 'next week' ('sys_time') char: [25-33], tok: [4-5]>)

Note

At runtime, the natural language processor’s process() method calls predict() to recognize all the entities in an incoming query.

We want to know how confident our trained model is in its prediction. To view the confidence score of the predicted entity label, use the EntityRecognizer.predict_proba() method. This is useful both for experimenting with the classifier settings and for debugging classifier performance.

The result is a tuple of tuples. In each inner tuple, the first element is the entity itself and the second element is the associated confidence score.

er.predict_proba('Weather in San Francisco next week')
((<QueryEntity 'San Francisco' ('city') char: [11-23], tok: [2-3]>, 0.9994949555840245),
(<QueryEntity 'next week' ('sys_time') char: [25-33], tok: [4-5]>, 0.9994573416716696))

An ideal entity recognizer would assign a high confidence score to the expected (correct) class label for a test query, while assigning very low probabilities to incorrect labels.

Note

Unlike the domain and intent labels, the confidence score reported for an entity sequence is the score associated with the least likely tag in that sequence. For example, the model assigns the tag 'B|city' to the word “San” with some score x and 'I|city' to the word “Francisco” with some score y. The final confidence score associated with this entity is the minimum of x and y.
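
In code, the rule described in this note amounts to taking a minimum over the per-tag scores of the entity. The scores below are made-up values for illustration, and entity_confidence is a hypothetical helper name.

```python
def entity_confidence(tag_scores):
    """Confidence of an entity: the score of its least likely tag."""
    return min(tag_scores)


# Hypothetical per-tag scores for 'B|city' ("San") and 'I|city' ("Francisco")
x, y = 0.9995, 0.9971
confidence = entity_confidence([x, y])
print(confidence)
```

One consequence of this rule is that a long entity can never score higher than its weakest token, so a single uncertain tag is enough to lower the confidence of the whole span.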

The predict() and predict_proba() methods take one query at a time. Next, we’ll see how to test a trained model on a batch of labeled test queries.

Evaluate classifier performance

Before you can evaluate the accuracy of your trained entity recognizer, you must first create labeled test data and place it in your MindMeld project as described in the Natural Language Processor chapter.

Then, when you are ready, use the EntityRecognizer.evaluate() method, which

  • strips away all ground truth annotations from the test queries,
  • passes the resulting unlabeled queries to the trained entity recognizer for prediction, and
  • compares the classifier’s output predictions against the ground truth labels to compute the model’s prediction accuracy.

In the example below, the model gets 35 out of 37 test queries correct, resulting in an accuracy of about 94.6%.

er.evaluate()
Loading queries from file weather/check_weather/test.txt
<EntityModelEvaluation score: 94.59%, 35 of 37 examples correct>

Note that this is query-level accuracy. A prediction on a query can only be graded as “correct” when all the entities detected by the entity recognizer exactly match the annotated entities in the test query.
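
The following sketch shows what query-level grading means in practice: a query counts as correct only if its full set of predicted entities equals the annotated set. The data and the query_level_accuracy helper are invented for this example; entities are represented as simple (text, type) pairs rather than MindMeld's QueryEntity objects.

```python
def query_level_accuracy(expected, predicted):
    """Fraction of queries whose predicted entities exactly match the labels."""
    correct = sum(1 for gold, pred in zip(expected, predicted)
                  if set(gold) == set(pred))
    return correct / len(expected)


# Hypothetical annotations and predictions for three test queries
expected = [
    [('San Francisco', 'city')],
    [('Utah', 'city'), ('tomorrow', 'sys_time')],
    [],
]
predicted = [
    [('San Francisco', 'city')],
    [('Utah', 'city')],            # one entity missed, so the query is wrong
    [],
]

accuracy = query_level_accuracy(expected, predicted)
reported = round(35 / 37 * 100, 2)   # the 35-of-37 figure from the output above
print(accuracy, reported)
```

Missing a single entity makes the whole second query incorrect, even though one of its two entities was found, which is why query-level accuracy is usually lower than tag-level accuracy.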

The aggregate accuracy score we see above is only the beginning, because the evaluate() method returns a rich object containing overall statistics, statistics by class, a confusion matrix, and sequence statistics.

Print all the model performance statistics reported by the evaluate() method:

eval = er.evaluate()
eval.print_stats()
Overall tag-level statistics:

   accuracy f1_weighted          tp          tn          fp          fn    f1_macro    f1_micro
      0.986       0.985         204         825           3           3       0.975       0.986



Tag-level statistics by class:

              class      f_beta   precision      recall     support          tp          tn          fp          fn
                 O|       0.990       0.981       1.000         155         155          49           3           0
             B|city       0.985       1.000       0.971          34          33         173           0           1
         B|sys_time       1.000       1.000       1.000           4           4         203           0           0
         I|sys_time       1.000       1.000       1.000           3           3         204           0           0
             I|city       0.900       1.000       0.818          11           9         196           0           2



Confusion matrix:

                           O|         B|city     B|sys_time     I|sys_time         I|city
            O|            155              0              0              0              0
        B|city              1             33              0              0              0
    B|sys_time              0              0              4              0              0
    I|sys_time              0              0              0              3              0
        I|city              2              0              0              0              9



Segment-level statistics:

         le          be         lbe          tp          tn          fp          fn
          0           1           0          36          42           0           1



Sequence-level statistics:

  sequence_accuracy
              0.946

The eval.get_stats() method returns all the above statistics in a structured dictionary without printing them to the console.

Let’s decipher the statistics output by the evaluate() method.

Overall tag-level statistics

Aggregate IOB or IOBES tag-level stats measured across the entire test set:

accuracy Classification accuracy score
f1_weighted Class-weighted average f1 score
tp Number of true positives
tn Number of true negatives
fp Number of false positives
fn Number of false negatives
f1_macro Macro-averaged f1 score
f1_micro Micro-averaged f1 score

When interpreting these statistics, consider whether your app and evaluation results fall into one of the cases below, and if so, apply the accompanying guideline. This list is basic, not exhaustive, but should get you started.

  • Classes are balanced – When the number of annotated entities for each entity type is comparable and each entity type is equally important, focusing on the accuracy metric is usually good enough. For entity recognition it is very unlikely that your data would fall into this category, since the O tag (used for words that are not part of an entity) usually occurs much more often than the I/B/E/S tags (for words that are part of an entity).
  • Classes are imbalanced – In this case, it’s important to take the f1 scores into account. For entity recognition it is also important to consider the segment-level statistics described below. By primarily optimizing for f1, your model will tend to predict no entity rather than predict one that it is uncertain about. See this blog post.
  • All f1 and accuracy scores are low — When entity recognition is performing poorly across all entity types, either of the following may be the problem: 1) You do not have enough training data for the model to learn, or 2) you need to tune your model hyperparameters. Look at segment-level statistics for a more intuitive breakdown of where the model is making errors.
  • f1 weighted is higher than f1 macro — This means that entity types with fewer evaluation examples are performing poorly. Try adding more data to these entity types. This entails adding more training queries with labeled entities, specifically entities of the type that are performing the worst as indicated in the tag-level statistics table.
  • f1 macro is higher than f1 weighted — This means that entity types with more evaluation examples are performing poorly. Verify that the number of evaluation examples reflects the class distribution of your training examples.
  • f1 micro is higher than f1 macro — This means that certain entity types are being misclassified more often than others. Identify the problematic entity types by checking the tag-level class-wise statistics below. Some entity types may be too similar to others, or you may need to add more training data.
  • Some classes are more important than others — If some entities are more important than others for your use case, it is best to focus especially on the tag-level class-wise statistics below.
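
The difference between the three f1 averages is easier to see on a toy example. The sketch below recomputes them from the standard definitions; it is not MindMeld's own code, and the function name and tag sequences are made up for illustration.

```python
from collections import Counter

def f1_averages(expected, predicted):
    """Return (macro, micro, weighted) f1 for two equal-length tag lists."""
    labels = sorted(set(expected) | set(predicted))
    support = Counter(expected)  # ground-truth count per tag
    pairs = list(zip(expected, predicted))
    per_tag = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(e == label and p == label for e, p in pairs)
        fp = sum(e != label and p == label for e, p in pairs)
        fn = sum(e == label and p != label for e, p in pairs)
        denom = 2 * tp + fp + fn
        per_tag[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_tag.values()) / len(labels)            # every tag counts equally
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)    # every token counts equally
    weighted = sum(per_tag[t] * support[t] for t in labels) / len(expected)
    return macro, micro, weighted

# Hypothetical data: the frequent O tag is always right,
# while the rare B|city tag is missed half of the time.
expected = ["O"] * 8 + ["B|city"] * 2
predicted = ["O"] * 8 + ["B|city", "O"]
macro, micro, weighted = f1_averages(expected, predicted)
print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

Here the weighted score comes out higher than the macro score, which is the pattern to expect when a tag with few evaluation examples is the one performing poorly.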

Tag-level statistics by class

Tag-level (IOB or IOBES) statistics that are calculated for each class:

class Entity tag (in IOB or IOBES format)
f_beta F-beta score
precision Precision
recall Recall
support Number of test entities with this entity tag (based on ground truth)
tp Number of true positives
tn Number of true negatives
fp Number of false positives
fn Number of false negatives

Confusion matrix

A confusion matrix where each row represents the number of instances in an actual class and each column represents the number of instances in a predicted class. This reveals whether the classifier tends to confuse two classes, i.e., mislabel one tag as another.
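
As a rough illustration of that layout, the following sketch builds a tag-level confusion matrix from two tag sequences. The helper name and the sample tags are hypothetical, and the code only mirrors the row and column convention described above rather than MindMeld's internal implementation.

```python
from collections import Counter

def confusion_matrix(expected, predicted):
    """Rows are actual tags, columns are predicted tags."""
    labels = sorted(set(expected) | set(predicted))
    counts = Counter(zip(expected, predicted))
    matrix = [[counts[(actual, guess)] for guess in labels] for actual in labels]
    return labels, matrix

# Hypothetical tags for a five-token query where the I|city tag was missed.
expected = ["O", "O", "B|city", "I|city", "O"]
predicted = ["O", "O", "B|city", "O", "O"]
labels, matrix = confusion_matrix(expected, predicted)
for label, row in zip(labels, matrix):
    print(f"{label:>8}", row)
```

The off-diagonal count in the I|city row and O column shows that the model labeled an entity token as a non-entity token.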

Segment-level statistics

Note

Currently, segment-level statistics cannot be generated for the IOBES tag scheme. They are only available for IOB.

Although it is useful to analyze tag-level statistics, they don’t tell the full story for entity recognition in an intuitive way. It helps to think of the entity recognizer as performing two tasks: 1) identifying the span of words that should be part of an entity, and 2) selecting the label for the identified entity. When the recognizer makes a mistake, it misidentifies either the label, the span boundary, or both.

Segment-level statistics capture the distribution of these error types across all the segments in a query.

A segment is either:

  • A continuous span of non-entity tokens, or
  • A continuous span of tokens that represents a single entity

For example, the query “I’ll have an {eggplant parm|dish} and some {breadsticks|dish} please” has five segments: “I’ll have an”, “eggplant parm”, “and some”, “breadsticks”, and “please”.
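
The segmentation in this example can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and IOB tags written as B|type, I|type, and O; the function name is hypothetical and MindMeld's own segmentation code may handle more cases.

```python
def iob_segments(tokens, tags):
    """Group tokens into (label, text) segments; label is None for non-entity spans."""
    segments = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            label = None
            # A non-entity token starts a new segment only after an entity.
            starts_new = not segments or segments[-1][0] is not None
        else:
            prefix, label = tag.split("|", 1)
            # B always opens an entity; I continues one of the same type.
            starts_new = prefix == "B" or not segments or segments[-1][0] != label
        if starts_new:
            segments.append((label, [token]))
        else:
            segments[-1][1].append(token)
    return [(label, " ".join(words)) for label, words in segments]

tokens = "I'll have an eggplant parm and some breadsticks please".split()
tags = ["O", "O", "O", "B|dish", "I|dish", "O", "O", "B|dish", "O"]
for label, text in iob_segments(tokens, tags):
    print(label, "->", text)
```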

The table below describes the segment-level statistics available in MindMeld.

Abbreviation Statistic Description
le Label error The classifier correctly predicts the existence of an entity and the span of that entity, but chooses the wrong label. For example, the classifier recognizes that ‘pad thai’ is an entity in the query ‘Order some pad thai’, but labels it as a restaurant entity instead of a dish entity.
be Boundary error The classifier correctly predicts the existence of an entity and its label but misclassifies its span. For example, the classifier predicts that ‘some pad thai’ is a dish entity instead of just ‘pad thai’ in the query ‘Order some pad thai’.
lbe Label-boundary error The classifier correctly predicts the existence of an entity, but gets both the label and the span wrong. For example, the classifier labels ‘some pad thai’ as an option in the query ‘Order some pad thai’. The option label is wrong (dish is correct), and the boundary is misplaced (because it includes the word ‘some’, which does not belong in the entity).
tp True positive The classifier correctly predicts an entity, its label, and its span.
tn True negative The classifier correctly predicts that a segment contains no entities. For example, the classifier predicts that the query ‘Hi there’ has no entities.
fp False positive The classifier predicts the existence of an entity that is not there. For example, the classifier predicts that ‘there’ is a dish entity in the query ‘Hi there’.
fn False negative The classifier fails to predict an entity that is present. For example, the classifier predicts no entity in the query ‘Order some pad thai’.
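
The definitions in the table can be expressed as a small decision function. The sketch below is a simplification, not MindMeld's implementation: it compares one expected entity with one overlapping predicted entity, each given as a hypothetical (start, end, label) triple of inclusive token indices, or None when no entity is present.

```python
def classify_segment(expected, predicted):
    """Return the segment-level statistic for one expected/predicted entity pair."""
    if expected is None and predicted is None:
        return "tn"
    if expected is None:
        return "fp"
    if predicted is None:
        return "fn"
    same_span = expected[:2] == predicted[:2]
    same_label = expected[2] == predicted[2]
    if same_span and same_label:
        return "tp"
    if same_span:
        return "le"   # right span, wrong label
    if same_label:
        return "be"   # right label, wrong span
    return "lbe"      # both wrong

# 'Order some pad thai': tokens 0..3, the dish entity covers tokens 2..3.
gold = (2, 3, "dish")
print(classify_segment(gold, (2, 3, "restaurant")))  # le
print(classify_segment(gold, (1, 3, "dish")))        # be
print(classify_segment(gold, (1, 3, "option")))      # lbe
print(classify_segment(gold, None))                  # fn
```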

Note that the true positive, true negative, false positive, and false negative values are different when calculated at a segment level rather than a tag level. To illustrate this difference consider the following example:

         I’ll  have  an      eggplant  parm    please
Exp:     O     O     O       B|dish    I|dish  O
Pred:    O     O     B|dish  I|dish    O       O

In the traditional tag-level statistics, predicting B|dish instead of O and predicting I|dish instead of B|dish would both be false positives. There would also be 3 true negatives for correctly predicting O.

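At the segment level, however, this would be just 2 true negatives (one for the segment ‘I’ll have’ and one for the segment ‘please’), and 1 boundary error (for the segment ‘an eggplant parm’): the predicted label, dish, is correct, but the predicted span ‘an eggplant’ does not match the expected span ‘eggplant parm’.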

Considering errors at a segment level is often more intuitive and may even provide better metrics to optimize against, as described here.

Sequence-level statistics

In MindMeld, we define sequence-level accuracy as the fraction of queries for which the entity recognizer successfully identified all the expected entities.
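
To make the definition concrete, here is a minimal sketch that computes this fraction from pairs of expected and predicted entities. It assumes that a query counts as correct only when the two entity tuples are exactly equal, and it represents each entity as a hypothetical (text, type) pair instead of a MindMeld object.

```python
def sequence_accuracy(pairs):
    """pairs: iterable of (expected_entities, predicted_entities), one per query."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(expected == predicted for expected, predicted in pairs)
    return correct / len(pairs)

# Illustrative data for four queries; two are fully correct.
pairs = [
    ((), ()),
    ((("miami", "city"),), (("miami", "city"),)),
    ((("taipei", "city"),), ()),
    ((("london", "city"),), ()),
]
print(sequence_accuracy(pairs))  # 0.5
```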

Now we have a wealth of information about the performance of our classifier. Let’s go further and inspect the classifier’s predictions at the level of individual queries, to better understand error patterns.

View the classifier predictions for the entire test set using the results attribute of the returned eval object. Each result is an instance of the EvaluatedExample class which contains information about the original input query, the expected ground truth label, the predicted label, and the predicted probability distribution over all the class labels.

eval.results
[
  EvaluatedExample(example=<Query 'check temperature outside'>, expected=(), predicted=(), probas=None, label_type='entities'),
  EvaluatedExample(example=<Query 'check temperature in miami'>, expected=(<QueryEntity 'miami' ('city') char: [21-25], tok: [3-3]>,), predicted=(<QueryEntity 'miami' ('city') char: [21-25], tok: [3-3]>,), probas=None, label_type='entities'),
  ...
]

Next, we look selectively at just the correct or incorrect predictions.

list(eval.correct_results())
[
  EvaluatedExample(example=<Query 'check temperature outside'>, expected=(), predicted=(), probas=None, label_type='entities'),
  EvaluatedExample(example=<Query 'check temperature in miami'>, expected=(<QueryEntity 'miami' ('city') char: [21-25], tok: [3-3]>,), predicted=(<QueryEntity 'miami' ('city') char: [21-25], tok: [3-3]>,), probas=None, label_type='entities'),
  ...
]
list(eval.incorrect_results())
[
  EvaluatedExample(example=<Query 'taipei current temperature'>, expected=(<QueryEntity 'taipei' ('city') char: [0-5], tok: [0-0]>,), predicted=(), probas=None, label_type='entities'),
  EvaluatedExample(example=<Query 'london weather'>, expected=(<QueryEntity 'london' ('city') char: [0-5], tok: [0-0]>,), predicted=(), probas=None, label_type='entities')
]

Slicing and dicing these results for error analysis is easily done with list comprehensions.

A simple example of this is to inspect incorrect predictions where the query’s first entity is supposed to be of a particular type. For the city type, we get:

[(r.example, r.expected, r.predicted) for r in eval.incorrect_results() if r.expected and r.expected[0].entity.type == 'city']
[
  (
    <Query 'taipei current temperature'>,
    (<QueryEntity 'taipei' ('city') char: [0-5], tok: [0-0]>,),
    ()
  ),
  (
    <Query 'london weather'>,
    (<QueryEntity 'london' ('city') char: [0-5], tok: [0-0]>,),
    ()
  ),
  (
    <Query 'temperature in san fran'>,
    (<QueryEntity 'san fran' ('city') char: [15-22], tok: [2-3]>,),
    (<QueryEntity 'san' ('city') char: [15-17], tok: [2-2]>,)
  ),
  (
    <Query "how's the weather in the big apple">,
    (<QueryEntity 'big apple' ('city') char: [25-33], tok: [5-6]>,),
    ()
  )
]

The entity recognizer was unable to correctly detect the full city entity in any of the above queries. This is usually a sign that the training data lacks coverage for queries with language patterns or entities like those in the examples above. It could also mean that the gazetteer for this entity type is not comprehensive enough.

Start by looking for similar queries in the training data. You should discover that the check_weather intent does indeed lack labeled training queries like the first two queries above.

To solve this problem, you could try adding more queries annotated with the city entity to the check_weather intent’s training data. Then, the recognition model should be able to generalize better.

The last two misclassified queries feature nicknames ('san fran' and 'the big apple') rather than formal city names. Noticing this, the logical step is to inspect the gazetteer data. You should discover that this gazetteer does indeed lack slang terms and nicknames for cities.

To mitigate this, try expanding the city gazetteer to contain entries like “San Fran”, “Big Apple” and other popular synonyms for location names that are relevant to the weather domain.

Error analysis on the results of the evaluate() method can inform your experimentation and help in building better models. Augmenting training data and adding gazetteer entries should be the first steps, as in the above example. Beyond that, you can experiment with different model types, features, and hyperparameters, as described earlier in this chapter.
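
When the list of incorrect predictions is long, a tally by entity type can show where to focus first. The sketch below relies only on the expected attribute and the entity.type attribute used in the list comprehension earlier; the function name is hypothetical, and the SimpleNamespace objects are stand-ins so the example runs without a trained model.

```python
from collections import Counter
from types import SimpleNamespace

def errors_by_expected_type(incorrect_results):
    """Count incorrect queries once per entity type they were expected to contain."""
    tally = Counter()
    for result in incorrect_results:
        types = {qe.entity.type for qe in result.expected} or {"<none expected>"}
        tally.update(types)
    return tally

def fake_result(*types):
    # Stand-in for an EvaluatedExample with the given expected entity types.
    expected = tuple(SimpleNamespace(entity=SimpleNamespace(type=t)) for t in types)
    return SimpleNamespace(expected=expected, predicted=())

sample = [fake_result("city"), fake_result("city", "sys_time"), fake_result()]
print(errors_by_expected_type(sample).most_common())
```

With a real evaluation, the same function could be called as errors_by_expected_type(eval.incorrect_results()).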

Viewing features extracted for entity recognition

While training a new model or investigating a misclassification by the classifier, it is sometimes useful to view the extracted features to make sure they are as expected. For example, there may be non-ASCII characters in the query that are treated differently by the feature extractors. Or the value assigned to a particular feature may be computed differently than you expected. Not extracting the right features could lead to misclassifications. In the example below, we view the features extracted for the query ‘set alarm for 7 am’ using the EntityRecognizer.view_extracted_features() method.

er.view_extracted_features("set alarm for 7 am")
[{'bag_of_words|length:1|word_pos:-1': '<$>', 'bag_of_words|length:1|word_pos:0': 'set', 'bag_of_words|length:1|word_pos:1': 'alarm', 'bag_of_words|length:2|word_pos:-1': '<$> set', 'bag_of_words|length:2|word_pos:0': 'set alarm', 'bag_of_words|length:2|word_pos:1': 'alarm for'},
 {'bag_of_words|length:1|word_pos:-1': 'set', 'bag_of_words|length:1|word_pos:0': 'alarm', 'bag_of_words|length:1|word_pos:1': 'for', 'bag_of_words|length:2|word_pos:-1': 'set alarm', 'bag_of_words|length:2|word_pos:0': 'alarm for', 'bag_of_words|length:2|word_pos:1': 'for 0'},
 {'bag_of_words|length:1|word_pos:-1': 'alarm', 'bag_of_words|length:1|word_pos:0': 'for', 'bag_of_words|length:1|word_pos:1': '0', 'bag_of_words|length:2|word_pos:-1': 'alarm for', 'bag_of_words|length:2|word_pos:0': 'for 0', 'bag_of_words|length:2|word_pos:1': '0 am', 'sys_candidate|type:sys_time|granularity:hour|pos:1': 1, 'sys_candidate|type:sys_time|granularity:hour|pos:1|log_len': 1.3862943611198906},
 {'bag_of_words|length:1|word_pos:-1': 'for', 'bag_of_words|length:1|word_pos:0': '0', 'bag_of_words|length:1|word_pos:1': 'am', 'bag_of_words|length:2|word_pos:-1': 'for 0', 'bag_of_words|length:2|word_pos:0': '0 am', 'bag_of_words|length:2|word_pos:1': 'am <$>', 'sys_candidate|type:sys_time|granularity:hour|pos:0': 1, 'sys_candidate|type:sys_time|granularity:hour|pos:0|log_len': 1.3862943611198906, 'sys_candidate|type:sys_time|granularity:hour|pos:1': 1, 'sys_candidate|type:sys_time|granularity:hour|pos:1|log_len': 1.3862943611198906},
 {'bag_of_words|length:1|word_pos:-1': '0', 'bag_of_words|length:1|word_pos:0': 'am', 'bag_of_words|length:1|word_pos:1': '<$>', 'bag_of_words|length:2|word_pos:-1': '0 am', 'bag_of_words|length:2|word_pos:0': 'am <$>', 'bag_of_words|length:2|word_pos:1': '<$> <$>', 'sys_candidate|type:sys_time|granularity:hour|pos:-1': 1, 'sys_candidate|type:sys_time|granularity:hour|pos:-1|log_len': 1.3862943611198906, 'sys_candidate|type:sys_time|granularity:hour|pos:0': 1, 'sys_candidate|type:sys_time|granularity:hour|pos:0|log_len': 1.3862943611198906}]

This is especially useful when you are writing custom feature extractors and want to verify that the right features are being extracted.
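
To help read the bag_of_words keys in the output above, the following sketch reproduces them for the same query. The windowing rule is inferred from that output, not taken from MindMeld's source: for each token, an n-gram of the given length starts at the offset word_pos, positions outside the query are padded with '<$>', and each digit is masked as '0' (only the single-digit case is visible above, so the masking rule is an assumption).

```python
import re

PAD = "<$>"

def bag_of_words_features(query, lengths=(1, 2), positions=(-1, 0, 1)):
    """Return one feature dict per token, keyed like the output above."""
    # Assumption: every digit is masked as '0' (matches '7' -> '0' above).
    tokens = [re.sub(r"\d", "0", tok) for tok in query.split()]

    def token_at(i):
        return tokens[i] if 0 <= i < len(tokens) else PAD

    features = []
    for i in range(len(tokens)):
        feats = {}
        for n in lengths:
            for pos in positions:
                ngram = " ".join(token_at(i + pos + k) for k in range(n))
                feats[f"bag_of_words|length:{n}|word_pos:{pos}"] = ngram
        features.append(feats)
    return features

for feats in bag_of_words_features("set alarm for 7 am"):
    print(feats)
```

The sys_candidate features in the real output come from system entity detection and are not covered by this sketch.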

Save model for future use

Save the trained entity recognizer for later use by calling the EntityRecognizer.dump() method. The dump() method serializes the trained model as a pickle file and saves it to the specified location on disk.

er.dump(model_path='experiments/entity_recognizer.pkl')
Saving entity recognizer: domain='weather', intent='check_weather'

You can load the saved model anytime using the EntityRecognizer.load() method.

er.load(model_path='experiments/entity_recognizer.pkl')
Loading entity recognizer: domain='weather', intent='check_weather'

More about system entities

System entities are generic application-agnostic entities that all MindMeld applications detect automatically. There is no need to train models to learn system entities; they just work.

Supported system entities are enumerated in the table below.

System Entity Examples
sys_time “today” , “Tuesday, Feb 18” , “last week” , “Mother’s day”
sys_interval “tomorrow morning” , “from 9:30 - 11:00 on tuesday” , “Friday 13th evening”
sys_duration “2 hours” , “half an hour” , “15 minutes”
sys_temperature “64°F” , “71° Fahrenheit” , “twenty seven celsius”
sys_number “fifteen” , “0.62” , “500k” , “66”
sys_ordinal “3rd” , “fourth” , “first”
sys_distance “10 miles” , “2feet” , “0.2 inches” , “3''” , “5km” , “12cm”
sys_volume “500 ml” , “5liters” , “2 gallons”
sys_amount-of-money “forty dollars” , “9 bucks” , “$30”
sys_email “help@cisco.com”
sys_url “washpo.com/info” , “foo.com/path/path?ext=%23&foo=bla” , “localhost”
sys_phone-number “+91 736 124 1231” , “+33 4 76095663” , “(626)-756-4757 ext 900”

MindMeld does not assume that your app needs any of the system entities. It treats a system entity as needed only if you annotate it in your training data.

Note

MindMeld defines sys_time and sys_interval as subtly different entities.
The sys_time entity connotes a value of a single unit of time, where the unit can be a date, an hour, a week, and so on. For example, “tomorrow” is a sys_time entity because it corresponds to a single (unit) date, like “2017-07-08.”

The sys_interval entity connotes a time interval that spans several units of time. For example, “tomorrow morning” is a sys_interval entity because “morning” corresponds to the span of hours from 4 am to 12 pm.

Custom entities, system entities, and training set size

Any application’s training set must focus on capturing all the entity variations and language patterns for the custom entities that the app uses. By contrast, the part of the training set concerned with system entities can be relatively minimal, because MindMeld does not need to train an entity recognition model to recognize system entities.

Annotating system entities

Assuming that you have defined the domain-intent-entity-role hierarchy for your app, you know

  • which system entities your app needs to use
  • what roles (if any) apply to those system entities

Use this knowledge to guide you in annotating any system entities in your training data.

These examples of annotated system entities come from the Home Assistant blueprint application:

- adjust the temperature to {65|sys_temperature}
- {in the morning|sys_interval} set the temperature to {72|sys_temperature}
- change my {6:45|sys_time|old_time} alarm to {7 am|sys_time|new_time}
- move my {6 am|sys_time|old_time} alarm to {3pm in the afternoon|sys_time|new_time}
- what's the forecast for {tomorrow afternoon|sys_interval}

For more examples, see the training data for any of the blueprint apps.
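
When auditing training files, it can help to pull the annotations out of each query programmatically. The sketch below handles only the flat {text|type} and {text|type|role} forms shown above; the regular expression and function name are illustrative assumptions, not MindMeld's own markup parser, and nested annotations are not handled.

```python
import re

# Matches {text|type} or {text|type|role}; nesting is not supported.
ANNOTATION = re.compile(r"\{([^{}|]+)\|([^{}|]+)(?:\|([^{}|]+))?\}")

def parse_annotations(query):
    """Return (plain_text, [(text, entity_type, role_or_None), ...])."""
    entities = [(m.group(1), m.group(2), m.group(3)) for m in ANNOTATION.finditer(query)]
    plain_text = ANNOTATION.sub(lambda m: m.group(1), query)
    return plain_text, entities

text, entities = parse_annotations(
    "change my {6:45|sys_time|old_time} alarm to {7 am|sys_time|new_time}"
)
print(text)
for entity in entities:
    print(entity)
```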

Inspecting how MindMeld detects system entities

To see which token spans in a query are detected as system entities, and what system entities MindMeld thinks they are, use the parse_numerics() function:

from mindmeld.ser import parse_numerics
parse_numerics("tomorrow morning at 9am")
([{'body': 'tomorrow morning',
   'dim': 'time',
   'end': 16,
   'latent': False,
   'start': 0,
   'value': {'from': {'grain': 'hour',
                      'value': '2019-01-12T04:00:00.000-08:00'},
             'to': {'grain': 'hour',
                    'value': '2019-01-12T12:00:00.000-08:00'},
             'type': 'interval'}},
   .
   .
   .
  {'body': '9am',
   'dim': 'time',
   'end': 23,
   'latent': False,
   'start': 20,
   'value': {'grain': 'hour',
             'type': 'value',
             'value': '2019-01-12T09:00:00.000-08:00'}}],
 200)

The parse_numerics() function returns a tuple. The first item is a list of dictionaries, each representing a token span that MindMeld has detected as a system entity; the second item is an HTTP status code. Spans can overlap when the same text could correspond to multiple system entities.

Significant keys and values within these inner dictionaries are shown in the table below.

Key Value Meaning or content
start Non-negative integer The start index of the entity
end Non-negative integer The end index of the entity
body Text The text of the detected entity
dim time , number , or another label The type of the numeric entity
latent Boolean False if the entity contains all necessary information to be an instance of that dimension, True otherwise. For example, ‘9AM’ has latent=False for the time dimension, but ‘9’ has latent=True for the amount-of-money dimension.
value Dictionary with ‘value’, ‘grain’, ‘type’ A dictionary of information about the entity. The ‘value’ key corresponds to the resolved value, the ‘grain’ key is the granularity of the resolved value, and the ‘type’ is either ‘value’ or ‘interval’.

This output is especially useful when debugging system entity behavior.
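
For debugging, a compact one-line-per-candidate view is often easier to scan than the raw dictionaries. The helper below is a hypothetical convenience function that reads only the keys listed in the table above; the sample list is copied from the example output so that the sketch runs without a Duckling service.

```python
def summarize_candidates(candidates, include_latent=False):
    """Return (start, end, body, dim, value_type) rows sorted by span."""
    rows = []
    for item in candidates:
        if item.get("latent") and not include_latent:
            continue  # skip candidates that lack information for their dimension
        rows.append((item["start"], item["end"], item["body"],
                     item["dim"], item["value"].get("type")))
    return sorted(rows, key=lambda row: row[:2])

# Two of the candidates from the example output above.
sample = [
    {"body": "9am", "dim": "time", "start": 20, "end": 23, "latent": False,
     "value": {"grain": "hour", "type": "value",
               "value": "2019-01-12T09:00:00.000-08:00"}},
    {"body": "tomorrow morning", "dim": "time", "start": 0, "end": 16, "latent": False,
     "value": {"type": "interval",
               "from": {"grain": "hour", "value": "2019-01-12T04:00:00.000-08:00"},
               "to": {"grain": "hour", "value": "2019-01-12T12:00:00.000-08:00"}}},
]
for row in summarize_candidates(sample):
    print(row)
```

With a live service, the first element of the tuple returned by parse_numerics() would be passed in place of the sample list.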

When MindMeld is unable to resolve a system entity

Two common mistakes when working with system entities are annotating an entity as the wrong type and labeling an unsupported token as an entity. In these cases, MindMeld will be unable to resolve the system entity.

Annotating a system entity as the wrong type

Because sys_interval and sys_time are so close in meaning, developers or annotation scripts sometimes use one in place of the other.

In the example below, both entities should be annotated as sys_time, but one was mislabeled as sys_interval:

change my {6:45|sys_interval|old_time} alarm to {7 am|sys_time|new_time}

MindMeld prints the following error during training:

Unable to load query: Unable to resolve system entity of type 'sys_interval' for '6:45'. Entities found for the following types ['sys_time']

The solution is to change the first entity to {6:45|sys_time|old_time}.

Unsupported tokens in system entities

Not all reasonable-sounding tokens are actually supported by a MindMeld system entity.

In the example below, the token “daily” is annotated as a sys_time entity:

set my alarm {daily|sys_time}

MindMeld prints the following error during training:

Unable to load query: Unable to resolve system entity of type 'sys_time' for 'daily'.

Possible solutions:

  1. Add a custom entity that supports the token in question. For example, a recurrence custom entity could support tokens like “daily”, “weekly”, and so on. The correctly-annotated query would be “set my alarm {daily|recurrence}”.
  2. Remove the entity label from tokens like “daily” and see if the app satisfactorily handles the queries anyway.
  3. Remove all queries that contain unsupported tokens like “daily” entirely from the training data.

Configuring system entities

System entity detection can be turned on or off at the application level. You might turn it off to reduce latency, or because your application has no system entities annotated. By default, MindMeld enables system entity recognition in all apps, using the Duckling numerical parser running locally on port 7151:

NLP_CONFIG = {
    'system_entity_recognizer': {
       'type': 'duckling',
       'url': 'http://localhost:7151/parse'
    }
}

To switch off system entity detection, specify an empty dictionary for the 'system_entity_recognizer' key:

NLP_CONFIG = {
    'system_entity_recognizer': {}
}
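
Because the configuration is ordinary Python, the two settings above can be selected at startup. The following is one possible sketch, not a MindMeld feature: the helper function and the MM_DISABLE_SYS_ENTITIES environment variable are hypothetical names chosen for illustration.

```python
import os

# Default settings copied from the configuration shown above.
DUCKLING_CONFIG = {
    'type': 'duckling',
    'url': 'http://localhost:7151/parse'
}

def build_nlp_config(enable_system_entities=True):
    """Return an NLP_CONFIG dict with system entity detection on or off."""
    recognizer = dict(DUCKLING_CONFIG) if enable_system_entities else {}
    return {'system_entity_recognizer': recognizer}

# Hypothetical switch: set MM_DISABLE_SYS_ENTITIES=1 to turn detection off.
NLP_CONFIG = build_nlp_config(os.environ.get('MM_DISABLE_SYS_ENTITIES') != '1')
```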