Working with the Role Classifier

The Role Classifier

  • is run as the fourth step in the natural language processing pipeline
  • is a machine-learned classification model that determines the target roles for entities in a given query
  • is trained per entity type, using all the labeled queries for a given intent, with labels derived from the role types annotated within the training queries

Every MindMeld app has one role classifier for every entity type with associated roles.

Note

This is an in-depth tutorial to work through from start to finish. Before you begin, read the Step-by-Step Guide, paying special attention to the Role Classification section.

Access a role classifier

Working with the natural language processor falls into two broad phases:

  • First, generate the training data for your app. App performance largely depends on having sufficient quantity and quality of training data. See Step 6.
  • Then, conduct experimentation in the Python shell.

When you are ready to begin experimenting, import the NaturalLanguageProcessor (NLP) class from the MindMeld nlp module and instantiate an object with the path to your MindMeld project.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor(app_path='home_assistant')
nlp
<NaturalLanguageProcessor 'home_assistant' ready: False, dirty: False>

Verify that the NLP has correctly identified all the domains and intents for your app.

nlp.domains
{
 'smart_home': <DomainProcessor 'smart_home' ready: False, dirty: False>,
 'times_and_dates': <DomainProcessor 'times_and_dates' ready: False, dirty: False>,
 'unknown': <DomainProcessor 'unknown' ready: False, dirty: False>,
 'weather': <DomainProcessor 'weather' ready: False, dirty: False>,
 'greeting': <DomainProcessor 'greeting' ready: False, dirty: False>
}
nlp.domains['times_and_dates'].intents
{
 'change_alarm': <IntentProcessor 'change_alarm' ready: True, dirty: True>,
 'check_alarm': <IntentProcessor 'check_alarm' ready: False, dirty: False>,
 'remove_alarm': <IntentProcessor 'remove_alarm' ready: False, dirty: False>,
 'set_alarm': <IntentProcessor 'set_alarm' ready: True, dirty: True>,
 'start_timer': <IntentProcessor 'start_timer' ready: True, dirty: True>,
 'stop_timer': <IntentProcessor 'stop_timer' ready: False, dirty: False>,
 'specify_time': <IntentProcessor 'specify_time' ready: False, dirty: False>
}
...
nlp.domains['weather'].intents
{
 'check_weather': <IntentProcessor 'check_weather' ready: False, dirty: False>
}

Note

Until the labeled training queries have been loaded, MindMeld is not aware of the different entity types for your app.

Use the build() method to load the training queries for an intent of your choice. This can take several minutes for intents with a large number of training queries. Once the build is complete, inspect the entity types.

nlp.domains['times_and_dates'].intents['change_alarm'].build()
nlp.domains['times_and_dates'].intents['change_alarm'].entities
{
 'sys_time': <EntityProcessor 'sys_time' ready: True, dirty: True>,
 'sys_interval': <EntityProcessor 'sys_interval' ready: True, dirty: True>
}

Access the RoleClassifier for an entity type of your choice, using the role_classifier attribute of the desired entity.

rc = nlp.domains['times_and_dates'].intents['change_alarm'].entities['sys_time'].role_classifier
rc
<RoleClassifier ready: True, dirty: True>

Train a role classifier

Use the RoleClassifier.fit() method to train a role classification model. Depending on the size of the training data, this can take anywhere from a few seconds to several minutes. With logging level set to INFO or below, you should see the build progress in the console along with cross-validation accuracy for the classifier.

from mindmeld import configure_logs; configure_logs()
from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor(app_path='home_assistant')
nlp.domains['times_and_dates'].intents['change_alarm'].build()

rc = nlp.domains['times_and_dates'].intents['change_alarm'].entities['sys_time'].role_classifier
rc.fit()
Fitting role classifier: domain='times_and_dates', intent='change_alarm', entity_type='sys_time'
No role model configuration set. Using default.

The fit() method loads all necessary training queries and trains a role classification model. When called with no arguments (as in the example above), the method uses the settings from config.py, the app's configuration file. If config.py is not defined, the method uses the MindMeld preset classifier configuration.

Using default settings is the recommended (and quickest) way to get started with any of the NLP classifiers. The resulting baseline classifier should provide a reasonable starting point from which to bootstrap your machine learning experimentation. You can then try alternate settings as you seek to identify the optimal classifier configuration for your app.

Classifier configuration

Use the config attribute of a trained classifier to view the configuration that the classifier is using. Here's an example where we view the configuration of a role classifier trained using default settings:

rc.config.to_dict()
{
  'features': {
    'bag-of-words-after': {
      'ngram_lengths_to_start_positions': {1: [0, 1], 2: [0, 1]}
    },
    'bag-of-words-before': {
      'ngram_lengths_to_start_positions': {1: [-2, -1], 2: [-2, -1]}
    },
    'in-gaz': {},
    'other-entities': {}
  },
  'model_settings': {'classifier_type': 'logreg'},
  'model_type': 'text',
  'param_selection': None,
  'params': {'C': 100, 'penalty': 'l1'}
}

Let's take a look at the allowed values for each setting in a role classifier configuration.

  1. Model Settings
'model_type' (str)

Always 'text', since role classification is a text classification model.

'model_settings' (dict)

Always a dictionary with the single key 'classifier_type', whose value specifies the machine learning model to use. Allowed values are shown in the table below.

Value      Classifier               Reference for configurable hyperparameters
'logreg'   Logistic regression      sklearn.linear_model.LogisticRegression
'svm'      Support vector machine   sklearn.svm.SVC
'dtree'    Decision tree            sklearn.tree.DecisionTreeClassifier
'rforest'  Random forest            sklearn.ensemble.RandomForestClassifier
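
For example, to experiment with one of the other model types, you can override the classifier type when fitting. The line below is a minimal sketch; it assumes the role classifier's fit() accepts the same keyword overrides as the other MindMeld classifiers, and the max_depth value is illustrative only.

rc.fit(model_settings={'classifier_type': 'dtree'}, params={'max_depth': 5})
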
  2. Feature Extraction Settings
'features' (dict)

A dictionary whose keys are names of feature groups to extract. The corresponding values are dictionaries representing the feature extraction settings for each group. The table below enumerates the features that can be used for role classification.

Group Name Description
'bag-of-words-after'

Generates n-grams of specified lengths from the query text following the current entity.

Settings:

A dictionary with n-gram lengths as keys and a list of different starting positions as values. Each starting position is a token index, relative to the start of the current entity span.

Examples:

'ngram_lengths_to_start_positions': {1: [0], 2: [0]}
  • extracts all words (unigrams) and bigrams starting with the first word of the current entity span
'ngram_lengths_to_start_positions': {1: [0, 1], 2: [0, 1]}
  • additionally includes unigrams and bigrams starting from the word after the current entity's first token

Given the query "Change my {6 AM|sys_time|old_time} alarm to {7 AM|sys_time|new_time}" and a classifier extracting features for the "6 AM" sys_time entity:

{1: [0, 1]}
  • extracts "6" and "AM"
{2: [0, 1]}
  • extracts "6 AM" and "AM alarm"
'bag-of-words-before'

Generates n-grams of specified lengths from the query text preceding the current entity.

Settings:

A dictionary with n-gram lengths as keys and a list of different starting positions as values, similar to the 'bag-of-words-after' feature group.

Examples:

Given the query "Change my {6 AM|sys_time|old_time} alarm to {7 AM|sys_time|new_time}" and a classifier extracting features for the "6 AM" sys_time entity:

{1: [-2, -1]}
  • extracts "change" and "my"
{2: [-2, -1]}
  • extracts "change my" and "my 6"
'in-gaz' Generates a set of features indicating the presence of query n-grams in different entity gazetteers, along with popularity information as defined in the gazetteer.
'numeric' Generates a set of features indicating the presence of numeric entities in the query extracted by the numerical parser. These numeric entities include only time and interval entities and are labelled as sys_time and sys_interval.
'other-entities' Encodes information about the other entities present in the query, apart from the current one.
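
For instance, a features dictionary that combines several of these groups might look like the sketch below; the 'in-gaz' and 'other-entities' groups take empty settings dictionaries, as in the default configuration shown earlier.

my_features = {
    'bag-of-words-before': {'ngram_lengths_to_start_positions': {1: [-2, -1], 2: [-2, -1]}},
    'bag-of-words-after': {'ngram_lengths_to_start_positions': {1: [0, 1], 2: [0, 1]}},
    'in-gaz': {},
    'other-entities': {}
}
rc.fit(features=my_features)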

Note

To define your own features or custom versions of these in-built features, see Working with User-Defined Features.

  3. Hyperparameter Settings
'params' (dict)

A dictionary of values to be used for model hyperparameters during training. Examples include the 'kernel' parameter for SVM, 'penalty' for logistic regression, 'max_depth' for decision tree, and so on. The list of allowable hyperparameters depends on the model selected. See the reference links above for parameter lists.
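
For example, to train a support vector machine with explicitly chosen hyperparameters, you could pass both settings to fit(). This is a minimal sketch; the parameter names come from the scikit-learn SVC documentation and the values are illustrative only.

rc.fit(model_settings={'classifier_type': 'svm'}, params={'kernel': 'rbf', 'C': 10})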

'param_selection' (dict)

Is a dictionary containing the settings for hyperparameter selection. This is used as an alternative to the 'params' dictionary above if the ideal hyperparameters for the model are not already known and need to be estimated.

MindMeld needs two pieces of information from the developer to do parameter estimation:

  1. The parameter space to search, captured by the value for the 'grid' key
  2. The strategy for splitting the labeled data into training and validation sets, specified by the 'type' key

Depending on the splitting scheme selected, the param_selection dictionary can contain other keys that define additional settings. The table below enumerates all the keys allowed in the dictionary.

Key Value
'grid'

A dictionary mapping each hyperparameter to a list of potential values to be searched. Here is an example grid for a logistic regression model:

{
  'penalty': ['l1', 'l2'],
  'C': [10, 100, 1000, 10000, 100000],
  'fit_intercept': [True, False]
}

See the reference links above for details on the hyperparameters available for each model.

'type'

The cross-validation methodology to use, for example:

  • 'k-fold': k-fold cross-validation
  • 'shuffle': randomized folds

'k'

Number of folds (splits)

To identify the parameters that give the highest accuracy, the fit() method does an exhaustive grid search over the parameter space, evaluating candidate models using the specified cross-validation strategy. Subsequent calls to fit() can use these optimal parameters and skip the parameter selection process.
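
For instance, once a grid search has identified good values, later training runs can pass those values directly and skip the search. A minimal sketch (the selected values shown are hypothetical, and it assumes fit() accepts a params argument just like the other MindMeld classifiers):

# First run: search the grid using the settings described above
rc.fit(param_selection={'grid': {'C': [1, 10, 100], 'penalty': ['l1', 'l2']}, 'type': 'k-fold', 'k': 10})
# Later runs: reuse the selected hyperparameters and skip the search
rc.fit(params={'C': 1, 'penalty': 'l2'})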

  4. Custom Train/Test Settings
'train_label_set' (str)

A string representing a regex pattern that selects all training files for role model training with filenames that match the pattern. The default regex when this key is not specified is 'train.*\.txt'.

'test_label_set' (str)

A string representing a regex pattern that selects all evaluation files for role model testing with filenames that match the pattern. The default regex when this key is not specified is 'test.*\.txt'.
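
For example, if your evaluation files follow a different naming convention, you could point the role classifier at them from the app configuration. A minimal sketch (the 'eval.*\.txt' pattern is hypothetical; the remaining keys mirror the default configuration shown earlier):

ROLE_CLASSIFIER_CONFIG = {
    'model_type': 'text',
    'model_settings': {'classifier_type': 'logreg'},
    'params': {'C': 100, 'penalty': 'l1'},
    # ... plus the 'features' settings shown in the full example below ...
    'train_label_set': r'train.*\.txt',
    'test_label_set': r'eval.*\.txt'
}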

Training with custom configurations

To override MindMeld's default role classifier configuration with custom settings, you can either edit the app configuration file or call the fit() method with the appropriate arguments.

1. Application configuration file

When you define custom classifier settings in config.py, the RoleClassifier.fit() and NaturalLanguageProcessor.build() methods use those settings instead of MindMeld's defaults. To do this, define a dictionary of your custom settings, named ROLE_CLASSIFIER_CONFIG.

Here's an example of a config.py file where custom settings optimized for the app override the preset configuration for the role classifier.

ROLE_CLASSIFIER_CONFIG = {
    'model_type': 'text',
    'model_settings': {'classifier_type': 'logreg'},
    'params': {
        'C': 10,
        'penalty': 'l2'
    },
    'features': {
        'bag-of-words-before': {
            'ngram_lengths_to_start_positions': {
                1: [-2, -1],
                2: [-2, -1]
            }
        },
        'bag-of-words-after': {
            'ngram_lengths_to_start_positions': {
                1: [0, 1],
                2: [0, 1]
            }
        },
        'other-entities': {}
    }
}

Settings defined in ROLE_CLASSIFIER_CONFIG apply to role classifiers across all entity types in your application. For finer-grained control, you can implement the get_role_classifier_config() function in config.py to specify suitable configurations for each entity. This gives you the flexibility to have customized configurations for different role classifiers based on the domain, intent, and entity type.

import copy

def get_role_classifier_config(domain, intent, entity):
    SPECIAL_CONFIG = copy.deepcopy(ROLE_CLASSIFIER_CONFIG)
    if domain == 'times_and_dates' and intent == 'change_alarm' and entity == 'sys_time':
        SPECIAL_CONFIG['params']['penalty'] = 'l1'
    return SPECIAL_CONFIG

Using config.py is recommended for storing your optimal classifier settings once you have identified them through experimentation. Then the classifier training methods will use the optimized configuration to rebuild the models. A common use case is retraining models on newly-acquired training data, without retuning the underlying model settings.

Since this method requires updating a file each time you modify a setting, it's less suitable for rapid prototyping than the method described next.

2. Arguments to the fit() method

For experimenting with the role classifier, the recommended method is to use arguments to the fit() method. The main areas for exploration are feature extraction and hyperparameter tuning.

Feature extraction

View the default feature set, as seen in the baseline classifier that we trained earlier. Notice that the 'ngram_lengths_to_start_positions' settings tell the classifier to extract n-grams within a context window of at most two tokens around the entity of interest, that is, to look only at words in the immediate vicinity.

my_features = rc.config.features
my_features
{
  'bag-of-words-after': {'ngram_lengths_to_start_positions': {1: [0, 1], 2: [0, 1]}},
  'bag-of-words-before': {'ngram_lengths_to_start_positions': {1: [-2, -1], 2: [-2, -1]}},
  'other-entities': {}
}

Next, have the classifier look at a larger context window, and extract n-grams starting from tokens that are further away. We'll see whether that provides better information than the smaller default window. Do this by changing the 'ngram_lengths_to_start_positions' settings to extract all the unigrams and bigrams in a window of three tokens around the start of the current entity, as shown below.

my_features['bag-of-words-after']['ngram_lengths_to_start_positions'] = {
    1: [0, 1, 2, 3],
    2: [0, 1, 2]
}
my_features['bag-of-words-before']['ngram_lengths_to_start_positions'] = {
    1: [-3, -2, -1],
    2: [-3, -2, -1]
}
my_features
{
  'bag-of-words-after': {'ngram_lengths_to_start_positions': {1: [0, 1, 2, 3], 2: [0, 1, 2]}},
  'bag-of-words-before': {'ngram_lengths_to_start_positions': {1: [-3, -2, -1], 2: [-3, -2, -1]}},
  'other-entities': {}
}

Suppose w_i represents the word at the i-th index in the query, where the index is calculated relative to the start of the current entity span (so w_0 is the first token of the current entity). Then, the above feature configuration should extract the following n-grams.

  • Unigrams: { w_-3, w_-2, w_-1, w_0, w_1, w_2, w_3 }
  • Bigrams: { w_-3 w_-2, w_-2 w_-1, w_-1 w_0, w_0 w_1, w_1 w_2, w_2 w_3 }

Retrain the classifier with the updated feature set by passing in the my_features dictionary as an argument to the features parameter of the fit() method. This applies our new feature extraction settings, while retaining the MindMeld defaults for model and classifier types (logreg) and hyperparameter selection.

rc.fit(features=my_features)
Fitting role classifier: domain='times_and_dates', intent='change_alarm', entity_type='sys_time'
No app configuration file found. Using default role model configuration

Hyperparameter tuning

View the model's hyperparameters, keeping in mind the hyperparameters for logistic regression, the default model for role classification in MindMeld. These include 'C', the inverse of regularization strength, and 'penalty', the norm used in penalization.

my_params = rc.config.params
my_params
{'C': 100, 'penalty': 'l1'}

Instead of relying on the default preset values for 'C' and 'penalty', let's specify a parameter search grid to let MindMeld select ideal values for the dataset. We'll also specify a cross-validation strategy. Update the parameter selection settings such that the hyperparameter estimation process chooses the ideal 'C' and 'penalty' parameters using 10-fold cross-validation:

search_grid = {
  'C': [1, 10, 100, 1000],
  'penalty': ['l1', 'l2']
}
my_param_settings = {
  'grid': search_grid,
  'type': 'k-fold',
  'k': 10
}

Pass the updated settings to fit() as an argument to the param_selection parameter. The fit() method then searches over the updated parameter grid, and prints the hyperparameter values for the model whose 10-fold cross-validation accuracy is highest.

rc.fit(param_selection=my_param_settings)
Fitting role classifier: domain='times_and_dates', intent='change_alarm', entity_type='sys_time'
No app configuration file found. Using default role model configuration
Selecting hyperparameters using k-fold cross validation with 10 splits
Best accuracy: 96.59%, params: {'C': 1, 'penalty': 'l2'}

Now we'll try a different cross-validation strategy: five randomized folds. Modify the values of the 'k' and 'type' keys in my_param_settings, and call fit() to see whether accuracy improves:

my_param_settings['k'] = 5
my_param_settings['type'] = 'shuffle'
my_param_settings
{
 'grid': {
           'C': [1, 10, 100, 1000],
           'penalty': ['l1', 'l2']
         },
 'k': 5,
 'type': 'shuffle'
}
rc.fit(param_selection=my_param_settings)
Fitting role classifier: domain='times_and_dates', intent='change_alarm', entity_type='sys_time'
No app configuration file found. Using default role model configuration
Selecting hyperparameters using shuffle cross validation with 5 splits
Best accuracy: 97.78%, params: {'C': 1, 'penalty': 'l2'}

For a list of configurable hyperparameters and cross-validation methods, see hyperparameter settings above.

Run the role classifier

Before you run the trained role classifier on a test query, you must first detect all the entities in the query using a trained entity recognizer:

query = 'Change my 6 AM alarm to 7 AM'
er = nlp.domains['times_and_dates'].intents['change_alarm'].entity_recognizer
entities = er.predict(query)
entities
(<QueryEntity '6 AM' ('sys_time') char: [10-13], tok: [2-3]>,
 <QueryEntity '7 AM' ('sys_time') char: [24-27], tok: [6-7]>)

Now you can choose an entity from among those detected, and call the role classifier's RoleClassifier.predict() method to classify it. Although it classifies a single entity, the RoleClassifier.predict() method uses the full query text, and information about all its entities, for feature extraction.

Run the trained role classifier on the two entities from the example above, one by one. The predict() method returns the label for the role whose predicted probability is highest.

rc.predict(query, entities, 0)
'old_time'
rc.predict(query, entities, 1)
'new_time'

Note

At runtime, the natural language processor's process() method calls RoleClassifier.predict() to predict roles for all detected entities in the incoming query.
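
For example, you can see the assigned roles in the output of the full pipeline. The snippet below is a sketch; it assumes all NLP models have been built, and the exact shape of the response dictionary may vary across MindMeld versions.

nlp.build()
response = nlp.process('Change my 6 AM alarm to 7 AM')
for entity in response['entities']:
    print(entity['text'], entity['type'], entity.get('role'))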

We want to know how confident our trained model is in its prediction. To view the predicted probability distribution over all possible role labels, use the RoleClassifier.predict_proba() method. This is useful both for experimenting with classifier settings and for debugging classifier performance.

The result is a list of tuples whose first element is the role label and whose second element is the associated classification probability. The tuples are ordered from the most likely role to the least likely.

rc.predict_proba(query, entities, 0)
[('old_time', 0.9998281252873086), ('new_time', 0.00017187471269142218)]
rc.predict_proba(query, entities, 1)
[('new_time', 0.9999960507734881), ('old_time', 3.949226511944386e-06)]

An ideal classifier would assign a high probability to the expected (correct) class label for a test query, while assigning very low probabilities to incorrect labels.

The predict() and predict_proba() methods operate on one entity at a time. Next, we'll see how to test a trained model on a batch of labeled test queries.

Evaluate classifier performance

Before you can evaluate the accuracy of your trained role classifier, you must first create labeled test data and place it in your MindMeld project as described in the Natural Language Processor chapter.

Then, when you are ready, use the RoleClassifier.evaluate() method, which

  • strips away all ground truth annotations from the test queries,
  • passes the resulting unlabeled queries to the trained role classifier for prediction, and
  • compares the classifier's output predictions against the ground truth labels to compute the model's prediction accuracy.

In the example below, the model gets 20 out of 21 test queries correct, resulting in an accuracy of about 95%.

rc.evaluate()
Loading queries from file times_and_dates/change_alarm/test.txt
<StandardModelEvaluation score: 95.24%, 20 of 21 examples correct>

The aggregate accuracy score we see above is only the beginning, because the evaluate() method returns a rich object containing overall statistics, statistics by class, and a confusion matrix.

Print all the model performance statistics reported by the evaluate() method:

eval = rc.evaluate()
eval.print_stats()
Overall statistics:

    accuracy f1_weighted          tp          tn          fp          fn    f1_macro    f1_micro
       0.952       0.952          20          20           1           1       0.952       0.952



Statistics by class:

               class      f_beta   precision      recall     support          tp          tn          fp          fn
             old_time       0.957       0.917       1.000          11          11           9           1           0
             new_time       0.947       1.000       0.900          10           9          11           0           1



Confusion matrix:

                       old_time        new_time
        old_time             11              0
        new_time              1              9

The eval.get_stats() method returns all the above statistics in a structured dictionary without printing them to the console.
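
For programmatic analysis, you can read values out of that dictionary directly. A minimal sketch (the key names used below, such as 'stats_overall', are assumptions; inspect the returned dictionary to confirm its structure in your version of MindMeld):

stats = eval.get_stats()
sorted(stats.keys())
stats['stats_overall']['accuracy']   # aggregate accuracy, assuming this key layout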

Let's decipher the statistics output by the evaluate() method.

Overall Statistics

Aggregate stats measured across the entire test set:

accuracy Classification accuracy score
f1_weighted Class-weighted average f1 score
tp Number of true positives
tn Number of true negatives
fp Number of false positives
fn Number of false negatives
f1_macro Macro-averaged f1 score
f1_micro Micro-averaged f1 score

When interpreting these statistics, consider whether your app and evaluation results fall into one of the cases below, and if so, apply the accompanying guideline. This list is basic, not exhaustive, but should get you started.

  • Classes are balanced — When the number of annotations for each role is comparable and each role is equally important, focusing on the accuracy metric is usually good enough.
  • Classes are imbalanced — In this case, it's important to take the f1 scores into account.
  • All f1 and accuracy scores are low — When role classification is performing poorly across all roles, either of the following may be the problem: 1) You do not have enough training data for the model to learn, or 2) you need to tune your model hyperparameters.
  • f1 weighted is higher than f1 macro — This means that roles with fewer evaluation examples are performing poorly. Try adding more data to these roles.
  • f1 macro is higher than f1 weighted — This means that roles with more evaluation examples are performing poorly. Verify that the number of evaluation examples reflects the class distribution of your training examples.
  • f1 micro is higher than f1 macro — This means that some roles are being misclassified more often than others. Identify the problematic roles by checking the class-wise statistics below. Some roles may be too similar to others, or you may need to add more training data to some roles.
  • Some classes are more important than others — If some roles are more important than others for your use case, it is best to focus especially on the class-wise statistics described below.

Class-wise Statistics

Stats computed at a per-class level:

class Role label
f_beta F-beta score
precision Precision
recall Recall
support Number of test entities with this role (based on ground truth)
tp Number of true positives
tn Number of true negatives
fp Number of false positives
fn Number of false negatives

Confusion Matrix

A confusion matrix where each row represents the number of instances in an actual class and each column represents the number of instances in a predicted class. This reveals whether the classifier tends to confuse two classes, i.e., mislabel one class as another. In the above example, the role classifier wrongly classified one instance of a new_time entity as old_time.

Now we have a wealth of information about the performance of our classifier. Let's go further and inspect the classifier's predictions at the level of individual queries, to better understand error patterns.

View the classifier predictions for the entire test set using the results attribute of the returned eval object. Each result is an instance of the EvaluatedExample class, which contains information about the original input query, the expected ground truth label, the predicted label, and the predicted probability distribution over all the class labels.

eval.results
[
  EvaluatedExample(example=(<Query 'change my 6 am alarm'>, (<QueryEntity '6 am' ('sys_time') char: [10-13], tok: [2-3]>,), 0), expected='old_time', predicted='old_time', probas={'new_time': 0.10062246873286373, 'old_time': 0.89937753126713627}, label_type='class'),
  EvaluatedExample(example=(<Query 'change my 6 am alarm to 7 am'>, (<QueryEntity '6 am' ('sys_time') char: [10-13], tok: [2-3]>, <QueryEntity '7 am' ('sys_time') char: [24-27], tok: [6-7]>), 0), expected='old_time', predicted='old_time', probas={'new_time': 0.028607105880949835, 'old_time': 0.97139289411905017}, label_type='class'),
 ...
]

Next, we look selectively at just the correct or incorrect predictions.

list(eval.correct_results())
[
  EvaluatedExample(example=(<Query 'change my 6 am alarm'>, (<QueryEntity '6 am' ('sys_time') char: [10-13], tok: [2-3]>,), 0), expected='old_time', predicted='old_time', probas={'new_time': 0.10062246873286373, 'old_time': 0.89937753126713627}, label_type='class'),
  EvaluatedExample(example=(<Query 'change my 6 am alarm to 7 am'>, (<QueryEntity '6 am' ('sys_time') char: [10-13], tok: [2-3]>, <QueryEntity '7 am' ('sys_time') char: [24-27], tok: [6-7]>), 0), expected='old_time', predicted='old_time', probas={'new_time': 0.028607105880949835, 'old_time': 0.97139289411905017}, label_type='class'),
 ...
]
list(eval.incorrect_results())
[
  EvaluatedExample(example=(<Query 'replace the 8 am alarm with a 10 am alarm'>, (<QueryEntity '8 am' ('sys_time') char: [12-15], tok: [2-3]>, <QueryEntity '10 am' ('sys_time') char: [30-34], tok: [7-8]>), 1), expected='new_time', predicted='old_time', probas={'new_time': 0.48770513415754235, 'old_time': 0.51229486584245765}, label_type='class')
]

Slicing and dicing these results for error analysis is easily done with list comprehensions.

Our example dataset is fairly small, and we get just one case of misclassification. But for a real-world app with a large test set, we'd need to be able to inspect incorrect predictions for a particular role. Try this using the new_time role from our example:

[(r.example, r.probas) for r in eval.incorrect_results() if r.expected == 'new_time']
[
  (
    (
      <Query 'replace the 8 am alarm with a 10 am alarm'>,
      (<QueryEntity '8 am' ('sys_time') char: [12-15], tok: [2-3]>, <QueryEntity '10 am' ('sys_time') char: [30-34], tok: [7-8]>),
      1
    ),
    {
      'new_time': 0.48770513415754235,
      'old_time': 0.51229486584245765
    }
  )
]

Next, we use a list comprehension to identify the kind of queries that the current training data might lack. To do this, we list all queries with a given role where the classifier's confidence for the true label was relatively low. We'll demonstrate this with the new_time role and a confidence of <60%.

[(r.example, r.probas) for r in eval.results if r.expected == 'new_time' and r.probas['new_time'] < .6]
[
  (
    (
      <Query 'replace the 8 am alarm with a 10 am alarm'>,
      (<QueryEntity '8 am' ('sys_time') char: [12-15], tok: [2-3]>, <QueryEntity '10 am' ('sys_time') char: [30-34], tok: [7-8]>),
      1
    ),
    {
      'new_time': 0.48770513415754235,
      'old_time': 0.51229486584245765
    }
  ),
  (
    (
      <Query 'cancel my 6 am and replace it with a 6:30 am alarm'>,
      (<QueryEntity '6 am' ('sys_time') char: [10-13], tok: [2-3]>, <QueryEntity '6:30 am' ('sys_time') char: [37-43], tok: [9-10]>),
      1
    ),
    {
      'new_time': 0.5872536946800766,
      'old_time': 0.41274630531992335
    }
  )
]

For both of these results, the classifier's prediction probability for the 'new_time' role was fairly low. The classifier got one of them wrong, and barely got the other one right with a confidence of about 59%.

Try looking at the training data. You should discover that the new_time role does indeed lack labeled training queries like the ones above.

One potential solution is to add more training queries for the new_time role, so the classification model can generalize better.

Error analysis on the results of the evaluate() method can inform your experimentation and help in building better models. Augmenting training data should be the first step, as in the above example. Beyond that, you can experiment with different model types, features, and hyperparameters, as described earlier in this chapter.

View features extracted for classification

While training a new model or investigating a misclassification by the classifier, it is sometimes useful to view the extracted features to make sure they are as expected. For example, there may be non-ASCII characters in the query that are treated differently by the feature extractors. Or the value assigned to a particular feature may be computed differently than you expected. Not extracting the right features could lead to misclassifications. In the example below, we view the features extracted for the query 'set alarm for 7 am' using the RoleClassifier.view_extracted_features() method.

rc.view_extracted_features("set alarm for 7 am", entities, 0)
{'bag_of_words|ngram_before|length:1|pos:-2': 'alarm',
 'bag_of_words|ngram_before|length:1|pos:-1': 'for',
 'bag_of_words|ngram_before|length:2|pos:-2': 'alarm for',
 'bag_of_words|ngram_before|length:2|pos:-1': 'for 7',
 'bag_of_words|ngram_after|length:1|pos:0': 'am',
 'bag_of_words|ngram_after|length:1|pos:1': '<$>',
 'bag_of_words|ngram_after|length:2|pos:0': 'am <$>',
 'bag_of_words|ngram_after|length:2|pos:1': '<$> <$>'}

This is especially useful when you are writing custom feature extractors to inspect whether the right features are being extracted.

Save model for future use

Save the trained role classifier for later use by calling the RoleClassifier.dump() method. The dump() method serializes the trained model as a pickle file and saves it to the specified location on disk.

rc.dump(model_path='experiments/role_classifier.maxent.20170701.pkl')
Saving role classifier: domain='times_and_dates', intent='change_alarm', entity_type='sys_time'

You can load the saved model anytime using the RoleClassifier.load() method.

rc.load(model_path='experiments/role_classifier.maxent.20170701.pkl')
Loading role classifier: domain='times_and_dates', intent='change_alarm', entity_type='sys_time'