Step 7: Train the Natural Language Processing Classifiers

The Natural Language Processor (NLP) in MindMeld is tasked with understanding the user's natural language input. It analyzes the input using a hierarchy of classification models. Each model assists the next tier of models by narrowing the problem scope, or in other words successively narrowing down the 'solution space.'

As introduced in Step 3, MindMeld applies four layers of classifiers in the following order:

  1. Domain Classifier classifies input into one of a pre-defined set of conversational domains. Only necessary for apps that handle conversations across varied topics, each with its own specialized vocabulary.
  2. Intent Classifiers determine what the user is trying to accomplish by assigning each input to one of the intents defined for your application.
  3. Entity Recognizers extract the words and phrases, or entities, that are required to fulfill the user's end goal.
  4. Role Classifiers assign a differentiating label, called a role, to the extracted entities. This level of categorization is only necessary where an entity of a particular type can have multiple meanings depending on the context.

Note

The code examples in this chapter assume that you have installed the Kwik-E-Mart and Home Assistant blueprint applications. See the blueprints overview page for details on installing the apps.

To train the NLP classifiers for our Kwik-E-Mart store information app, we must first gather the necessary training data as described in Step 6. Once the data is ready, we open a Python shell and start building the components of our natural language processor.

cd $MM_APP_ROOT
python

In the Python shell, the quickest way to train all the NLP classifiers together is to use the nlp.build() method.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')
nlp.build()

This method trains all models in the specified NLP pipeline. The Natural Language Processor automatically infers which classifiers need to be trained based on the directory structure and the annotations in the training data. In our case, the NLP will train an intent classifier for the store_info domain and entity recognizers for each intent that contains labeled queries with entity annotations. Domain classification and role classification models will not be built because our simple example did not include training data for them.

To run all of the trained models in the NLP pipeline, use the nlp.process() command.

nlp.process('When does Elm Street close?')
{'text': 'When does Elm Street close?',
 'domain': 'store_info',
 'intent': 'get_store_hours',
 'entities': [{'text': 'Elm Street',
   'type': 'store_name',
   'role': None,
   'value': [{'cname': '23 Elm Street',
     'score': 44.777046,
     'top_synonym': 'Elm Street',
     'id': '1'},
    {'cname': '104 First Street',
     'score': 7.0927515,
     'top_synonym': '104 First Street',
     'id': '5'},
    {'cname': 'East Oak Street',
     'score': 7.0927515,
     'top_synonym': 'East Oak Street',
     'id': '12'},
    {'cname': '257th Street',
     'score': 6.958622,
     'top_synonym': '257th Street',
     'id': '18'},
    {'cname': 'D Street',
     'score': 6.7008686,
     'top_synonym': 'D Street',
     'id': '19'},
    {'cname': '181st Street',
     'score': 6.630241,
     'top_synonym': '181st Street',
     'id': '17'},
    {'cname': 'West Oak Street',
     'score': 6.249679,
     'top_synonym': 'West Oak Street',
     'id': '11'},
    {'cname': '156th Street',
     'score': 6.1613703,
     'top_synonym': '156th Street',
     'id': '15'},
    {'cname': 'Peanut Street',
     'score': 6.1613703,
     'top_synonym': 'Peanut Street',
     'id': '20'},
    {'cname': 'Little Italy Store',
     'score': 5.2708626,
     'top_synonym': 'Third Street',
     'id': '7'}],
   'span': {'start': 10, 'end': 19}}
   ]
}

The nlp.process() command returns detailed information about the output of each of the trained NLP models. See the User Guide for details.

By default, the build() method shown above uses the baseline machine learning settings for all classifiers, which should train reasonable models in most cases. To further improve model performance, MindMeld provides extensive capabilities for optimizing individual model parameters and measuring results. We'll next explore how to experiment with different settings for each NLP component individually.

Domain Classification

The domain classifier (also called the domain model) is a text classification model that is trained using the labeled queries across all domains. Our simple app only has one domain and hence does not need a domain classifier. However, complex conversational apps such as the popular virtual assistants on smartphones and smart speakers today have to handle queries from varied domains such as weather, navigation, sports, finance, and music, among others. Such apps use domain classification as the first step to narrow down the focus of the subsequent classifiers in the NLP pipeline.

To see the domain classifier in action, you can download and try out the home_assistant blueprint application.

import mindmeld as mm
mm.configure_logs()
mm.blueprint('home_assistant')

The NaturalLanguageProcessor class in MindMeld exposes methods for training, testing, and saving all the models in our classifier hierarchy, including the domain model. For example, suppose we want to build a logistic regression classifier that does domain classification. In our Python shell, we start off by instantiating an object of the NaturalLanguageProcessor class. We then train the domain_classifier model by calling its fit() method.

Note

Since our simple Kwik-E-Mart app does not have a domain classifier, the example below uses the Home Assistant blueprint to demonstrate the functionality.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('home_assistant')
nlp.domain_classifier.fit(model_settings={'classifier_type': 'logreg'})

We test the trained classifier on a new query using the predict() method.

nlp.domain_classifier.predict('close the kitchen door')
'smart_home'

To view the classification probabilities associated with all available domains, we can use the predict_proba() method.

nlp.domain_classifier.predict_proba('close the kitchen door')
[
 ('smart_home', 0.9999634367987815),
 ('times_and_dates', 1.81768265134388e-05),
 ('weather', 1.2388247900671112e-05),
 ('unknown', 4.110616819853133e-06),
 ('greeting', 1.8875099844624723e-06)
]

In addition to the model parameter we used above, the fit() method also takes parameters we can use to improve upon the baseline SVM model trained by default. These include parameters for features, cross-validation settings, and other model-specific configuration. See the User Guide for details.

Intent Classification

Intent classifiers (also called intent models) are text classification models that are trained, one-per-domain, using the labeled queries in each intent folder. Our Kwik-E-Mart app supports multiple intents (e.g. greet, get_store_hours, find_nearest_store, etc.) within the store_info domain. We will now see how to train an intent classifier that correctly maps user queries to one of these supported intents.

Training our intent model is similar to training the domain model using the NaturalLanguageProcessor class, but this time we explicitly define the features and cross-validation settings we want to use. For our intent classifier, let us assume that we want to build a logistic regression model and use bag of words and edge n-grams as features. Also, we would like to do k-fold cross validation with 10 splits to find the ideal hyperparameter values.

We demonstrate intent classification using the simpler Kwik-E-Mart application. We start as before by instantiating a NaturalLanguageProcessor object.

cd $MM_APP_ROOT
python
from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')

Next, we define the feature dictionary that lists all the feature types along with the feature-specific settings. Let's say we want bag-of-n-grams up to size 2 and edge-ngrams up to length 2.

feature_dict = {
    'bag-of-words': { 'lengths': [1, 2] },
    'edge-ngrams': { 'lengths': [1, 2] }
}

We then define the hyperparameter selection settings.

search_grid = {
  'C': [0.01, 1, 10, 100, 1000],
  'class_bias': [0, 0.3, 0.7, 1]
}

hyperparam_settings = {
  'type': 'k-fold',
  'k': 10,
  'grid': search_grid
}

Finally, we fetch the intent_classifier for the domain we are interested in and call its fit() method to train the model. The code below shows how to train an intent classifier for the store_info domain in our Kwik-E-Mart app.

clf = nlp.domains['store_info'].intent_classifier
clf.fit(model_settings={'classifier_type': 'logreg'},
        features=feature_dict,
        param_selection=hyperparam_settings)

We have now successfully trained an intent classifier for the store_info domain. If our app had more domains, we would follow the same procedure for those other domains. We can test the trained intent model on a new query by calling its predict() and predict_proba() methods.

clf.predict('Where is my closest Kwik-E-Mart?')
'find_nearest_store'
clf.predict_proba('Where is my closest Kwik-E-Mart?')
[
    ('find_nearest_store', 0.999995),
    ('get_store_hours', 0.000005),
    ('greet', 0.000000),
    ('exit', 0.000000),
    ('help', 0.000000)
]

Once we have experimented with different settings and have an optimized intent model that we are happy with, we persist the trained model to a local file using the dump() method.

my_app_dump = 'models/experimentation/intent_model_logreg.pkl'
clf.dump(my_app_dump)

See the User Guide for a comprehensive list of the different model, feature extraction and hyperparameter settings for training the domain and intent models. The User Guide also describes how to evaluate trained models using labeled test data.

Entity Recognition

Entity recognizers (also called entity models) are sequence labeling models that are trained per intent using all the annotated queries in a particular intent folder in the domains directory. The entity recognizer detects the entities within a query, and labels them as one of the pre-defined entity types.

From the model hierarchy we defined for our Kwik-E-Mart app in Step 3, we can see that the get_store_hours intent depends on two types of entities. Of these, sys_time is a system entity that MindMeld recognizes automatically. The store_name entity, on the other hand, requires custom training data and a trained entity model. Let's look at how to use the NaturalLanguageProcessor class to train entity recognizers for detecting custom entities in user queries.

In this example we use a Maximum Entropy Markov Model, which is a good choice for sequence labeling tasks like entity recognition. The features we use include a gazetteer , which is a comprehensive list of popular entity names. Gazetteers are among the most powerful and commonly used sources of information in entity recognition models. Our example gazetteer for the store_name entity type is a list of all the Kwik-E-Mart store names in our catalog, stored in a text file called gazetteer.txt and located in the appropriate subdirectory of the entities folder. MindMeld automatically utilizes any gazetteer named gazetteer.txt that is located within an entity folder. The example gazetteer file looks like this:

3rd Street
Central Plaza
East Oak Street
Elm Street
Evergreen Terrace
Main Street
Main and Market
Market Square
Shelbyville
Spalding Way
Springfield Mall
...

If we had more entity types, we would have gazetteer lists for them, too.

When words in a query fully or partly match a gazetteer entry, that can be used to derive features. This makes gazetteers particularly helpful for detecting entities which might otherwise seem to be a sequence of common nouns, such as main street, main and market, and so on. Apart from using gazetteer-based features, we'll use the bag of n-grams surrounding the token as additional features. Finally, we'll continue using 10-fold cross validation as before.

Below is the code to instantiate a NaturalLanguageProcessor object, define the features, and the hyperparameter selection settings.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')
feature_dict = {
  'in-gaz-span-seq': {},
  'bag-of-words-seq':{
      'ngram_lengths_to_start_positions': {
          1: [-1, 0, 1],
          2: [-1, 0, 1]
      }
  }
}
search_grid = {
  'C': [0.01, 1, 10, 100, 1000],
  'penalty': ['l1', 'l2']
}
hyperparam_settings = {
  'type': 'k-fold',
  'k': 10,
  'grid': search_grid
}

Next, we get the entity recognizer for the desired intent and invoke its fit() method. We also serialize the trained model to disk for future use.

recognizer = nlp.domains['store_info'].intents['get_store_hours'].entity_recognizer
recognizer.fit(model_settings={'classifier_type': 'memm'},
               features=feature_dict,
               param_selection=hyperparam_settings)
recognizer.dump('models/experimentation/entity_model_memm.pkl')

We have now trained and saved the entity recognizer for the get_store_hours intent. If more entity recognizers were required, we would have repeated the same procedure for each entity in each intent. We test the trained entity recognizer using its predict() method.

recognizer.predict('When does the store on Elm Street close?')
(<QueryEntity 'Elm Street' ('store_name') char: [23-32], tok: [5-6]>,)

See the User Guide for more about entity recognizer training and evaluation options.

Role Classification

Role classifiers (also called role models) are trained per entity using all the annotated queries in a particular intent folder. Roles offer a way to assign an additional distinguishing label to entities of the same type. Our simple Kwik-E-Mart application does not need a role classification layer. However, consider a possible extension to our app, where users can search for stores that open and close at specific times. As we saw in the example in Step 6, this would require us to differentiate between the two sys_time entities by recognizing one as an open_time and the other as a close_time. This can be accomplished by training an entity-specific role classifier that assigns the correct role label for each such sys_time entity detected by the Entity Recognizer.

Let's walk through the process of using MindMeld to train a role classifier for the sys_time entity type. The workflow is just like the previous classifiers: instantiate a NaturalLanguageProcessor object; access the classifier of interest (in this case, the role_classifier for the sys_time entity); define the machine learning settings; and, call the fit() method of the classifier. For this example, we will just use MindMeld's default configuration (Logistic Regression) to train a baseline role classifier without specifying any additional training settings. For the sake of code readability, we retrieve the classifier of interest in two steps: first get the object representing the current intent, then fetch the role_classifier object of the appropriate entity under that intent.

Note

The Kwik-E-Mart blueprint distributed with MindMeld does not use role classification. The code snippet below shows a possible extension to the app where the sys_time entity is further classified into two different roles.

For an example you can run readily, see the Home Assistant example further below.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')
get_hours_intent = nlp.domains['store_info'].intents['get_store_hours']
# MindMeld doesn't know about entities until the training queries have been loaded.
# Load queries for the relevant intent by calling build().
get_hours_intent.build()
# Get the role classifier for the 'sys_time' entity
clf = get_hours_intent.entities['sys_time'].role_classifier
clf.fit()

Once the classifier is trained, we test it on a new query using the familiar predict() method. The predict() method of the role classifier requires both the full input query and the set of entities predicted by the entity recognizer.

query = 'Show me stores open between 8 AM and 6 PM.'
recognizer = get_hours_intent.entity_recognizer
predicted_entities = recognizer.predict(query)
clf.predict(query, predicted_entities, 0)
'open_time'

Here is a different example of role classification from the Home Assistant blueprint. The home assistant app leverages roles to correctly implement the functionality of changing alarms, e.g. "Change my 6 AM alarm to 7 AM".

First, we train the role classifier.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor(app_path='home_assistant')
change_alarm_intent = nlp.domains['times_and_dates'].intents['change_alarm']
change_alarm_intent.build()
clf = change_alarm_intent.entities['sys_time'].role_classifier
clf.fit()

We can then test the classifier on a new query.

query = 'Change my 6 AM alarm to 7 AM'
recognizer = change_alarm_intent.entity_recognizer
predicted_entities = recognizer.predict(query)
clf.predict(query, predicted_entities, 0)
'old_time'
clf.predict(query, predicted_entities, 1)
'new_time'

We can further optimize our baseline role classifier using the training and evaluation options detailed in the User Guide.

Entity Resolution

The entity resolver component of MindMeld maps each identified entity to a canonical value. For example, if your application is used for browsing TV shows, you may want to map both entity strings funny and hilarious to a pre-defined genre code like Comedy. Similarly, in a music app, you may want to resolve both Elvis and The King to the artist Elvis Presley (ID=20192), while making sure not to get confused by Elvis Costello (ID=139028). Entity resolution can be straightforward for some classes of entities. For others, it can be complex enough to constitute the dominant factor limiting the overall accuracy of your application.

MindMeld provides advanced capabilities for building a state-of-the-art entity resolver. As discussed in Step 6, each entity type can be associated with an optional entity mapping file. This file specifies, for each canonical concept, the alternate names or synonyms with which a user may refer to this concept. In the absence of an entity mapping file, the entity resolver cannot resolve the entity. Instead, it logs a warning and skips adding a value attribute to the entity. For example, the following code illustrates the output of the natural language processor when an entity mapping data file is absent for the store_name entity:

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')
nlp.build()
nlp.process("When does the one on elm open?")
Failed to resolve entity 'elm' for type 'store_name'
{
  'domain': 'store_info',
  'entities': [
    {
      'role': None,
      'span': {'end': 23, 'start': 21},
      'text': 'elm',
      'type': 'store_name'
     }
  ],
  'intent': 'get_store_hours',
  'text': 'When does the one on elm open?'
}

If an entity mapping file is specified, as illustrated in Step 6, the entity resolver resolves the entity to a defined ID and canonical name. It assigns these to the value attribute of the entity, in the form of an object. Then the output of the natural language processor could resemble the following.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')
nlp.build()
nlp.process("When does the one on elm open?")
{
  'domain': 'store_info',
  'entities': [
    {
      'role': None,
      'span': {'end': 23, 'start': 21},
      'text': 'elm',
      'type': 'store_name',
      'value': [{'cname': '23 Elm Street', 'id': '1'}],
     }
  ],
  'intent': 'get_store_hours',
  'text': 'When does the one on elm open?'
}

As with the other NLP components in MindMeld, you can access the individual resolvers for each entity type.

The code below illustrates how to train and evaluate the entity resolver model for the store_name entity.

from mindmeld.components.nlp import NaturalLanguageProcessor
nlp = NaturalLanguageProcessor('.')
# MindMeld doesn't know about entities until the training queries have been loaded.
# Load queries for the relevant intent by calling build().
nlp.domains['store_info'].intents['get_store_hours'].build()
# Get the entity resolver for the entity type of interest.
resolver = nlp.domains['store_info'].intents['get_store_hours'].entities['store_name'].entity_resolver

# Train the resolver model using the mapping file, if available.
resolver.fit()

# Run the model on a detected entity
recognizer = nlp.domains['store_info'].intents['get_store_hours'].entity_recognizer
entities = recognizer.predict('When does the store on Elm Street close?')
resolver.predict(entities[0])
[{'cname': '23 Elm Street', 'score': 40.69433, 'top_synonym': 'Elm Street', 'id': '1'}, ...]

See the User Guide for more about how to evaluate and optimize entity resolution models.