mindmeld.components.entity_recognizer module

This module contains the entity recognizer component of the MindMeld natural language processor.

class mindmeld.components.entity_recognizer.EntityRecognizer(resource_loader, domain, intent)[source]

Bases: mindmeld.components.classifier.Classifier

An entity recognizer identifies the entities in a given query. It is trained on all the labeled queries for a particular intent; the labels are the entity annotations for each query.

domain

str -- The domain that this entity recognizer belongs to

intent

str -- The intent that this entity recognizer belongs to

entity_types

set -- A set containing the entity types which can be recognized

dump(model_path, incremental_model_path=None)[source]

Save the model.

Parameters:
  • model_path (str) -- The model path.
  • incremental_model_path (str, Optional) -- The timestamped folder where the cached models are stored.
fit(queries=None, label_set=None, incremental_timestamp=None, load_cached=True, **kwargs)[source]

Trains a statistical model for classification using the provided training examples and model configuration.

Parameters:
  • queries (list(ProcessedQuery) or ProcessedQueryList, optional) -- A list of queries to train on. If not specified the queries will be loaded from the label_set.
  • label_set (str) -- A label set to load. If not specified, the default training set will be loaded.
  • incremental_timestamp (str, optional) -- The timestamp folder to cache models in
  • model_type (str, optional) -- The type of machine learning model to use. If omitted, the default model type will be used.
  • model_settings (dict) -- Settings specific to the model type specified
  • features (dict) -- Features to extract from each example instance to form the feature vector used for model training. If omitted, the default feature set for the model type will be used.
  • params (dict) -- Params to pass to the underlying classifier
  • params_selection (dict) -- The grid of hyper-parameters to search, for finding the optimal hyper-parameter settings for the model. If omitted, the default hyper-parameter search grid will be used.
  • param_selection (dict) -- Configuration for parameter selection (using cross-validation), for example {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {'C': [100, 10000, 1000000]}}
  • features -- The keys are the names of feature extractors and the values are either a kwargs dict which will be passed into the feature extractor function, or a callable which will be used to extract features.
  • load_cached (bool) -- If the model is cached on disk, load it into memory.
Returns:

True if model was loaded and fit, False if a valid cached model exists but was not loaded (controlled by the load_cached arg).

Examples

Fit using the default configuration.

>>> clf.fit()

Fit using a 'special' label set.

>>> clf.fit(label_set='special')

Fit using the given params, bypassing cross-validation. This is useful for reducing training time if you are confident the params are already optimized.

>>> clf.fit(params={'C': 10000000})

Fit using given parameter selection settings (also known as cross-validation settings).

>>> clf.fit(param_selection={})
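
The example dictionary given for param_selection can be inspected before it is passed to fit. The sketch below is plain Python and does not call MindMeld. expand_grid is a hypothetical helper, and treating each combination under 'grid' as one candidate setting is an assumption made for illustration, not a description of MindMeld internals.

```python
from itertools import product

# Values copied from the param_selection example in the parameter list.
param_selection = {
    "type": "shuffle",
    "n": 3,
    "k": 10,
    "n_jobs": 2,
    "scoring": "",
    "grid": {"C": [100, 10000, 1000000]},
}


def expand_grid(grid):
    """Hypothetical helper: list every combination of values in the grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_selection["grid"])
for candidate in candidates:
    print(candidate)
```

With a single key holding three values, the grid expands to three candidates. A dictionary of this shape would then be passed as clf.fit(param_selection=param_selection).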

Fit using a custom set of features, including a custom feature extractor. This is only for advanced users.

>>> clf.fit(features={
        'in-gaz': {},  # gazetteer features
        'contrived': lambda exa, res: {'contrived': len(exa.text) == 26}
    })
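
The custom extractor in the last example can be tried on its own before training. The sketch below is a minimal stand-in that does not import MindMeld: the SimpleNamespace object is a hypothetical substitute for the example instance, and the empty dict passed as resources is likewise an assumption. Only the two-argument call shape and the returned dict mirror the lambda shown above.

```python
from types import SimpleNamespace


def contrived_extractor(example, resources):
    """Named equivalent of the lambda in the example above."""
    return {"contrived": len(example.text) == 26}


# The same features mapping as in the example, with the named function.
features = {
    "in-gaz": {},  # kwargs dict for a feature extractor referenced by name
    "contrived": contrived_extractor,  # callable used directly
}

# Hypothetical stand-in for an example instance; only .text is needed here.
example = SimpleNamespace(text="abcdefghijklmnopqrstuvwxyz")
result = features["contrived"](example, {})
print(result)
```

Using a named function instead of a lambda makes the extractor easier to test in isolation.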
get_entity_types(queries=None, label_set=None, **kwargs)[source]
inspect(query, gold_label=None, dynamic_resource=None)[source]
load(model_path)[source]

Loads the trained entity recognition model from disk.

Parameters: model_path (str) -- The location on disk where the model is stored.
predict(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]

Predicts entities for the given query using the trained recognition model.

Parameters:
  • query (Query, str) -- The input query.
  • time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
  • timestamp (long, optional) -- A Unix timestamp for the request (in seconds).
  • dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference.
Returns:

The predicted class label.

Return type:

(str)
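
Because timestamp is expressed in seconds, values taken from APIs that report milliseconds need converting first. The sketch below prepares the time_zone and timestamp arguments using only the standard library. build_predict_kwargs is a hypothetical helper, not part of MindMeld, and the commented-out call at the end assumes a trained recognizer named clf.

```python
from datetime import datetime, timezone


def build_predict_kwargs(moment, time_zone):
    """Hypothetical helper: build the optional time arguments for predict.

    moment must be timezone-aware so the conversion to Unix time is
    unambiguous; time_zone is passed through as an IANA name.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be a timezone-aware datetime")
    return {
        "time_zone": time_zone,
        "timestamp": int(moment.timestamp()),  # whole seconds, not ms
    }


request_time = datetime(2020, 1, 1, tzinfo=timezone.utc)
kwargs = build_predict_kwargs(request_time, "America/Los_Angeles")
print(kwargs)

# Assumed usage with a trained recognizer:
# clf.predict("remind me tomorrow at noon", **kwargs)
```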

predict_proba(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]

Runs prediction on a given query and generates multiple entity tagging hypotheses, with their associated probabilities, using the trained entity recognition model.

Parameters:
  • query (Query, str) -- The input query.
  • time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
  • timestamp (long, optional) -- A Unix timestamp for the request (in seconds).
  • dynamic_resource (optional) -- Dynamic resource, unused.
Returns:

A list of tuples of the form (Entity list, float) grouping potential entity tagging hypotheses and their probabilities.

Return type:

(list)
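
The return value described above, a list of (Entity list, float) tuples, can be post-processed with ordinary list operations. The sketch below uses made-up data: plain strings stand in for Entity objects, the probabilities are invented for illustration, and best_hypothesis and above_threshold are hypothetical helpers rather than MindMeld functions.

```python
# Stand-in for the result of predict_proba: each item pairs a list of
# entities (plain strings here, not real Entity objects) with a probability.
hypotheses = [
    (["city:boston"], 0.25),
    (["city:boston", "time:tomorrow"], 0.5),
    ([], 0.125),
]


def best_hypothesis(items):
    """Return the (entities, probability) pair with the highest probability."""
    if not items:
        return None
    return max(items, key=lambda pair: pair[1])


def above_threshold(items, threshold):
    """Keep only hypotheses whose probability is at least the threshold."""
    return [pair for pair in items if pair[1] >= threshold]


top_entities, top_probability = best_hypothesis(hypotheses)
print(top_entities, top_probability)
print(above_threshold(hypotheses, 0.2))
```

Whether the returned list is already sorted by probability is not stated here, so the sketch does not rely on ordering.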

unload()[source]

Unloads the model from memory. This helps reduce memory requirements while training other models.

CLF_TYPE = 'entity'

The classifier type.