mindmeld.components.entity_recognizer module

This module contains the entity recognizer component of the MindMeld natural language processor.

class mindmeld.components.entity_recognizer.EntityRecognizer(resource_loader, domain, intent)

Bases: mindmeld.components.classifier.Classifier

An entity recognizer which is used to identify the entities for a given query. It is trained using all the labeled queries for a particular intent. The labels are the entity annotations for each query.

domain: str -- The domain that this entity recognizer belongs to

intent: str -- The intent that this entity recognizer belongs to

entity_types: set -- A set containing the entity types which can be recognized

dump(model_path, incremental_model_path=None)

Save the model.

Parameters:

model_path (str) -- The model path.
incremental_model_path (str, optional) -- The timestamped folder where the cached models are stored.

fit(queries=None, label_set=None, incremental_timestamp=None, load_cached=True, **kwargs)

Trains a statistical model for classification using the provided training examples and model configuration.

Parameters:

queries (list(ProcessedQuery) or ProcessedQueryList, optional) -- A list of queries to train on. If not specified the queries will be loaded from the label_set.
label_set (str) -- A label set to load. If not specified, the default training set will be loaded.
incremental_timestamp (str, optional) -- The timestamp folder to cache models in
model_type (str, optional) -- The type of machine learning model to use. If omitted, the default model type will be used.
model_settings (dict) -- Settings specific to the model type specified
features (dict) -- Features to extract from each example instance to form the feature vector used for model training. The keys are the names of feature extractors and the values are either a kwargs dict which will be passed into the feature extractor function, or a callable which will be used to extract features. If omitted, the default feature set for the model type will be used.
params (dict) -- Params to pass to the underlying classifier.
params_selection (dict) -- The grid of hyper-parameters to search, for finding the optimal hyper-parameter settings for the model. If omitted, the default hyper-parameter search grid will be used.
param_selection (dict) -- Configuration for param selection (using cross-validation), for example {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {'C': [100, 10000, 1000000]}}.
load_cached (bool) -- If the model is cached on disk, load it into memory.

Returns:

True if the model was loaded and fit; False if a valid cached model exists but was not loaded (controlled by the load_cached argument).

Examples

Fit using the default configuration.

>>> clf.fit()

Fit using a 'special' label set.

>>> clf.fit(label_set='special')

Fit using the given params, bypassing cross-validation. This is useful for speeding up training if you are confident the params are already optimized.

>>> clf.fit(params={'C': 10000000})

Fit using given parameter selection settings (also known as cross-validation settings).

>>> clf.fit(param_selection={})
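As a rough illustration of what a non-empty param_selection value might look like, the sketch below reuses the example dictionary from the parameter description and enumerates the hyper-parameter combinations in its 'grid'. The helper name expand_grid is hypothetical, and the enumeration is plain Python rather than MindMeld's own search code.

```python
from itertools import product

# Example settings copied from the param_selection description above.
param_selection = {
    'type': 'shuffle',
    'n': 3,
    'k': 10,
    'n_jobs': 2,
    'scoring': '',
    'grid': {'C': [100, 10000, 1000000]},
}


def expand_grid(grid):
    """Return every combination of values in a hyper-parameter grid.

    Hypothetical helper for illustration; not part of MindMeld.
    """
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# One candidate setting per grid combination.
candidates = expand_grid(param_selection['grid'])
print(candidates)
```

The dictionary itself would then be passed as clf.fit(param_selection=param_selection).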

Fit using a custom set of features, including a custom feature extractor. This is only for advanced users.

>>> clf.fit(features={
...     'in-gaz': {},  # gazetteer features
...     'contrived': lambda exa, res: {'contrived': len(exa.text) == 26}
... })
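The inline lambda in the example above can also be written as a named function, which is easier to test in isolation. The sketch below assumes, following that example, that a custom extractor receives the example and a resources argument and returns a dict of feature names to values. The SimpleNamespace object is only a stand-in for a real example and is not a MindMeld type.

```python
from types import SimpleNamespace


def contrived(exa, res):
    """Flag examples whose text is exactly 26 characters long."""
    return {'contrived': len(exa.text) == 26}


# Stand-in for a real example object; only the `text` attribute is used here.
fake_example = SimpleNamespace(text='abcdefghijklmnopqrstuvwxyz')

# Same shape as the `features` argument in the example above.
features = {'in-gaz': {}, 'contrived': contrived}

# Call the extractor directly to check its output before training.
result = features['contrived'](fake_example, None)
print(result)
```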

get_entity_types(queries=None, label_set=None, **kwargs)

inspect(query, gold_label=None, dynamic_resource=None)

load(model_path)

Loads the trained entity recognition model from disk.

Parameters:	model_path (str) -- The location on disk where the model is stored.

predict(query, time_zone=None, timestamp=None, dynamic_resource=None)

Predicts entities for the given query using the trained recognition model.

Parameters:

query (Query, str) -- The input query.
time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
timestamp (long, optional) -- A unix time stamp for the request (in seconds).
dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference.

Returns:	The entities predicted for the query.
Return type:	(tuple)
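Since timestamp is documented in seconds and time_zone as an IANA name, a small sketch of preparing both values may help. The date below is an arbitrary example, and predict_kwargs is just an illustrative variable name.

```python
from datetime import datetime, timezone

# The request time as a timezone-aware datetime (arbitrary example value).
request_time = datetime(2019, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# The timestamp parameter is documented in seconds, so convert to
# whole seconds since the unix epoch (not milliseconds).
timestamp = int(request_time.timestamp())

predict_kwargs = {
    'time_zone': 'America/Los_Angeles',  # IANA time zone name
    'timestamp': timestamp,
}
print(predict_kwargs)
```

These values could then be passed along with the query, for example clf.predict(query, **predict_kwargs).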

predict_proba(query, time_zone=None, timestamp=None, dynamic_resource=None)

Runs prediction on a given query and generates multiple entity tagging hypotheses with their associated probabilities using the trained entity recognition model.

Parameters:

query (Query, str) -- The input query.
time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
timestamp (long, optional) -- A unix time stamp for the request (in seconds).
dynamic_resource (optional) -- Dynamic resource, unused.

Returns:	A list of tuples of the form (Entity list, float) grouping potential entity tagging hypotheses and their probabilities.
Return type:	(list)
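The documented return shape, a list of (Entity list, float) tuples, can be consumed with ordinary list handling. The sketch below uses made-up data in which plain strings stand in for real entity objects. The helper best_hypothesis and its threshold argument are hypothetical and not part of MindMeld.

```python
# Stand-in for predict_proba output: (entity list, probability) tuples.
# Plain strings replace real entity objects; the values are made up.
hypotheses = [
    (['store_name'], 0.62),
    (['store_name', 'sys_time'], 0.31),
    ([], 0.07),
]


def best_hypothesis(hypotheses, threshold=0.5):
    """Return the most probable hypothesis, or None if it is below the threshold."""
    if not hypotheses:
        return None
    entities, prob = max(hypotheses, key=lambda h: h[1])
    return (entities, prob) if prob >= threshold else None


best = best_hypothesis(hypotheses)
print(best)
```

Applying a threshold like this is one way to fall back to "no entities" when the recognizer is not confident in any single tagging.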

unload()

Unloads the model from memory. This helps reduce memory requirements while training other models.

CLF_TYPE = 'entity': The classifier type.