mindmeld.models.taggers.crf module

This module contains the CRF entity recognizer.

class mindmeld.models.taggers.crf.CRFTagger(**parameters)

Bases: mindmeld.models.taggers.taggers.Tagger

A Conditional Random Fields model.

dump(path)

Dumps any tagger-specific data to disk and returns a model_path (modified if required). This is a no-op, since default serializable SKLearn models need no special handling to dump.

Parameters:	path (str) -- The path to dump the model to

static extract_example_features(example, config, resources)

Extracts feature dicts for each token in an example.

Parameters:	example (mindmeld.core.Query) -- A query.
	config (ModelConfig) -- The ModelConfig which may contain information used for feature extraction.
	resources (dict) -- Resources which may be used for this model's feature extraction.
Returns:	Features.
Return type:	list[dict]

extract_features(examples, config, resources, y=None, fit=False, in_memory=True)

Transforms a list of examples into a feature matrix.

Parameters:	examples (list of mindmeld.core.Query) -- A list of queries.
	config (ModelConfig) -- The ModelConfig which may contain information used for feature extraction.
	resources (dict) -- Resources which may be used for this model's feature extraction.
Returns:	Features in CRF suite format.
Return type:	(list of list of str)

fit(X, y)

Trains the model. X and y are in the format returned by extract_features; there is no other restriction on their type or content. X should be the fully processed data, with features extracted and ready for training. y should be a list of classes as encoded by the label_encoder.

Parameters:	X (list) -- Generally a list of feature vectors, one for each training example.
	y (list) -- A list of classification labels (encoded by the label_encoder, NOT MindMeld entity objects).
Returns:	self

get_params(deep=True)

Gets a dictionary of all current model parameters and their values.

Parameters:	deep (bool) -- Not used, needed for sklearn compatibility
Returns:	A dictionary of the model parameter names as keys and their set values
Return type:	(dict)

get_torch_encoder()

load(path)

Loads the model state into memory. This is a no-op, since default serializable SKLearn models need no special handling to load.

Parameters:	path (str) -- The path to load the model from

predict(X, dynamic_resource=None)

Predicts the labels from a feature matrix X. X is in the format returned by extract_features.

Parameters:	X (list) -- A list of feature vectors, one for each example
Returns:	a list of predicted labels (in an encoded format)
Return type:	(list of classification labels)

predict_proba(examples, config, resources)

Parameters:	examples (list of mindmeld.core.Query) -- A list of queries to predict on.
	config (ModelConfig) -- The ModelConfig which may contain information used for feature extraction.
	resources (dict) -- Resources which may be used for this model's feature extraction.
Returns:	A list of predicted labels with confidence scores.
Return type:	list of tuples of (mindmeld.core.QueryEntity)

predict_proba_distribution(examples, config, resources)

Parameters:	examples (list of mindmeld.core.Query) -- A list of queries to predict on.
	config (ModelConfig) -- The ModelConfig which may contain information used for feature extraction.
	resources (dict) -- Resources which may be used for this model's feature extraction.
Returns:	A list of predicted labels with confidence scores.
Return type:	list of list of ((list of str) and (list of float))

set_params(**parameters)

Sets the model parameters. Defaults should be set for all parameters such that a model is initialized with reasonable default parameters if none are explicitly passed in.

Parameters:	**parameters -- Arbitrary keyword arguments. The keys are model parameter names and the values are what they should be set to
Returns:	self
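
The get_params / set_params pair follows the familiar sklearn-style convention: defaults are filled in first, explicit keyword arguments override them, and set_params returns self so calls can be chained. The sketch below illustrates that convention only. SketchTagger and its parameter names (max_iterations, regularization) are hypothetical stand-ins, not the CRFTagger implementation or its real parameters.

```python
class SketchTagger:
    """Hypothetical stand-in showing the get_params/set_params convention."""

    # Assumed default values, for illustration only.
    _DEFAULTS = {"max_iterations": 100, "regularization": 0.1}

    def __init__(self, **parameters):
        # Start from the defaults so the model is usable with no arguments.
        self._params = dict(self._DEFAULTS)
        self.set_params(**parameters)

    def set_params(self, **parameters):
        unknown = set(parameters) - set(self._DEFAULTS)
        if unknown:
            raise ValueError(f"Unknown parameters: {sorted(unknown)}")
        self._params.update(parameters)
        return self  # returning self allows chaining

    def get_params(self, deep=True):
        # `deep` is accepted for sklearn compatibility and otherwise ignored.
        return dict(self._params)


tagger = SketchTagger(regularization=0.5)
params = tagger.set_params(max_iterations=50).get_params()
```

Returning a copy from get_params keeps callers from mutating the model's internal state by accident.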

set_torch_encoder(encoder)

setup_model(config)

Not implemented.

is_serializable

class mindmeld.models.taggers.crf.FeatureBinner

Bases: object

Class to convert features with numerical values to categorical values.

fit(X_train)

Create and fit FeatureMapper for numerical features.

Parameters:	X_train (list of list of dict) -- training data

fit_transform(X_train)

Run fit and transform at once.

Parameters:	X_train (list of list of dict) -- training data

transform(X_train)

Convert numerical values to categorical values.

Parameters:	X_train (list of list of dict) -- training data
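
To make the "list of list of dict" shape concrete, here is a minimal standalone sketch of fitting statistics on numeric features and then replacing their values with categories. It does not use or reproduce FeatureBinner. The helper names (fit_stats, transform), the feature names, and the three coarse labels ("low", "mid", "high") are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def fit_stats(X_train):
    """Collect mean and standard deviation for each numeric feature."""
    collected = {}
    for sentence in X_train:            # one list per query
        for token_features in sentence:  # one dict per token
            for name, value in token_features.items():
                if _is_number(value):
                    collected.setdefault(name, []).append(value)
    return {name: (mean(vals), pstdev(vals)) for name, vals in collected.items()}


def transform(X, stats):
    """Replace numeric values with a coarse label; keep other features as-is."""
    def to_label(name, value):
        if name not in stats or not _is_number(value):
            return value
        mu, sigma = stats[name]
        if value < mu - sigma:
            return "low"
        if value > mu + sigma:
            return "high"
        return "mid"

    return [
        [{name: to_label(name, value) for name, value in token.items()}
         for token in sentence]
        for sentence in X
    ]


X_train = [
    [{"word": "play", "freq": 1.0}, {"word": "some", "freq": 2.0}],
    [{"word": "jazz", "freq": 9.0}],
]
stats = fit_stats(X_train)
X_binned = transform(X_train, stats)
```

The sketch builds new dicts rather than editing the input in place, so the original training data stays available after the transform.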

class mindmeld.models.taggers.crf.FeatureMapper(num_std=2, size_std=0.5)

Bases: object

Mapper for one feature to map numerical values to corresponding bins which are generated by the mean and standard deviation of this feature.

The size and number of bins are determined by num_std and size_std. For example, with num_std = 2 and size_std = 0.5, the bins are:

bucket 0: (-INF, mean - std * 2)
bucket 1: [mean - std * 2, mean - std * 1.5)
bucket 2: [mean - std * 1.5, mean - std * 1)
...
bucket 8: [mean + std * 1.5, mean + std * 2)
bucket 9: [mean + std * 2, INF)
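
The bucket layout above can be reproduced with a short standalone sketch: compute the boundaries from the mean and standard deviation, then look up a value with a binary search. This is an illustration of the scheme as described here, not the FeatureMapper source. The function names make_bin_edges and bucket_for, and the sample mean and std, are hypothetical.

```python
import bisect


def make_bin_edges(mean, std, num_std=2, size_std=0.5):
    """Return the sorted boundaries between buckets.

    Boundaries run from mean - num_std * std to mean + num_std * std
    in steps of size_std * std.
    """
    steps = int(round(2 * num_std / size_std))
    return [mean + std * (-num_std + i * size_std) for i in range(steps + 1)]


def bucket_for(value, edges):
    """Map a value to its bucket number.

    bisect_right sends a value equal to a boundary into the bucket on
    its right, which matches the left-closed intervals [lower, upper).
    """
    return bisect.bisect_right(edges, value)


# Assumed statistics for illustration: mean = 10, std = 2.
edges = make_bin_edges(mean=10.0, std=2.0)
buckets = [bucket_for(v, edges) for v in (5.9, 6.0, 10.0, 13.5, 14.0)]
```

With the defaults there are nine boundaries and therefore ten buckets, numbered 0 through 9 as in the listing.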

_num_std

int -- number of standard deviations to generate the bins

_size_std

float -- size of each bin in standard deviation

add_value(value)

Collect values for this feature.

Parameters:	value (numeric) -- A numeric value

fit()

Calculate statistics and then create the bins.

map_bucket(value)

Get corresponding bucket number for this value.

Parameters:	value (float) -- numerical value of this feature