mindmeld.components.entity_resolver module

This module contains the entity resolver component of the MindMeld natural language processor.

class mindmeld.components.entity_resolver.BaseEntityResolver(app_path, entity_type, resource_loader=None, **_kwargs)[source]

Bases: abc.ABC

Base class for Entity Resolvers

dump(model_path, incremental_model_path=None)[source]

Persists the trained model to disk. For an embedder-based model, the state is the cached embeddings; for text-feature-based resolvers, it is generally (if required) a serialized pickle of the underlying model/algorithm and its associated data.

In general, this method creates the following files:
  • .configs.pkl: pickle of the resolver's configurable parameters
  • .pkl.hash: a hash string obtained from a combination of KB data and the config params
  • .pkl (optional, for non-ES models): pickle of the underlying model/algo state
  • .embedder_cache.pkl (optional, for embedder models): pickle of underlying embeddings
Parameters:
  • model_path (str) -- A .pkl file path where the resolver will be dumped. The model hash is written to the {path}.hash file path.
  • incremental_model_path (str, optional) -- The timestamp folder where the cached models are stored.
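The file layout described above can be sketched with the standard library alone. This is a minimal illustration, not MindMeld's implementation: the helper names (compute_hash, dump_resolver_state, is_stale), the SHA-256-over-JSON hashing scheme, and the use of the hash to detect stale artifacts are all assumptions made for the example.

```python
import hashlib
import json
import os
import pickle
import tempfile


def compute_hash(config, kb_entities):
    """Hash a canonical JSON rendering of config + KB data (assumed scheme)."""
    payload = json.dumps({"config": config, "kb": kb_entities}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dump_resolver_state(model_path, config, kb_entities, model_state=None):
    """Write the artifacts next to ``model_path`` and return the hash string."""
    stem, _ext = os.path.splitext(model_path)
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)

    # <stem>.configs.pkl: the resolver's configurable parameters.
    with open(stem + ".configs.pkl", "wb") as fp:
        pickle.dump(config, fp)

    # {model_path}.hash: hash of KB data combined with the config params.
    hash_value = compute_hash(config, kb_entities)
    with open(model_path + ".hash", "w", encoding="utf-8") as fp:
        fp.write(hash_value)

    # {model_path}: optional, only when there is model state worth keeping.
    if model_state is not None:
        with open(model_path, "wb") as fp:
            pickle.dump(model_state, fp)
    return hash_value


def is_stale(model_path, config, kb_entities):
    """True when no hash file exists or it differs from the current inputs."""
    hash_path = model_path + ".hash"
    if not os.path.exists(hash_path):
        return True
    with open(hash_path, encoding="utf-8") as fp:
        return fp.read() != compute_hash(config, kb_entities)


# Usage: dump into a temporary folder and list what was written.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "resolver.pkl")
    dump_resolver_state(path, {"top_n": 20}, [{"cname": "Seaweed Salad"}])
    print(sorted(os.listdir(tmp_dir)))
```

Hashing the KB data together with the config means that a change to either one yields a different hash, which is one plausible way a loader could decide whether cached artifacts still apply.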
fit(clean=False, entity_map=None)[source]

Fits the resolver model, if required

Parameters:
  • clean (bool, optional) -- If True, deletes and recreates the index from scratch with synonyms in the mapping.json.
  • entity_map (Dict[str, Union[str, List]]) -- Entity map if passed in directly instead of loading from a file path
Raises:

EntityResolverError -- if the resolver cannot be fit with the loaded/passed-in data

entity_map = {
    "some_optional_key": "value",
    "entities": [
        {
            "id": "B01MTUORTQ",
            "cname": "Seaweed Salad",
            "whitelist": [...],
        },
    ],
}

load(path, entity_map=None)[source]

Loads the state of the entity resolver as well as the KB data. For an embedder model, the state is the cached embeddings; for text-feature-based resolvers, it is generally (if required) a serialized pickle of the underlying model/algorithm. The Elasticsearch resolver has no state of its own to dump.

Parameters:
  • path (str) -- A .pkl file path where the resolver has been dumped
  • entity_map (Dict[str, Union[str, List]]) -- Entity map if passed in directly instead of loading from a file path
Raises:

EntityResolverError -- if the resolver cannot be loaded from the specified path

load_deprecated()[source]

Handles the deprecated way of using the .load() method in entity resolvers. This ensures backwards compatibility when loading models that were built using an older version of MindMeld, i.e., a version <= 4.4.0. Since older versions of MindMeld do not dump a hash pickle file, using the latest .load() method on such models throws a FileNotFoundError.

predict(entity_or_list_of_entities, top_n=20, allowed_cnames=None)[source]

Predicts the resolved value(s) for the given entity using the loaded entity map or the trained entity resolution model.

Parameters:
  • entity_or_list_of_entities (Entity, tuple[Entity], str, tuple[str]) -- One or more entity query strings or Entity objects that need to be resolved.
  • top_n (int, optional) -- Maximum number of results to return. If explicitly set to 0 or None, the embedder and tf-idf entity resolvers return an unsorted list of results. This is useful when a developer wants to post-process unsorted results, for example by combining scores from multiple resolvers and then sorting.
  • allowed_cnames (Iterable, optional) -- If provided, predictions only include objects related to these canonical names.
Returns:

The top n resolved values for the provided entity.

Return type:

(list)

Raises:

EntityResolverError -- if unable to obtain predictions for the given input
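The top_n description mentions combining unsorted scores from multiple resolvers before sorting. A rough sketch of that wrapper step follows; it assumes each resolver's results have already been reduced to (cname, score) pairs, which is a simplification chosen for the example rather than the documented return format, and combine_and_rank is a hypothetical helper.

```python
from collections import defaultdict


def combine_and_rank(results_per_resolver, weights=None, top_n=20,
                     allowed_cnames=None):
    """Merge unsorted (cname, score) lists, then sort once at the end."""
    weights = weights or [1.0] * len(results_per_resolver)
    totals = defaultdict(float)
    for weight, results in zip(weights, results_per_resolver):
        for cname, score in results:
            # Mirror the allowed_cnames filter described for predict().
            if allowed_cnames is not None and cname not in allowed_cnames:
                continue
            totals[cname] += weight * score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # A falsy top_n (0 or None) keeps every result.
    return ranked[:top_n] if top_n else ranked


# Hypothetical unsorted outputs from two resolvers for the same query.
embedder_scores = [("Seaweed Salad", 0.75), ("Miso Soup", 0.25)]
tfidf_scores = [("Miso Soup", 0.5), ("Edamame", 0.5), ("Seaweed Salad", 0.25)]

print(combine_and_rank([embedder_scores, tfidf_scores], top_n=2))
```

Sorting only after the scores are merged avoids ranking each resolver's output separately and then re-ranking the union.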

unload()[source]

Unloads the model from memory. This helps reduce memory requirements while training other models.

resolver_configurations
class mindmeld.components.entity_resolver.ElasticsearchEntityResolver(app_path, entity_type, **kwargs)[source]

Bases: mindmeld.components.entity_resolver.BaseEntityResolver

Resolver class based on Elasticsearch

static ingest_synonym(app_namespace, index_name, index_type='syn', field_name=None, data=None, es_host=None, es_client=None, use_double_metaphone=False)[source]

Loads synonym documents from the mapping.json data into the specified index. If an index with the specified name doesn't exist, a new index with that name will be created.

Parameters:
  • app_namespace (str) -- The namespace of the app. Used to prevent collisions between the indices of this app and those of other apps.
  • index_name (str) -- The name of the new index to be created.
  • index_type (str) -- Specifies whether to import into the synonym index or the knowledge base object index. INDEX_TYPE_SYNONYM (the default) imports the synonyms into the synonym index, while INDEX_TYPE_KB imports them into the existing knowledge base index.
  • field_name (str) -- specify name of the knowledge base field that the synonym list corresponds to when index_type is INDEX_TYPE_SYNONYM.
  • data (list) -- A list of documents to be loaded into the index.
  • es_host (str) -- The Elasticsearch host server.
  • es_client (Elasticsearch) -- The Elasticsearch client.
  • use_double_metaphone (bool) -- Whether to use the phonetic mapping or not.
load_deprecated()[source]

Handles the deprecated way of using the .load() method in entity resolvers. This ensures backwards compatibility when loading models that were built using an older version of MindMeld, i.e., a version <= 4.4.0. Since older versions of MindMeld do not dump a hash pickle file, using the latest .load() method on such models throws a FileNotFoundError.

ES_SYNONYM_INDEX_PREFIX = 'synonym'

The prefix of the ES index.

resolver_configurations
class mindmeld.components.entity_resolver.EmbedderCosSimEntityResolver(app_path, entity_type, **kwargs)[source]

Bases: mindmeld.components.entity_resolver.BaseEntityResolver

Resolver class for embedder models that create dense embeddings

get_processed_entity_map(entity_map)[source]

Processes the entity map into a format suitable for indexing and similarity searching

Parameters:
  • entity_map (Dict[str, Union[str, List]]) -- Entity map if passed in directly instead of loading from a file path
Returns:

A processed entity map better suited for indexing and querying

Return type:

processed_entity_map (Dict)

load_deprecated()[source]

Handles the deprecated way of using the .load() method in entity resolvers. This ensures backwards compatibility when loading models that were built using an older version of MindMeld, i.e., a version <= 4.4.0. Since older versions of MindMeld do not dump a hash pickle file, using the latest .load() method on such models throws a FileNotFoundError.

predict_batch(entity_list, top_n: int = 20, batch_size: int = 8)[source]
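
No description is given for predict_batch, so the following is only a guess at the general pattern its signature suggests: split the queries into chunks of batch_size, embed each chunk, and rank synonyms by cosine similarity. The toy_encode function is a hypothetical stand-in for a real embedder (it merely counts letters), and predict_batch_sketch is not the library method.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of at most ``batch_size`` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_encode(texts):
    """Hypothetical embedder: one letter-count vector per input text."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[text.lower().count(ch) for ch in alphabet] for text in texts]


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_batch_sketch(queries, synonyms, top_n=20, batch_size=8):
    """Return, per query, the top_n (synonym, score) pairs by cosine score."""
    synonym_vectors = toy_encode(synonyms)  # embed the KB side once
    results = []
    for batch in batched(queries, batch_size):
        for query_vector in toy_encode(batch):  # one encoder call per batch
            scored = [
                (synonym, cosine(query_vector, vector))
                for synonym, vector in zip(synonyms, synonym_vectors)
            ]
            scored.sort(key=lambda pair: pair[1], reverse=True)
            results.append(scored[:top_n])
    return results


dishes = ["Seaweed Salad", "Miso Soup", "Edamame"]
for ranked in predict_batch_sketch(["seaweed salad", "edamame"], dishes, top_n=1):
    print(ranked)
```

Batching mainly matters when the encoder is expensive per call, since it bounds how many texts are embedded at once.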
resolver_configurations
class mindmeld.components.entity_resolver.EntityResolver[source]

Bases: object

Class for backwards compatibility

Deprecated usage:
>>> entity_resolver = EntityResolver(
        app_path, resource_loader, entity_type
    )
New usage:
>>> entity_resolver = EntityResolverFactory.create_resolver(
        app_path, entity_type
    )
# or ...
>>> entity_resolver = EntityResolverFactory.create_resolver(
        app_path, entity_type, resource_loader=resource_loader
    )
class mindmeld.components.entity_resolver.EntityResolverFactory[source]

Bases: object

classmethod create_resolver(app_path, entity_type, config=None, resource_loader=None, **kwargs)[source]
Identifies the appropriate entity resolver based on the input config and returns it.
Parameters:
  • app_path (str) -- The application path.
  • entity_type (str) -- The entity type associated with this entity resolver.
  • resource_loader (ResourceLoader) -- An object which can load resources for the resolver.
  • er_config (dict) -- A classifier config
  • es_host (str) -- The Elasticsearch host server.
  • es_client (Elasticsearch) -- The Elasticsearch client.
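The selection step that create_resolver describes can be pictured as a registry lookup. The sketch below is a generic factory pattern under stated assumptions: the "resolver_type" config key, the registry keys, and the placeholder classes are all hypothetical and are not taken from MindMeld's source.

```python
class ExactMatchSketch:
    """Placeholder resolver that just records its constructor arguments."""

    def __init__(self, app_path, entity_type, **kwargs):
        self.app_path = app_path
        self.entity_type = entity_type
        self.kwargs = kwargs


class TfIdfSketch(ExactMatchSketch):
    """Second placeholder, so the factory has something to choose between."""


# Hypothetical registry; the keys are illustrative, not MindMeld's values.
RESOLVER_REGISTRY = {
    "exact_match": ExactMatchSketch,
    "tfidf": TfIdfSketch,
}


def create_resolver_sketch(app_path, entity_type, config=None,
                           resource_loader=None, **kwargs):
    """Pick a resolver class from the config and instantiate it."""
    config = config or {}
    resolver_type = config.get("resolver_type", "exact_match")
    try:
        resolver_cls = RESOLVER_REGISTRY[resolver_type]
    except KeyError:
        raise ValueError(f"Unknown resolver type: {resolver_type!r}") from None
    return resolver_cls(
        app_path, entity_type,
        config=config, resource_loader=resource_loader, **kwargs
    )


resolver = create_resolver_sketch(
    "my_app", "dish", config={"resolver_type": "tfidf"}
)
print(type(resolver).__name__)
```

Keeping the mapping in one dictionary means that callers never import a concrete resolver class, which is the same decoupling the factory-based "new usage" is aiming for.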
class mindmeld.components.entity_resolver.ExactMatchEntityResolver(app_path, entity_type, **kwargs)[source]

Bases: mindmeld.components.entity_resolver.BaseEntityResolver

Resolver class based on exact matching

get_processed_entity_map(entity_map)[source]

Processes the entity map into a format suitable for indexing and similarity searching

Parameters:
  • entity_map (Dict[str, Union[str, List]]) -- Entity map if passed in directly instead of loading from a file path
Returns:

A processed entity map better suited for indexing and querying

Return type:

processed_entity_map (Dict)

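The format of the processed entity map is not spelled out here. For exact matching, one natural shape is a lookup from each normalized synonym to its canonical name, sketched below. The helper names, the lowercase-and-strip normalization, and the whitelist values are assumptions made for illustration.

```python
def process_entity_map(entity_map):
    """Build a normalized text -> cname lookup (assumed target format)."""
    lookup = {}
    for entity in entity_map.get("entities", []):
        cname = entity["cname"]
        # The canonical name resolves to itself, alongside its synonyms.
        for text in [cname, *entity.get("whitelist", [])]:
            lookup.setdefault(text.strip().lower(), cname)
    return lookup


def resolve_exact(text, lookup):
    """Return the canonical name for an exact (normalized) match, else None."""
    return lookup.get(text.strip().lower())


# Whitelist entries below are invented for the example.
entity_map = {
    "entities": [
        {
            "id": "B01MTUORTQ",
            "cname": "Seaweed Salad",
            "whitelist": ["seaweed", "wakame salad"],
        },
    ],
}

lookup = process_entity_map(entity_map)
print(resolve_exact("Wakame Salad", lookup))
```

Using setdefault keeps the first canonical name seen for a synonym, so a later entity cannot silently take over a synonym that an earlier one already claimed.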
load_deprecated()[source]

Handles the deprecated way of using the .load() method in entity resolvers. This ensures backwards compatibility when loading models that were built using an older version of MindMeld, i.e., a version <= 4.4.0. Since older versions of MindMeld do not dump a hash pickle file, using the latest .load() method on such models throws a FileNotFoundError.

resolver_configurations
class mindmeld.components.entity_resolver.SentenceBertCosSimEntityResolver(app_path, entity_type, **kwargs)[source]

Bases: mindmeld.components.entity_resolver.EmbedderCosSimEntityResolver

Resolver class for BERT models based on the sentence-transformers library: https://github.com/UKPLab/sentence-transformers

class mindmeld.components.entity_resolver.TfIdfSparseCosSimEntityResolver(app_path, entity_type, **kwargs)[source]

Bases: mindmeld.components.entity_resolver.BaseEntityResolver

A tf-idf based entity resolver using sparse matrices. Ref: scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

find_similarity(src_texts, top_n=20, scores_normalizer=None, _return_as_dict=False, _no_sort=False)[source]

Computes sparse cosine similarity

Parameters:
  • src_texts (Union[str, list]) -- string or list of strings to obtain matching scores for.
  • top_n (int, optional) -- Maximum number of results to return. If None, equals the length of self._syn_tfidf_matrix
  • scores_normalizer -- normalizer type to normalize scores. Allowed values are: "min_max_scaler", "standard_scaler"
Returns:

If _return_as_dict is set, a dictionary of tgt_texts and their scores; otherwise, a list of sorted synonym names paired with their similarity scores (descending order)

Return type:

Union[dict, list[tuple]]
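
To make the idea of sparse cosine similarity over tf-idf vectors concrete, here is a dependency-free sketch that stores each vector as a dict of non-zero terms. It is not the resolver's code: the character-trigram tokenization and all function names are assumptions, the idf formula is the smoothed form scikit-learn's TfidfVectorizer uses by default, and only a min-max style normalizer is shown.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(documents):
    """Smoothed idf per term: ln((1 + N) / (1 + df)) + 1."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(char_ngrams(doc)))
    total = len(documents)
    return {t: math.log((1 + total) / (1 + df)) + 1 for t, df in doc_freq.items()}


def vectorize(text, idf):
    """L2-normalized sparse tf-idf vector, restricted to known terms."""
    counts = Counter(t for t in char_ngrams(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def sparse_cosine(u, v):
    """Dot product of unit vectors, iterating over the smaller one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def min_max_scale(scores):
    """Rescale scores to [0, 1]; all-equal inputs map to 0.0."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def find_similarity_sketch(src_text, tgt_texts, top_n=20, normalize=False):
    """Rank tgt_texts by cosine similarity to src_text (descending)."""
    idf = fit_idf(tgt_texts)
    query = vectorize(src_text, idf)
    scores = [sparse_cosine(query, vectorize(t, idf)) for t in tgt_texts]
    if normalize:
        scores = min_max_scale(scores)
    ranked = sorted(zip(tgt_texts, scores), key=lambda p: p[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


synonyms = ["Seaweed Salad", "Miso Soup", "Edamame"]
print(find_similarity_sketch("seaweed", synonyms, top_n=1, normalize=True))
```

Because both vectors are L2-normalized up front, cosine similarity reduces to a dot product over the terms they share, which is what makes the sparse representation cheap.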

get_processed_entity_map(entity_map)[source]

Processes the entity map into a format suitable for indexing and similarity searching

Parameters:
  • entity_map (Dict[str, Union[str, List]]) -- Entity map if passed in directly instead of loading from a file path
Returns:

A processed entity map better suited for indexing and querying

Return type:

processed_entity_map (Dict)

load_deprecated()[source]

Handles the deprecated way of using the .load() method in entity resolvers. This ensures backwards compatibility when loading models that were built using an older version of MindMeld, i.e., a version <= 4.4.0. Since older versions of MindMeld do not dump a hash pickle file, using the latest .load() method on such models throws a FileNotFoundError.

resolver_configurations