mindmeld.models.embedder_models module

This module contains the embedder model class.

class mindmeld.models.embedder_models.BertEmbedder(app_path=None, cache_path=None, pretrained_name_or_abspath=None, **kwargs)[source]

Bases: mindmeld.models.embedder_models.Embedder

Encoder class for BERT models based on https://github.com/UKPLab/sentence-transformers

encode(phrases)[source]

Encodes input text(s) into embeddings, one vector for each phrase

Parameters: phrases (str or list[str]) -- textual input(s) to be encoded with the sentence-transformers model
Returns:
By default, a numpy array is returned. If convert_to_tensor is set, a stacked tensor is returned; if convert_to_numpy is set, a numpy matrix is returned.
Return type: (Union[List[Tensor], ndarray, Tensor])
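
For illustration, a minimal usage sketch (the pretrained model name is only an example, and depending on your setup additional constructor arguments such as app_path or cache_path may be required):

    from mindmeld.models.embedder_models import BertEmbedder

    # Assumed example model; any sentence-transformers model name or local path should work similarly.
    bert = BertEmbedder(pretrained_name_or_abspath="sentence-transformers/all-MiniLM-L6-v2")

    vectors = bert.encode(["book a table for two", "what's the weather today"])
    print(vectors[0].shape)  # one embedding vector per input phrase
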
load()[source]

Loads the embedder model

Returns: The model object.
CACHE_MODELS = {}
model_id

Returns a unique hash representation of the embedder model based on its name and configs

class mindmeld.models.embedder_models.Embedder(app_path=None, cache_path=None, **kwargs)[source]

Bases: abc.ABC

Base class for embedder models

class EmbeddingsCache(cache_path=None)[source]

Bases: object

clear(cache_path=None)[source]

Deletes the cache file.

dump(cache_path=None)[source]

Dumps the cache to disk.

get(text, default=None)[source]
load(cache_path=None)[source]

Loads the cache file.

reset()[source]
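
A brief sketch of using the nested cache directly; the cache path is hypothetical, and in normal use the parent Embedder manages this object for you:

    from mindmeld.models.embedder_models import Embedder

    cache = Embedder.EmbeddingsCache(cache_path="/tmp/embedder_cache")  # hypothetical path
    cache.load()                     # load a previously dumped cache file, if one exists
    emb = cache.get("book a table")  # returns None (the default) on a cache miss
    cache.dump()                     # persist the current cache to disk
    cache.reset()                    # reset the in-memory cache (assumed behavior)
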
add_to_cache(mean_or_max_pooled_whitelist_embs)[source]

Adds custom embeddings to the cache without triggering .encode(). For example, one can manually add max-pooled or mean-pooled embeddings to the cache. This method exists to store superficial text-encoding pairs (superficial because the encodings are not encodings of the text itself, but a combination of encodings of some list of texts from the same embedder model), for instance superficial entity embeddings computed as the average of whitelist embeddings in Entity Resolution.

Parameters: mean_or_max_pooled_whitelist_embs (dict) -- texts mapped to their corresponding superficial embeddings, each a 1D numpy array whose length equals the embedder's emb_dim
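
A hedged sketch of the Entity Resolution use case mentioned above, assuming embedder is an already loaded Embedder subclass and the texts are made up:

    import numpy as np

    whitelist = ["sf", "san fran", "san francisco bay area"]
    whitelist_embs = embedder.get_encodings(whitelist)      # list of 1D numpy arrays
    entity_emb = np.mean(np.stack(whitelist_embs), axis=0)  # mean-pooled, same emb_dim

    embedder.add_to_cache({"San Francisco": entity_emb})    # superficial text-encoding pair
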
clear_cache(cache_path=None)[source]
dump(cache_path=None)[source]
dump_cache(cache_path=None)[source]
encode(text_list)[source]
Parameters: text_list (list) -- A list of text strings for which to generate the embeddings.
Returns: A list of numpy arrays of the embeddings.
Return type: (list)
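
Because Embedder is abstract, a subclass must implement load() and encode(). A purely illustrative sketch (not part of MindMeld) that returns fixed-size pseudo-embeddings; a real subclass would load and call an actual model, and any extra wiring into MindMeld's configuration is not covered here:

    import hashlib
    import numpy as np
    from mindmeld.models.embedder_models import Embedder

    class ToyEmbedder(Embedder):
        """Illustrative only; not part of MindMeld."""

        def load(self, **kwargs):
            # A real subclass would load and return its model object here.
            return None

        def encode(self, text_list):
            # One fixed-size pseudo-embedding per input string.
            embeddings = []
            for text in text_list:
                seed = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
                rng = np.random.default_rng(seed)
                embeddings.append(rng.standard_normal(300).astype("float32"))
            return embeddings
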
find_similarity(src_texts: List[str], tgt_texts: List[str] = None, top_n: int = 20, scores_normalizer: str = None, similarity_function: Callable[[List[Any], List[Any]], numpy.ndarray] = None, _return_as_dict=False, _no_sort=False)[source]

Computes the cosine similarity between the source and target texts

Parameters:
  • src_texts (Union[str, list]) -- string or list of strings to obtain matching scores for.
  • tgt_texts (list, optional) -- list of strings to match against. If None, the existing cache is used as the target strings.
  • top_n (int, optional) -- maximum number of results to populate. If None, it equals the length of tgt_texts.
  • scores_normalizer (str, optional) -- normalizer type used to normalize scores. Allowed values are "min_max_scaler" and "standard_scaler".
  • similarity_function (function, optional) -- if None, defaults to pytorch_cos_sim. If specified, it must take two numpy-array or pytorch-tensor arguments, with an optional argument controlling whether results are returned as numpy arrays or tensors.
  • _return_as_dict (bool, optional) -- if True, results are returned as a dictionary with target texts as keys and scores as the corresponding values.
  • _no_sort (bool, optional) -- if True, results are returned without sorting. This is helpful when you want to run additional wrapper operations on the raw results and save the computational cost of sorting.
Returns:

If _return_as_dict, a dictionary of tgt_texts and their scores; otherwise a list of tuples, each consisting of a src_text paired with its similarity scores against all tgt_texts as a numpy array, sorted in descending order.

Return type:

Union[dict, list[tuple]]
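
A hedged usage sketch, assuming embedder is a loaded Embedder subclass and the texts are made up:

    src = ["cheap italian restaurant"]
    tgt = ["budget pizza place", "expensive steakhouse", "affordable trattoria"]

    results = embedder.find_similarity(src, tgt_texts=tgt, top_n=2)
    # One entry per src text, pairing it with its similarity scores against
    # the tgt texts, sorted in descending order as described above.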

get_encodings(text_list, add_to_cache=True) → List[Any][source]

Fetches the encoded values from the cache, or generates them, adding them to the cache unless add_to_cache is set to False. This method wraps .encode() with an embedding cache.

Parameters:
  • text_list (list) -- A list of text strings for which to get the embeddings.
  • add_to_cache (bool) -- If True, adds the newly generated encodings to self.cache before returning the embeddings.
Returns:

A list of numpy arrays with the embeddings.

Return type:

(list)
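
A hedged sketch of the caching behavior, again assuming embedder is a loaded Embedder subclass:

    queries = ["transfer money to savings", "show my balance"]

    first = embedder.get_encodings(queries)                         # computed and cached
    second = embedder.get_encodings(queries)                        # served from the cache
    uncached = embedder.get_encodings(queries, add_to_cache=False)  # computed but not stored
    embedder.dump_cache()                                           # persist the cache to disk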

static get_hashid(**kwargs)[source]
load(**kwargs)[source]

Loads the embedder model

Returns: The model object.
load_cache(cache_path=None)[source]
static pytorch_cos_sim(src_vecs, tgt_vecs, return_tensor=False)[source]

Computes the cosine similarity between two 2D matrices

Parameters:
  • src_vecs -- a 2D numpy array or pytorch tensor
  • tgt_vecs -- a 2D numpy array or pytorch tensor
  • return_tensor -- if False, the cosine similarities are returned as a 2D numpy array; otherwise a 2D tensor is returned
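
A small sketch of the static helper using random vectors, for illustration only:

    import numpy as np
    from mindmeld.models.embedder_models import Embedder

    src_vecs = np.random.rand(2, 300).astype("float32")  # 2 source vectors
    tgt_vecs = np.random.rand(5, 300).astype("float32")  # 5 target vectors

    scores = Embedder.pytorch_cos_sim(src_vecs, tgt_vecs)  # numpy output by default
    print(scores.shape)  # expected (2, 5): one row of similarities per source vector
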
model_id

Returns a unique hash representation of the embedder model based on its name and configs

class mindmeld.models.embedder_models.GloveEmbedder(app_path=None, cache_path=None, **kwargs)[source]

Bases: mindmeld.models.embedder_models.Embedder

Encoder class for GloVe embeddings as described here: https://nlp.stanford.edu/projects/glove/

dump(cache_path=None)[source]

Dumps the cache to disk.

encode(text_list)[source]
Parameters: text_list (list) -- A list of text strings for which to generate the embeddings.
Returns: A list of numpy arrays of the embeddings.
Return type: (list)
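
A hedged usage sketch; the constructor arguments shown are assumptions, and where the GloVe vectors are sourced from (and how long loading takes) depends on your app configuration:

    from mindmeld.models.embedder_models import GloveEmbedder

    glove = GloveEmbedder(cache_path="/tmp/glove_cache")  # hypothetical cache location
    vectors = glove.encode(["order a large pepperoni pizza"])
    print(vectors[0].shape)  # 300-dimensional by default (DEFAULT_EMBEDDING_DIM)
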
load()[source]

Loads the embedder model

Returns: The model object.
DEFAULT_EMBEDDING_DIM = 300
model_id

Returns a unique hash representation of the embedder model based on its name and configs