mindmeld.models package

class mindmeld.models.ModelFactory[source]

Bases: object

Factory class that identifies the appropriate text/tagger model from text_models.py/tagger_models.py and loads it, based either on the supplied configs or on a previously dumped configs file.

The .create_model_from_config() method loads the appropriate model when a ModelConfig is passed. The .create_model_from_path() method uses AbstractModel's load method to load a dumped config, which is then used to load the appropriate model and return it through a metadata dictionary object.
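
The dispatch this class performs can be sketched as a plain-Python registry; the registry keys, class names, and decorator below are illustrative stand-ins, not MindMeld's actual internals:

```python
# Minimal sketch of factory dispatch from a model_type key to a model
# class. Names here are illustrative, not MindMeld's real classes.
MODEL_REGISTRY = {}

def register_model(model_type):
    def decorator(cls):
        MODEL_REGISTRY[model_type] = cls
        return cls
    return decorator

@register_model("text")
class TextModel:
    def __init__(self, config):
        self.config = config

@register_model("tagger")
class TaggerModel:
    def __init__(self, config):
        self.config = config

def create_model_from_config(config):
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Invalid model configs: unknown model type {model_type!r}")
    return MODEL_REGISTRY[model_type](config)

model = create_model_from_config({"model_type": "text"})
```

The decorator-based registration mirrors the role of register_models(): model classes announce themselves to the factory rather than the factory hard-coding every class.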

classmethod create_model_from_config(model_config: Union[dict, mindmeld.models.model.ModelConfig]) → Type[mindmeld.models.model.AbstractModel][source]

Instantiates and returns a valid model from the specified model configs

Parameters:model_config (Union[dict, ModelConfig]) -- Model configs supplied either as a dict or as an instance of ModelConfig
Returns:A text/tagger model instance
Return type:model (Type[AbstractModel])
Raises:ValueError -- When the configs are invalid
classmethod create_model_from_path(path: str) → Union[None, Type[mindmeld.models.model.AbstractModel]][source]

Loads and returns a model from the specified path

Parameters:path (str) -- A pickle file path from where a model can be loaded
Returns:
Returns None when the specified path is not found or when the model loaded
from the specified path is None. If a valid config and a valid model are found, the model is loaded by calling its .load() method and returned
Return type:model (Union[None, Type[AbstractModel]])
Raises:ValueError -- When the path is invalid
static register_models() → None[source]
class mindmeld.models.Embedder(app_path=None, cache_path=None, **kwargs)[source]

Bases: abc.ABC

Base class for embedder model

class EmbeddingsCache(cache_path=None)[source]

Bases: object

clear(cache_path=None)[source]

Deletes the cache file.

dump(cache_path=None)[source]

Dumps the cache to disk.

get(text, default=None)[source]
load(cache_path=None)[source]

Loads the cache file.

reset()[source]
add_to_cache(mean_or_max_pooled_whitelist_embs)[source]

Adds custom embeddings to the cache without triggering .encode(). For example, one can manually add some max-pooled or mean-pooled embeddings to the cache. This method exists to support storing superficial text-encoding pairs (superficial because the encodings are not the encodings of the text itself but a combination of encodings of some list of texts from the same embedder model), for example to add superficial entity embeddings as the average of whitelist embeddings in Entity Resolution.

Parameters:mean_or_max_pooled_whitelist_embs (dict) -- A mapping from texts to their corresponding superficial embeddings, each a 1D numpy array whose length equals the embedder's emb_dim
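
A minimal numpy sketch of building that argument, assuming a hypothetical entity whose whitelist texts have already been embedded (the entity name, texts, and 3-dimensional vectors are illustrative only):

```python
import numpy as np

# Sketch: derive a superficial entity embedding by mean-pooling the
# embeddings of its whitelist texts, then shape it as the dict that
# add_to_cache() expects. All names and values are illustrative.
whitelist_embs = np.array([
    [0.2, 0.4, 0.6],   # embedding of whitelist text "SF"
    [0.4, 0.6, 0.8],   # embedding of whitelist text "San Fran"
])  # shape: (num_whitelist_texts, emb_dim)

entity_emb = whitelist_embs.mean(axis=0)  # 1D array of length emb_dim

mean_pooled_whitelist_embs = {"san francisco": entity_emb}
# cache.add_to_cache(mean_pooled_whitelist_embs)  # hypothetical usage
```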
clear_cache(cache_path=None)[source]
dump(cache_path=None)[source]
dump_cache(cache_path=None)[source]
encode(text_list)[source]
Parameters:text_list (list) -- A list of text strings for which to generate the embeddings.
Returns:A list of numpy arrays of the embeddings.
Return type:(list)
find_similarity(src_texts: List[str], tgt_texts: List[str] = None, top_n: int = 20, scores_normalizer: str = None, similarity_function: Callable[[List[Any], List[Any]], numpy.ndarray] = None, _return_as_dict=False, _no_sort=False)[source]

Computes the cosine similarity

Parameters:
  • src_texts (Union[str, list]) -- A string or list of strings to obtain matching scores for.
  • tgt_texts (list, optional) -- List of strings to match against. If None, the existing cache is used as the target strings
  • top_n (int, optional) -- Maximum number of results to populate. If None, equals the length of tgt_texts
  • scores_normalizer (str, optional) -- Normalizer type to normalize scores. Allowed values are: "min_max_scaler", "standard_scaler"
  • similarity_function (function, optional) -- If None, defaults to pytorch_cos_sim. If specified, must take two numpy-array/pytorch-tensor arguments for similarity computation, with an optional argument to return results as numpy arrays or tensors
  • _return_as_dict (bool, optional) -- If True, results are returned as a dictionary with tgt_texts as keys and their scores as values
  • _no_sort (bool, optional) -- If True, results are returned without sorting. This is helpful when you wish to do additional wrapper operations on top of the raw results and want to save the computational cost of sorting.
Returns:

If _return_as_dict, a dictionary mapping tgt_texts to their scores; otherwise a list of tuples, each pairing a src_text with its similarity scores against all tgt_texts as a numpy array (sorted in descending order)

Return type:

Union[dict, list[tuple]]
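
The two allowed scores_normalizer values correspond to standard scaling recipes. A numpy sketch of what they compute (the exact formulas MindMeld applies are an assumption, not confirmed by this reference):

```python
import numpy as np

# Illustrative similarity scores to normalize.
scores = np.array([0.2, 0.5, 0.9])

# "min_max_scaler": rescale scores into [0, 1].
min_max = (scores - scores.min()) / (scores.max() - scores.min())

# "standard_scaler": shift to zero mean and unit variance.
standard = (scores - scores.mean()) / scores.std()
```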

get_encodings(text_list, add_to_cache=True) → List[Any][source]

Fetches the encoded values from the cache, or generates them and adds them to the cache unless add_to_cache is set to False. This method wraps .encode() with an embedding cache.

Parameters:
  • text_list (list) -- A list of text strings for which to get the embeddings.
  • add_to_cache (bool) -- If True, adds the encodings to self.cache before returning them
Returns:

A list of numpy arrays with the embeddings.

Return type:

(list)
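
The cache-then-encode flow can be sketched with a plain dict cache and a stand-in encoder; neither is MindMeld's actual implementation:

```python
# Sketch of get_encodings(): serve from cache, encode only the misses,
# and optionally add new encodings back to the cache.
cache = {}

def fake_encode(texts):
    # Stand-in for Embedder.encode(); one dummy vector per text.
    return [[float(len(t))] for t in texts]

def get_encodings(text_list, add_to_cache=True):
    missing = [t for t in text_list if t not in cache]
    if missing:
        new_encodings = dict(zip(missing, fake_encode(missing)))
        if add_to_cache:
            cache.update(new_encodings)
        else:
            # Compute the misses but leave the cache untouched.
            return [cache.get(t, new_encodings.get(t)) for t in text_list]
    return [cache[t] for t in text_list]

vecs = get_encodings(["hi", "hello"])
```

Only the texts absent from the cache hit the encoder, so repeated lookups of the same strings avoid recomputing embeddings.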

static get_hashid(**kwargs)[source]
load(**kwargs)[source]

Loads the embedder model

Returns:The model object.
load_cache(cache_path=None)[source]
static pytorch_cos_sim(src_vecs, tgt_vecs, return_tensor=False)[source]

Computes the cosine similarity for 2d matrices

Parameters:
  • src_vecs -- a 2d numpy array or pytorch tensor
  • tgt_vecs -- a 2d numpy array or pytorch tensor
  • return_tensor -- If False, returns the cosine similarity as a 2D numpy array; if True, returns a 2D tensor
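
A numpy sketch of the computation (the real method uses pytorch, but the result is the same row-wise cosine similarity):

```python
import numpy as np

# result[i, j] is the cosine similarity between src_vecs[i] and
# tgt_vecs[j]: normalize each row to unit length, then take the
# matrix product.
def cos_sim(src_vecs, tgt_vecs):
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return src @ tgt.T

src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.0], [1.0, 1.0]])
sims = cos_sim(src, tgt)  # shape (2, 2)
```
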
model_id

Returns a unique hash representation of the embedder model based on its name and configs
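
One common way to derive such a stable id, using only the standard library, is to hash a canonical serialization of the name and configs; how MindMeld actually computes get_hashid and model_id is an assumption here:

```python
import hashlib
import json

# Sketch: a deterministic hash id from keyword configs. Sorting the
# keys makes the id independent of argument order.
def get_hashid(**kwargs):
    serialized = json.dumps(kwargs, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

id_a = get_hashid(embedder_type="bert", model_name="some-model")
id_b = get_hashid(model_name="some-model", embedder_type="bert")
```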

class mindmeld.models.ModelConfig(model_type: str = None, example_type: str = None, label_type: str = None, features: Dict = None, model_settings: Dict = None, params: Dict = None, param_selection: Dict = None, train_label_set: Pattern[str] = None, test_label_set: Pattern[str] = None)[source]

Bases: object

A value object representing a model configuration.

model_type

str -- The name of the model type. Will be used to find the model class to instantiate

example_type

str -- The type of the examples which will be passed into fit() and predict(). Used to select feature extractors

label_type

str -- The type of the labels which will be passed into fit() and returned by predict(). Used to select the label encoder

model_settings

dict -- Settings specific to the model type specified

params

dict -- Params to pass to the underlying classifier

param_selection

dict -- Configuration for param selection (using cross validation) {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {} }

features

dict -- The keys are the names of feature extractors and the values are either a kwargs dict which will be passed into the feature extractor function, or a callable which will be used to extract features

train_label_set

regex pattern -- The regex pattern for finding training file names.

test_label_set

regex pattern -- The regex pattern for finding testing file names.
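
Tying the attributes together, a typical text-model configuration might look like the dict below; the specific feature names and classifier settings shown are illustrative examples, not the only valid values:

```python
# Illustrative model configuration. Keys mirror the ModelConfig
# attributes documented above; the values are typical examples.
config = {
    "model_type": "text",
    "example_type": "query",
    "label_type": "class",
    "model_settings": {"classifier_type": "logreg"},
    "params": {"C": 10},
    "features": {"bag-of-words": {"lengths": [1, 2]}},
}
# model_config = ModelConfig(**config)  # hypothetical usage
```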

get_ngram_lengths_and_thresholds(rname: str) → Tuple[source]

Returns the n-gram lengths and thresholds to extract, in order to optimize resource collection

Parameters:rname (string) -- Name of the resource
Returns:tuple containing:
  • lengths (list of int): list of n-gram lengths to be extracted
  • thresholds (list of int): thresholds to be applied to corresponding n-gram lengths
Return type:(tuple)
required_resources() → Set[source]

Returns the resources this model requires

Returns:set of required resources for this model
Return type:set
resolve_config(new_config: mindmeld.models.model.ModelConfig)[source]

This method resolves any config incompatibility issues by loading the latest settings from the app config into the current config

Parameters:new_config (ModelConfig) -- The ModelConfig representing the app's latest config
to_dict() → Dict[source]

Converts the model config object into a dict

Returns:A dict version of the config
Return type:dict
to_json() → str[source]

Converts the model config object to JSON

Returns:JSON representation of the config
Return type:str
example_type
features
label_type
model_settings
model_type
param_selection
params
test_label_set
train_label_set
mindmeld.models.create_model(config)[source]

Creates a model instance using the provided configuration

Parameters:config (ModelConfig) -- A model configuration
Returns:a configured model
Return type:Model
Raises:ValueError -- When model configuration is invalid
mindmeld.models.load_model(path)[source]

Loads a model from a specified path

Parameters:path (str) -- A path where the model configuration is pickled along with other metadata
Returns:
Metadata loaded from the path, containing the configured model under the 'model' key
and the model configs under the 'model_config' key, along with other keys
Return type:dict
Raises:ValueError -- When model configuration is invalid
mindmeld.models.create_embedder_model(app_path, config)[source]

Creates and loads an embedder model

Parameters:
  • app_path (str) -- Path to the application directory
  • config (dict) -- Model settings passed in as a dictionary, with 'embedder_type' as a required key
Returns:An instance of appropriate embedder class
Return type:Embedder
Raises:ValueError -- When model configuration is invalid or required key is missing
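
A minimal sketch of the config argument; 'embedder_type' is the required key per the description above, and the other key shown is an assumed example of a typical setting:

```python
# Illustrative embedder config: 'embedder_type' is required; any
# additional keys (shown here as an assumption) tune the embedder.
embedder_config = {
    "embedder_type": "glove",
    "token_embedding_dimension": 300,
}
# embedder = create_embedder_model(app_path, embedder_config)  # hypothetical usage
```
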
mindmeld.models.register_embedder(embedder_type, embedder)[source]

Registers an embedder class against the given embedder_type name so that it can be instantiated via create_embedder_model