mindmeld.active_learning.heuristics module

This module contains query selection heuristics for the Active Learning Pipeline.

class mindmeld.active_learning.heuristics.DisagreementSampling[source]

Bases: abc.ABC

static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

DisagreementSampling requires confidences from more than one model (confidences_3d), so it cannot rank a single 2d confidence array.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Finds the most frequent class label for a given element across all models. Calculates the agreement per element (the percentage of models that voted for the most frequent class). Ranks elements from highest to lowest disagreement.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
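
The voting scheme described for rank_3d can be sketched as follows. This is a minimal illustration and not MindMeld's implementation: the function name `disagreement_rank_3d` is hypothetical, and it assumes confidences_3d is indexed as [model][element][class].

```python
from collections import Counter
from typing import List


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def disagreement_rank_3d(confidences_3d: List[List[List[float]]]) -> List[int]:
    """Rank element indices from highest to lowest disagreement.

    Assumed layout (not confirmed by the docs): [model][element][class].
    """
    n_models = len(confidences_3d)
    n_elements = len(confidences_3d[0])
    agreement = []
    for i in range(n_elements):
        # Each model votes for its highest-confidence class.
        votes = [argmax(model[i]) for model in confidences_3d]
        top_count = Counter(votes).most_common(1)[0][1]
        agreement.append(top_count / n_models)
    # Lowest agreement (highest disagreement) first; sorted() is stable.
    return sorted(range(n_elements), key=lambda i: agreement[i])


three_models = [
    [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]],
    [[0.7, 0.3], [0.4, 0.6], [0.3, 0.7]],
]
print(disagreement_rank_3d(three_models))  # [1, 0, 2]
```

Only element 1 splits the vote (two models against one), so it is ranked first.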
class mindmeld.active_learning.heuristics.EnsembleSampling[source]

Bases: abc.ABC

static get_heuristics_2d() → tuple[source]
static get_heuristics_3d() → tuple[source]
static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

Combine ranks from all heuristics that can support ranking given 2d confidence input.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Combine ranks from all heuristics that can support ranking given 3d confidence input.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
class mindmeld.active_learning.heuristics.EntropySampling[source]

Bases: abc.ABC

static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

Calculates the entropy score of the confidences per element. Elements are ranked from highest to lowest entropy.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
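
As an illustration of the entropy ranking described above, here is a minimal sketch using Shannon entropy with the natural logarithm. The function names are hypothetical, and the log base used by the library is an assumption (the base does not change the ranking).

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy of one distribution; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_rank_2d(confidences_2d: List[List[float]]) -> List[int]:
    """Rank element indices from highest to lowest entropy."""
    scores = [entropy(row) for row in confidences_2d]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


print(entropy_rank_2d([[0.5, 0.5], [0.9, 0.1], [0.7, 0.3]]))  # [0, 2, 1]
```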
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Calculates the entropy score of the confidences per element. Elements are ranked from highest to lowest entropy. This is done for each confidences_2d in confidences_3d, and the resulting rankings are added to produce a final ranking.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_entities(entity_confidences: List[List[List[float]]]) → List[int][source]

Calculates the entropy score of the entity confidences per element. Elements are ranked from highest to lowest entropy.

Returns:Ranked lists based on either Token Entropy (the average of per-token entropies across a query) or Total Token Entropy (the sum of token entropies across a query).
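
The two scores named above differ only in whether per-token entropies are averaged or summed. A minimal sketch follows, assuming entity_confidences is indexed as [query][token][tag]. The function name and the `total` flag are hypothetical and are not part of the documented signature.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy of one tag distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_entropy_rank(
    entity_confidences: List[List[List[float]]], total: bool = False
) -> List[int]:
    """Rank queries from highest to lowest token entropy.

    total=False averages per-token entropies (Token Entropy);
    total=True sums them (Total Token Entropy).
    """
    scores = []
    for query in entity_confidences:
        token_entropies = [entropy(token) for token in query]
        score = sum(token_entropies)
        if not total and token_entropies:
            score /= len(token_entropies)
        scores.append(score)
    return sorted(range(len(scores)), key=lambda i: -scores[i])


queries = [
    [[0.5, 0.5]],                          # one very uncertain token
    [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]],  # three mildly uncertain tokens
]
print(token_entropy_rank(queries))              # [0, 1]
print(token_entropy_rank(queries, total=True))  # [1, 0]
```

The sum favors longer queries, while the average does not, which is why the two rankings disagree in this example.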
class mindmeld.active_learning.heuristics.Heuristic[source]

Bases: abc.ABC

Base class for the heuristics used as Active Learning query selection strategies.

static ordered_indices_list_to_final_rank(ordered_sample_indices_list: List[List[int]])[source]

Converts multiple lists of ordered indices to a final rank.

Parameters:ordered_sample_indices_list (List[List[int]]) -- Multiple lists of ordered sample indices.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
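
One plausible way to merge several orderings, consistent with the "rankings are added" wording used elsewhere in this module, is to sum each element's position across the orderings and sort by that sum. The sketch below is an assumption about the approach, not the library's code, and `combine_rankings` is a hypothetical name.

```python
from typing import List


def combine_rankings(ordered_indices_list: List[List[int]]) -> List[int]:
    """Merge several orderings of the same indices into one final ranking.

    Each ordering must be a permutation of range(n).
    """
    n_elements = len(ordered_indices_list[0])
    rank_sum = [0] * n_elements
    for ordering in ordered_indices_list:
        for position, index in enumerate(ordering):
            rank_sum[index] += position
    # Smallest summed position first; ties keep index order.
    return sorted(range(n_elements), key=lambda i: rank_sum[i])


print(combine_rankings([[2, 0, 1], [2, 1, 0], [0, 2, 1]]))  # [2, 0, 1]
```

Here element 2 has summed position 1, element 0 has 3, and element 1 has 5.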
static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

Ranking method for 2d confidence arrays.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Ranking method for 3d confidence arrays.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
class mindmeld.active_learning.heuristics.HeuristicsFactory[source]

Bases: object

Heuristics Factory Class

static get_heuristic(heuristic) → mindmeld.active_learning.heuristics.Heuristic[source]

A static method to get a Heuristic class.

Parameters:heuristic (str) -- Name of the desired Heuristic class
Returns:Heuristic Class
Return type:(Heuristic)
class mindmeld.active_learning.heuristics.KLDivergenceSampling[source]

Bases: abc.ABC

static get_divergences_per_element_no_segments(confidences_3d: List[List[List[float]]]) → List[List[float]][source]
Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Divergences per model for each element.
Return type:divergences (List[List[float]])
static get_divergences_per_element_with_segments(confidences_3d: List[List[List[float]]], confidence_segments: Dict) → List[List[float]][source]

Calculates divergences within the segments defined in confidence_segments, where p_d is the probability distribution within class X and q_d is the mean probability distribution for class X. Divergence(p_d, q_d) is calculated for each element in all classes.

Parameters:
  • confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
  • confidence_segments (Dict[(str, Tuple(int,int))]) -- A dictionary mapping each segment to the indices over which KL Divergence is run.
Returns:Divergences per model for each element.
Return type:divergences (List[List[float]])

static get_domain(confidence_segments: Dict, row: List[List[float]]) → str[source]

Gets the domain for a given probability row, inferred from its non-zero values.

Parameters:
  • confidence_segments (Dict) -- A mapping from domains (str) to the corresponding indices in the probability vector. Used for intent-level KLD.
  • row (List[List[float]]) -- A single row representing a query's probability distribution.
Returns:The domain that the given row belongs to.
Return type:domain (str)
Raises:AssertionError -- If a row does not have an associated domain.
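
The lookup described for get_domain can be sketched as below. Several details are assumptions rather than documented behavior: the row is treated as a flat list of floats, each segment is a half-open (start, end) index range, and the first segment containing a non-zero value wins. The name `infer_domain` is hypothetical.

```python
from typing import Dict, List, Tuple


def infer_domain(
    confidence_segments: Dict[str, Tuple[int, int]], row: List[float]
) -> str:
    """Return the domain whose index range holds the row's non-zero values."""
    for domain, (start, end) in confidence_segments.items():
        # Assumed convention: indices start..end-1 belong to this domain.
        if any(p != 0 for p in row[start:end]):
            return domain
    raise AssertionError("Row does not have an associated domain.")


segments = {"greeting": (0, 2), "banking": (2, 5)}
print(infer_domain(segments, [0.0, 0.0, 0.2, 0.5, 0.3]))  # banking
```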
static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

KLDivergenceSampling requires confidences from more than one model (confidences_3d), so it cannot rank a single 2d confidence array.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_3d(confidences_3d: List[List[List[float]]], confidence_segments: Dict = None) → List[int][source]

Calculates the KL Divergence between the average confidence distribution across all models for a given class and the confidence distribution for a given element in said class. Elements are ranked from highest to lowest divergence.

Parameters:
  • confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
  • confidence_segments (Dict[(str, Tuple(int,int))]) -- A dictionary mapping each segment to the indices over which KL Divergence is run.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
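
For intuition, here is a sketch of a common query-by-committee form of KL divergence sampling, without segments. It is a simplification and may differ from the library: the reference distribution q is taken as the mean over models for the same element, confidences_3d is assumed to be indexed as [model][element][class], and per-model divergences are reduced with max (the docs do not state how they are aggregated). All names are hypothetical.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) with the natural log; terms where p is 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def kl_rank_3d(confidences_3d: List[List[List[float]]]) -> List[int]:
    """Rank element indices from highest to lowest divergence."""
    n_models = len(confidences_3d)
    n_elements = len(confidences_3d[0])
    scores = []
    for i in range(n_elements):
        rows = [model[i] for model in confidences_3d]
        # Consensus distribution: class-wise mean across models.
        mean = [sum(column) / n_models for column in zip(*rows)]
        # Assumption: score an element by its most divergent model.
        scores.append(max(kl_divergence(row, mean) for row in rows))
    return sorted(range(n_elements), key=lambda i: -scores[i])


two_models = [
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9]],
]
print(kl_rank_3d(two_models))  # [1, 0]
```

The models agree exactly on element 0 (zero divergence) and contradict each other on element 1, so element 1 is ranked first.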

class mindmeld.active_learning.heuristics.LeastConfidenceSampling[source]

Bases: abc.ABC

static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

Calculates the highest (max) confidence per element, then ranks the elements from lowest to highest max confidence.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
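
A minimal sketch of least-confidence ranking as described above. The function name is hypothetical and this is not the library's code.

```python
from typing import List


def least_confidence_rank_2d(confidences_2d: List[List[float]]) -> List[int]:
    """Rank element indices from lowest to highest max confidence."""
    max_confidences = [max(row) for row in confidences_2d]
    return sorted(range(len(max_confidences)), key=lambda i: max_confidences[i])


print(least_confidence_rank_2d([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]))  # [1, 2, 0]
```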
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Calculates the highest (max) confidence per element, then ranks the elements from lowest to highest max confidence. This is done for each confidences_2d in confidences_3d, and the resulting rankings are added to produce a final ranking.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_entities(entity_confidences: List[List[List[float]]]) → List[int][source]
class mindmeld.active_learning.heuristics.MarginSampling[source]

Bases: abc.ABC

static beam_search_decoder(predictions, top_k=3)[source]
static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

Calculates the "margin" or difference between the highest and second highest confidence score per element. Elements are ranked from lowest to highest margin.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
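
A minimal sketch of margin ranking as described above, assuming every row has at least two classes. The function name is hypothetical.

```python
from typing import List


def margin_rank_2d(confidences_2d: List[List[float]]) -> List[int]:
    """Rank element indices from lowest to highest margin."""
    margins = []
    for row in confidences_2d:
        # Two largest confidences in the row.
        top, second = sorted(row, reverse=True)[:2]
        margins.append(top - second)
    return sorted(range(len(margins)), key=lambda i: margins[i])


rows = [[0.9, 0.1, 0.0], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1]]
print(margin_rank_2d(rows))  # [1, 2, 0]
```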
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Calculates the "margin" or difference between the highest and second highest confidence score per element. Elements are ranked from lowest to highest margin. This is done for each confidences_2d in confidences_3d, and the resulting rankings are added to produce a final ranking.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_entities(entity_confidences: List[List[List[float]]]) → List[int][source]

Ranks queries by Margin Sampling for tag sequences. Beam search is used to obtain the two most probable tag sequences for each query, based on the query's entity confidences, and the margin is calculated between these top two sequences. (For more information about this method, see https://dl.acm.org/doi/pdf/10.5555/1613715.1613855)
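
The sequence-level margin can be illustrated with a small beam search. This sketch makes assumptions the docs do not confirm: entity_confidences is indexed as [query][token][tag], tokens are scored independently (no transition scores), a sequence's probability is the product of its token probabilities, and the margin is the difference between the top two sequence probabilities. The helper below is written for this example and is not the module's beam_search_decoder.

```python
import math
from typing import List, Tuple


def beam_search(token_probs: List[List[float]], top_k: int = 3) -> List[Tuple[List[int], float]]:
    """Return up to top_k (tag_sequence, log_probability) pairs, best first."""
    beams: List[Tuple[List[int], float]] = [([], 0.0)]
    for probs in token_probs:
        candidates = [
            (sequence + [tag], score + math.log(p))
            for sequence, score in beams
            for tag, p in enumerate(probs)
            if p > 0
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:top_k]
    return beams


def sequence_margin_rank(entity_confidences: List[List[List[float]]], top_k: int = 3) -> List[int]:
    """Rank queries from lowest to highest margin between their top two sequences."""
    margins = []
    for query in entity_confidences:
        beams = beam_search(query, top_k)
        best = math.exp(beams[0][1])
        second = math.exp(beams[1][1]) if len(beams) > 1 else 0.0
        margins.append(best - second)
    return sorted(range(len(margins)), key=lambda i: margins[i])


tagged_queries = [
    [[0.9, 0.1], [0.8, 0.2]],  # top two sequences: 0.72 vs 0.18
    [[0.5, 0.5], [0.6, 0.4]],  # top two sequences: 0.30 vs 0.30
]
print(sequence_margin_rank(tagged_queries))  # [1, 0]
```

The second query has two equally likely best sequences, so its margin is zero and it is ranked first. top_k must be at least 2 for the margin to be meaningful.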

class mindmeld.active_learning.heuristics.RandomSampling[source]

Bases: abc.ABC

static random_rank(num_elements: int) → List[int][source]

Randomly shuffles indices.

Parameters:num_elements (int) -- Number of elements to randomly sample.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_2d(confidences_2d: List[List[float]]) → List[int][source]

Randomly shuffles indices.

Parameters:confidences_2d (List[List[float]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
static rank_3d(confidences_3d: List[List[List[float]]]) → List[int][source]

Randomly shuffles indices.

Parameters:confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element.
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
mindmeld.active_learning.heuristics.stratified_random_sample(labels: List) → List[int][source]

Reorders indices in an evenly repeating pattern for as long as possible, then shuffles and appends the remaining labels. The first part of the list maintains a uniform distribution across labels; however, since the labels may not be perfectly balanced, the remaining portion has a distribution similar to that of the original data.

|-------- Evenly Repeating --------||--- Shuffled Remaining ----|

For example: ["R","B","C","R","B","C","R","B","C","B","R","R","B","B","B","R"]

Parameters:labels (List[str or int]) -- A list of labels (e.g., labels = ["R", "B", "B", "C"]).
Returns:Indices corresponding to elements ranked by the heuristic.
Return type:ranked_indices (List[int])
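
The reordering described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the name `stratified_order` and the `seed` parameter are hypothetical, and shuffling within each label before interleaving is a choice made for this example.

```python
import random
from collections import defaultdict
from typing import List, Optional


def stratified_order(labels: List, seed: Optional[int] = None) -> List[int]:
    """Return indices: evenly repeating labels first, shuffled remainder after."""
    if not labels:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    for indices in by_label.values():
        rng.shuffle(indices)
    # The repeating pattern can last only as long as the rarest label.
    rounds = min(len(indices) for indices in by_label.values())
    evenly = [indices[r] for r in range(rounds) for indices in by_label.values()]
    remaining = [i for indices in by_label.values() for i in indices[rounds:]]
    rng.shuffle(remaining)
    return evenly + remaining


labels = ["R", "B", "B", "C", "R", "B"]
order = stratified_order(labels, seed=0)
print([labels[i] for i in order[:3]])  # one of each label: ['R', 'B', 'C']
```

With these labels the rarest class ("C") appears once, so the evenly repeating part is a single round of three indices and the other three indices are shuffled at the end.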