mindmeld.gazetteer module
class mindmeld.gazetteer.Gazetteer(name, text_preparation_pipeline, exclude_ngrams=False)
Bases: object
This class holds the following fields, which are extracted and exported to file.
entity_count (int) -- Total entities in the file.
pop_dict (dict) -- A dictionary with the entity name as the key and the popularity score as the value. If more than one entity has the same name, the popularity is the maximum value across all duplicate entities.
index (dict) -- A dictionary containing the inverted index, which maps terms and n-grams to the set of documents that contain them.
entities (list) -- A list of all entities.
sys_types (set) -- The set of nested numeric types for this entity.
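The pop_dict and index fields can be illustrated with a small standalone sketch. It does not call MindMeld. The records, the ngrams helper, and the choice to use entity names as the indexed "documents" are assumptions made for illustration, and the library's internal representation may differ.

```python
from collections import defaultdict

# Hypothetical (entity name, popularity) records, including a duplicate name.
records = [("pad thai", 0.4), ("green curry", 0.9), ("pad thai", 0.7)]

# pop_dict: keep the maximum popularity across duplicate entity names.
pop_dict = {}
for name, popularity in records:
    pop_dict[name] = max(popularity, pop_dict.get(name, popularity))


def ngrams(tokens):
    """Yield every contiguous n-gram of the token list as a string."""
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            yield " ".join(tokens[start:end])


# index: map each term and n-gram to the set of entities that contain it.
# Entity names stand in for documents here (an assumption).
index = defaultdict(set)
for name in pop_dict:
    for gram in ngrams(name.split()):
        index[gram].add(name)
```

Looking up a single term such as index["thai"] then returns every entity whose name contains that term.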
dump(gaz_path)
Persists the gazetteer to disk.
Parameters: gaz_path (str) -- The location on disk where the gazetteer should be stored.
from_dict(serialized_gaz)
Deserializes a gazetteer object from a dictionary, using deep copy operations.
Parameters: serialized_gaz (dict) -- The serialized gazetteer object.
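Why deep copies matter here can be shown with a minimal sketch. It is not the library's implementation: the key names in serialized_gaz are hypothetical, and the dictionary comprehension only stands in for whatever from_dict does internally.

```python
import copy

# Hypothetical serialized form; the real key names are not given here.
serialized_gaz = {"pop_dict": {"pad thai": 0.7}, "entities": ["pad thai"]}

# Deep-copy each value so the restored state shares nothing with the input.
restored = {key: copy.deepcopy(value) for key, value in serialized_gaz.items()}

# Mutating the input afterwards leaves the restored copy untouched.
serialized_gaz["pop_dict"]["pad thai"] = 0.0
serialized_gaz["entities"].append("green curry")
```

With a shallow copy, the nested dictionary and list would still be shared, so later changes to the input would leak into the restored state.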
load(gaz_path)
Loads the gazetteer from disk.
Parameters: gaz_path (str) -- The location on disk where the gazetteer is stored.
update_with_entity_data_file(filename, popularity_cutoff, normalizer)
Updates this gazetteer with data from an entity data file.
Parameters:
update_with_entity_map(mapping, normalizer, update_if_missing_canonical=True)
Updates the gazetteer with a list of normalized key-value pairs from the input mapping list.
Parameters:
- mapping (list) -- A list of dicts containing the canonical names and whitelists of a particular entity.
- normalizer (func) -- A QueryFactory normalization function used to normalize the input mapping data before it is added to the gazetteer.
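The following sketch shows one way a mapping list could be flattened into normalized key-value pairs. Several details are assumptions rather than documented behavior: the "cname" and "whitelist" key names, the stand-in normalize function (the real normalizer comes from QueryFactory), and the choice to use the whitelist entry as the key and the canonical name as the value.

```python
def normalize(text):
    """Stand-in normalizer (assumption): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


# Hypothetical mapping list: one dict per canonical name with its whitelist.
mapping = [
    {"cname": "Pad Thai", "whitelist": ["Phad Thai", "pad  thai noodles"]},
    {"cname": "Green Curry", "whitelist": ["Thai green curry"]},
]

# Flatten into (normalized whitelist entry, normalized canonical name) pairs.
pairs = []
for entry in mapping:
    canonical = normalize(entry["cname"])
    for synonym in entry.get("whitelist", []):
        pairs.append((normalize(synonym), canonical))
```

Normalizing both sides before insertion means lookups against normalized query text can match whitelist entries regardless of their original casing or spacing.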
class mindmeld.gazetteer.NestedGazetteer(start_token_index, end_token_index_plus_one, gaz_name, token_ngram, raw_ngram)
Bases: object
This class represents a gazetteer entry corresponding to a Query object.
end_token_index_plus_one
gaz_name
raw_ngram
start_token_index
token_ngram
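The attribute names suggest a half-open token span, where end_token_index_plus_one is one past the last matched token. The sketch below uses a namedtuple stand-in, not the real class. It also assumes that token_ngram holds normalized text and raw_ngram holds the original text, which the names imply but this page does not state.

```python
from collections import namedtuple

# Stand-in with the same field names as NestedGazetteer (not the real class).
Span = namedtuple(
    "Span",
    ["start_token_index", "end_token_index_plus_one",
     "gaz_name", "token_ngram", "raw_ngram"],
)

raw_tokens = ["Order", "Pad", "Thai", "please"]
norm_tokens = [token.lower() for token in raw_tokens]

# Half-open span [1, 3) covers the tokens at positions 1 and 2.
start, end_plus_one = 1, 3
span = Span(
    start,
    end_plus_one,
    "dish",  # hypothetical gazetteer name
    " ".join(norm_tokens[start:end_plus_one]),
    " ".join(raw_tokens[start:end_plus_one]),
)

# With a half-open span, the token count is a plain difference.
num_tokens = span.end_token_index_plus_one - span.start_token_index
```

Storing the end index as "plus one" lets the pair be used directly as a Python slice.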