Working with the Auto Annotator

The Auto Annotator

  • is a tool to automatically annotate or unannotate select entities across all labelled data in an application.
  • supports the development of custom Annotators.

Note

The examples in this section require the HR Assistant blueprint application. To get the app, open a terminal and run mindmeld blueprint hr_assistant. Examples related to the MultiLingualAnnotator require the Health Screening blueprint application. To get the app, open a terminal and run mindmeld blueprint screening_app.

Warning

Changes made by an Auto Annotator cannot be undone, and MindMeld does not back up query data. We recommend using version control software such as Git.

Quick Start

This section briefly explains the use of the annotate and unannotate commands. For more details, please read the next section.

Before annotating the data, we will first remove all existing annotations using unannotate. Be sure to include the --unannotate_all flag when running the following command on the command line.

Command-line:

mindmeld unannotate --app-path "hr_assistant" --unannotate_all

We can now proceed to annotate our data using the command below.

Command-line:

mindmeld annotate --app-path "hr_assistant"

The following section explains this same process in more detail.

Using the Auto Annotator

The Auto Annotator can be used by importing a class that implements the Annotator abstract class in the auto_annotator module or through the command-line. We will demonstrate both approaches for annotation and unannotation using the MultiLingualAnnotator class.

Annotate

By default, all entity types supported by an Annotator will be annotated if they do not overlap with existing annotations.

You can annotate using the command-line. To overwrite existing annotations that overlap with new annotations, pass in the optional param --overwrite.

mindmeld annotate --app-path "hr_assistant" --overwrite

Alternatively, you can annotate by creating an instance of the Annotator class and running the Python code below. An optional param overwrite can be passed in here as well.

from mindmeld.auto_annotator import MultiLingualAnnotator
annotation_rules = [
        {
                "domains": ".*",
                "intents": ".*",
                "files": ".*",
                "entities": ".*",
        }
]
mla = MultiLingualAnnotator(
        app_path="hr_assistant",
        annotation_rules=annotation_rules,
        overwrite=True
)
mla.annotate()

If you do not want to annotate all supported entities, you can specify annotation rules instead.

For example, let's annotate sys_person entities from the get_hierarchy_up intent in the hierarchy domain. To do this, we can add the following AUTO_ANNOTATOR_CONFIG dictionary to config.py. Notice that we are setting overwrite to True since we want to replace the existing custom entity label, name.

AUTO_ANNOTATOR_CONFIG = {

        "annotator_class": "MultiLingualAnnotator",
        "overwrite": True,
        "annotation_rules": [
                {
                        "domains": "hierarchy",
                        "intents": "get_hierarchy_up",
                        "files": "train.txt",
                        "entities": "sys_person",
                }
        ],
        "unannotate_supported_entities_only": True,
        "unannotation_rules": None
}

Before running the annotation, let's take a look at the first four queries in the train.txt file for the get_hierarchy_up intent:

I wanna get a list of all of the employees that are currently {manage|manager} {caroline|name}
I wanna know {Tayana Jeannite|name}'s person in {leadership|manager} of her?
is it correct to say that {Angela|name} is a {boss|manager}?
who all is {management|manager} of {tayana|name}

After running annotate we find that instances of sys_person have been labelled and have overwritten previous instances of the custom entity, name.

I wanna get a list of all of the employees that are currently {manage|manager} {caroline|sys_person}
I wanna know {Tayana Jeannite|sys_person}'s person in {leadership|manager} of her?
is it correct to say that {Angela|sys_person} is a {boss|manager}?
who all is {management|manager} of {tayana|sys_person}

You can annotate with multiple annotation rules. For more details on annotation rules please read the "Auto Annotator Configuration" section below.

Unannotate

By default, only the entities that are supported by an Annotator will be unannotated.

You can unannotate using the command-line. To unannotate all entities, pass in the optional param --unannotate_all.

mindmeld unannotate --app-path "hr_assistant" --unannotate_all

To unannotate by creating an instance of the Annotator class, run the Python code below. To unannotate all annotations, use the unannotation_rules shown below and set unannotate_supported_entities_only to False.

from mindmeld.auto_annotator import MultiLingualAnnotator
unannotation_rules = [
        {
                "domains": ".*",
                "intents": ".*",
                "files": ".*",
                "entities": ".*",
        }
]
mla = MultiLingualAnnotator(
        app_path="hr_assistant",
        unannotation_rules=unannotation_rules,
        unannotate_supported_entities_only=False
)
mla.unannotate()

If you see the following message, you need to update the unannotation_rules parameter in your custom AUTO_ANNOTATOR_CONFIG dictionary in config.py. You can refer to the config specifications in the "Auto Annotator Configuration" section below.

'unannotate' field is not configured or misconfigured in the `config.py`. We can't find any file to unannotate.

If you do not want to unannotate all entities, you can specify rules to be used for unannotation in the unannotation_rules param of your config. For example, let's unannotate sys_time entities from the get_date_range_aggregate intent in the date domain. To do this, we can add the following AUTO_ANNOTATOR_CONFIG dictionary to config.py.

AUTO_ANNOTATOR_CONFIG = {

        "annotator_class": "MultiLingualAnnotator",
        "overwrite": False,
        "annotation_rules": [{"domains": ".*", "intents": ".*", "files": ".*", "entities": ".*"}],
        "unannotate_supported_entities_only": True,
        "unannotation_rules": [
                {
                        "domains": "date",
                        "intents": "get_date_range_aggregate",
                        "files": "train.txt",
                        "entities": "sys_time",
                }
        ],
}

Note

The content of annotation_rules in the config has no effect on unannotation. Similarly, unannotation_rules in the config has no effect on annotation. These processes are independent and are only affected by the corresponding parameter in the config.

Before running the unannotation, let's take a look at the first four queries in the train.txt file for the get_date_range_aggregate intent:

{sum|function} of {non-citizen|citizendesc} people {hired|employment_action} {after|date_compare} {2005|sys_time}
What {percentage|function} of employees were {born|dob} {before|date_compare} {1992|sys_time}?
{us citizen|citizendesc} people with {birthday|dob} {before|date_compare} {1996|sys_time} {count|function}
{count|function} of {eligible non citizen|citizendesc} workers {born|dob} {before|date_compare} {1994|sys_time}

After running unannotate we find that instances of sys_time have been unannotated as expected.

{sum|function} of {non-citizen|citizendesc} people {hired|employment_action} {after|date_compare} 2005
What {percentage|function} of employees were {born|dob} {before|date_compare} 1992?
{us citizen|citizendesc} people with {birthday|dob} {before|date_compare} 1996 {count|function}
{count|function} of {eligible non citizen|citizendesc} workers {born|dob} {before|date_compare} 1994

Default Auto Annotator: MultiLingual Annotator

The mindmeld.auto_annotator module contains an abstract Annotator class. This class serves as a base class for any MindMeld Annotator including the MultiLingualAnnotator class. The MultiLingualAnnotator leverages Spacy's Named Entity Recognition system and duckling to detect entities.

Supported Entities and Languages

Up to 21 entities are supported across 15 languages. The table below defines these entities and whether they are resolvable by duckling.

Supported Entities      Resolvable by Duckling   Examples or Definition
"sys_time"              Yes                      "today", "Tuesday, Feb 18", "last week"
"sys_interval"          Yes                      "from 9:30 to 11:00am", "Monday to Friday", "Tuesday 3pm to Wednesday 7pm"
"sys_duration"          Yes                      "2 hours", "15 minutes", "3 days"
"sys_number"            Yes                      "58", "two hundred", "1,394,345.45"
"sys_amount-of-money"   Yes                      "ten dollars", "seventy-eight euros", "$58.67"
"sys_distance"          Yes                      "500 meters", "498 miles", "47.5 inches"
"sys_weight"            Yes                      "400 pound", "3 grams", "47.5 mg"
"sys_ordinal"           Yes                      "3rd place" ("3rd"), "fourth street" ("fourth"), "5th"
"sys_percent"           Yes                      "four percent", "12%", "5 percent"
"sys_org"               No                       "Cisco", "IBM", "Google"
"sys_loc"               No                       "Europe", "Asia", "the Alps", "Pacific ocean"
"sys_person"            No                       "Blake Smith", "Julia", "Andy Neff"
"sys_gpe"               No                       "California", "FL", "New York City", "USA"
"sys_norp"              No                       Nationalities or religious or political groups.
"sys_fac"               No                       Buildings, airports, highways, bridges, etc.
"sys_product"           No                       Objects, vehicles, foods, etc. (Not services.)
"sys_event"             No                       Named hurricanes, battles, wars, sports events, etc.
"sys_law"               No                       Named documents made into laws.
"sys_language"          No                       Any named language.
"sys_work-of-art"       No                       Titles of books, songs, etc.
"sys_other-quantity"    No                       "10 joules", "30 liters", "15 tons"

Supported languages include English (en), Spanish (es), French (fr), German (de), Danish (da), Greek (el), Portuguese (pt), Lithuanian (lt), Norwegian Bokmal (nb), Romanian (ro), Polish (pl), Italian (it), Japanese (ja), Chinese (zh), Dutch (nl). The table below identifies the supported entities for each language.

Entity                EN  ES  FR  DE  DA  EL  PT  LT  NB  RO  PL  IT  JA  ZH  NL
sys_amount-of-money   y   y   y   n   n   n   y   n   y   y   n   n   y   y   y
sys_distance          y   y   y   y   n   n   y   n   n   y   n   y   n   y   y
sys_duration          y   y   y   y   y   y   y   y   y   y   y   y   y   y   y
sys_event             y   n   n   n   n   y   n   n   n   y   n   n   y   y   y
sys_fac               y   n   n   n   n   n   n   n   n   y   n   n   y   y   y
sys_gpe               y   n   n   n   n   y   n   y   n   y   y   n   y   y   y
sys_interval          y   y   y   y   y   y   y   n   y   y   y   y   n   y   y
sys_language          y   n   n   n   n   n   n   n   n   y   n   n   y   y   y
sys_law               y   n   n   n   n   n   n   n   n   n   n   n   y   y   y
sys_loc               y   y   y   y   y   y   y   y   y   y   n   y   y   y   y
sys_norp              y   n   n   n   n   n   n   n   n   y   n   n   y   y   y
sys_number            y   y   y   y   y   y   y   n   y   y   y   y   y   y   y
sys_ordinal           y   y   y   y   y   y   y   n   y   y   y   y   y   y   y
sys_org               y   y   y   y   y   y   y   y   y   y   y   y   y   y   y
sys_other-quantity    y   n   n   n   n   n   n   n   n   y   n   n   y   y   y
sys_percent           y   n   n   n   n   n   n   n   n   n   n   n   y   y   y
sys_person            y   y   y   y   y   y   y   y   y   y   y   y   y   y   y
sys_product           y   n   n   n   n   y   n   n   n   y   n   n   y   y   y
sys_time              y   y   y   y   y   y   y   y   y   y   y   y   n   y   y
sys_weight            y   n   n   n   n   n   n   n   n   y   n   n   y   y   y
sys_work_of_art       y   n   n   n   n   n   n   n   n   y   n   n   y   y   y

Working with English Sentences

To detect entities in a single sentence, first create an instance of the MultiLingualAnnotator class. If a language is not specified in LANGUAGE_CONFIG (config.py), English is used by default.

from mindmeld.auto_annotator import MultiLingualAnnotator
mla = MultiLingualAnnotator(app_path="hr_assistant")

Then use the parse() function.

mla.parse("Apple stock went up $10 last monday.")

Three entities are automatically recognized and a list of QueryEntity objects is returned. Each QueryEntity represents a detected entity:

[
        <QueryEntity 'Apple' ('sys_org') char: [0-4], tok: [0-0]>,
        <QueryEntity '$10' ('sys_amount-of-money') char: [20-22], tok: [4-4]>,
        <QueryEntity 'last monday' ('sys_time') char: [24-34], tok: [5-6]>
]

The Auto Annotator detected "Apple" as sys_org. Moreover, it recognized "$10" as sys_amount-of-money and resolved its value as 10 and unit as "$". Lastly, it recognized "last monday" as sys_time and resolved its value to a timestamp for the most recent Monday before the current date.

To restrict the types of entities returned from the parse() method use the entity_types parameter and pass in a list of entities to restrict parsing to. By default, all entities are allowed. For example, we can restrict the output of the previous example by doing the following:

allowed_entities = ["sys_org", "sys_amount-of-money", "sys_time"]
sentence = "Apple stock went up $10 last monday."
mla.parse(sentence=sentence, entity_types=allowed_entities)

Working with Non-English Sentences

The MultiLingualAnnotator will use the language and locale specified in the LANGUAGE_CONFIG (config.py) if it is used through the command-line.

LANGUAGE_CONFIG = {'language': 'es'}

Many Spacy non-English NER models have limited entity support. To overcome this, in addition to the entities detected by non-English NER models, the MultiLingualAnnotator translates the sentence to English and detects entities using the English NER model. The entities detected in English are compared against duckling candidates for the non-English sentence. Duckling candidates with a match between the type and value of the entity or the translated body text are selected. If a translation service is not available, the MultiLingualAnnotator selects the duckling candidates with the largest non-overlapping spans. The sections below describe the steps to set up the annotator depending on whether a translation service is being used.
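The fallback used when no translation service is available, keeping the candidates with the largest non-overlapping spans, can be illustrated in plain Python. This is only a sketch: the greedy longest-first strategy, the select_largest_non_overlapping helper and the candidate tuples are assumptions made for illustration and are not taken from MindMeld's implementation.

```python
from typing import List, Tuple

# (start, end, entity_type), with inclusive character offsets.
Candidate = Tuple[int, int, str]


def select_largest_non_overlapping(candidates: List[Candidate]) -> List[Candidate]:
    """Greedily keep the longest candidates that do not overlap one another."""
    chosen: List[Candidate] = []
    # Longest span first; ties are broken by the earlier start offset.
    for cand in sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0])):
        start, end, _ = cand
        if all(end < s or start > e for s, e, _ in chosen):
            chosen.append(cand)
    return sorted(chosen)


# Illustrative candidates for "Las acciones de Apple subieron $10 el lunes pasado."
candidates = [
    (31, 33, "sys_amount-of-money"),  # "$10"
    (32, 33, "sys_number"),           # "10", nested inside "$10"
    (35, 49, "sys_time"),             # "el lunes pasado"
    (38, 42, "sys_time"),             # "lunes", nested inside the span above
]
selected = select_largest_non_overlapping(candidates)
print(selected)
```

In this sketch the nested "10" and "lunes" candidates are dropped in favor of the longer spans that contain them.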

Annotating with a Translation Service (Google)

The MultiLingualAnnotator can leverage the Google Translation API to better detect entities in non-English sentences. To use this feature, export your Google application credentials.

export GOOGLE_APPLICATION_CREDENTIALS="/<YOUR_PATH>/google_application_credentials.json"

Install the extras requirements for annotators.

pip install mindmeld[language_annotator]

Finally, specify the translator in AUTO_ANNOTATOR_CONFIG. Set translator to GoogleTranslator.

Annotating without a Translation Service

We can still use the MultiLingualAnnotator without a translation service. To do so, set translator to NoOpTranslator in AUTO_ANNOTATOR_CONFIG.
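The two setups differ only in the value of translator. As a minimal sketch, config.py could pick the value based on whether the credentials variable has been exported. The choose_translator helper is hypothetical; the remaining keys mirror the default configuration described in the "Auto Annotator Configuration" section.

```python
import os


def choose_translator(environ=os.environ) -> str:
    """Pick the translator name based on whether Google credentials are exported."""
    if environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "GoogleTranslator"
    return "NoOpTranslator"


AUTO_ANNOTATOR_CONFIG = {
    "annotator_class": "MultiLingualAnnotator",
    "overwrite": False,
    "annotation_rules": [
        {"domains": ".*", "intents": ".*", "files": ".*", "entities": ".*"}
    ],
    "unannotate_supported_entities_only": True,
    "unannotation_rules": None,
    # "GoogleTranslator" when credentials are exported, "NoOpTranslator" otherwise.
    "translator": choose_translator(),
}
```

Setting the value by hand works just as well; the helper only avoids editing config.py when moving between environments.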

Spanish Sentence Example

Let's take a look at an example of the MultiLingualAnnotator detecting entities in Spanish sentences. To use a Spanish MindMeld application we can download the Screening App blueprint with the following command:

mindmeld blueprint screening_app

We can now create our MultiLingualAnnotator object and pass in the app_path. If a Spanish Spacy model is not found in the environment, it will automatically be downloaded.

from mindmeld.auto_annotator import MultiLingualAnnotator
mla = MultiLingualAnnotator(
        app_path="screening_app",
        language="es",
        locale=None,
)

Then use the parse() function.

mla.parse("Las acciones de Apple subieron $10 el lunes pasado.")

Three entities are automatically recognized.

[
        <QueryEntity 'Apple' ('sys_org') char: [16-20], tok: [3-3]>,
        <QueryEntity 'el lunes pasado' ('sys_time') char: [35-49], tok: [6-8]>,
        <QueryEntity '$10' ('sys_amount-of-money') char: [31-33], tok: [5-5]>
]

Auto Annotator Configuration

The DEFAULT_AUTO_ANNOTATOR_CONFIG shown below is the default config for an Annotator. A custom config can be included in config.py by duplicating the default config and renaming it to AUTO_ANNOTATOR_CONFIG. Alternatively, a custom config dictionary can be passed in directly to MultiLingualAnnotator or any Annotator class upon instantiation.

DEFAULT_AUTO_ANNOTATOR_CONFIG = {

        "annotator_class": "MultiLingualAnnotator",
        "overwrite": False,
        "annotation_rules": [
                {
                        "domains": ".*",
                        "intents": ".*",
                        "files": ".*",
                        "entities": ".*",
                }
        ],
        "unannotate_supported_entities_only": True,
        "unannotation_rules": None,
}

Let's take a look at the allowed values for each setting in an Auto Annotator configuration.

'annotator_class' (str): The class in auto_annotator.py to use for annotation when invoked from the command line. By default, MultiLingualAnnotator is used.

'overwrite' (bool): Whether new annotations should overwrite existing annotations in the case of a span conflict. False by default.
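The effect of overwrite on a span conflict can be sketched in plain Python. The helpers below are hypothetical and only model the behavior described above: a new annotation that overlaps an existing one is skipped unless overwrite is set, in which case it replaces the overlapping annotation.

```python
def spans_overlap(a, b):
    """Return True if two inclusive (start, end) character spans overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def merge_annotations(existing, new, overwrite=False):
    """Combine existing and new (start, end, label) annotations."""
    result = list(existing)
    for ann in new:
        conflicts = [old for old in result if spans_overlap(old[:2], ann[:2])]
        if conflicts and not overwrite:
            continue  # keep the existing annotation, drop the new one
        result = [old for old in result if old not in conflicts]
        result.append(ann)
    return sorted(result)


# Offsets refer to the plain text "who all is management of tayana".
existing = [(11, 20, "manager"), (25, 30, "name")]
new = [(25, 30, "sys_person")]

print(merge_annotations(existing, new))                  # "name" is kept
print(merge_annotations(existing, new, overwrite=True))  # "name" is replaced
```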

'annotation_rules' (list): A list of annotation rules where each rule is represented as a dictionary. Each rule must have four keys: domains, intents, files, and entities. Annotation rules are combined internally to create Regex patterns to match selected files. The pattern '.*' can be used if all possibilities in a section are to be selected, while possibilities within a section are expressed with the usual Regex special characters, such as '.' for any single character and '|' to represent "or".

{
        "domains": "(faq|salary)",
        "intents": ".*",
        "files": "(train.txt|test.txt)",
        "entities": "(sys_amount-of-money|sys_time)",
}

The rule above would annotate all text files named "train" or "test" in the "faq" and "salary" domains. Only sys_amount-of-money and sys_time entities would be annotated. Internally, the above rule is combined into a single pattern: "(faq|salary)/.*/(train.txt|test.txt)" and this pattern is matched against all file paths in the domain folder of your MindMeld application.
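To see which files a rule selects, the combined pattern can be tried out with Python's re module. This is a sketch under assumptions: the rule_to_pattern helper, the example paths (including the intent names) and the use of a full match on paths relative to the domain folder are illustrative, not MindMeld's internal code.

```python
import re


def rule_to_pattern(rule):
    """Join the domain, intent and file parts of a rule into one path regex."""
    return "/".join([rule["domains"], rule["intents"], rule["files"]])


rule = {
    "domains": "(faq|salary)",
    "intents": ".*",
    "files": "(train.txt|test.txt)",
    "entities": "(sys_amount-of-money|sys_time)",
}
pattern = re.compile(rule_to_pattern(rule))

# Hypothetical file paths, relative to the app's domain folder.
paths = [
    "faq/generic/train.txt",
    "salary/get_salary/test.txt",
    "date/get_date/train.txt",
    "faq/generic/custom.txt",
]
matched = [p for p in paths if pattern.fullmatch(p)]
print(pattern.pattern)
print(matched)
```

Note that an unescaped "." in a file name such as train.txt matches any character; writing train\.txt in a raw string restricts the match to a literal dot.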

Warning

The order of the annotation rules matters. Each rule overwrites the list of entities to annotate for a file if the two rules include the same file. It is good practice to start with more generic rules first and then have more specific rules. Be sure to use the regex "or" (|) if applying rules at the same level of specificity. Otherwise, if written as separate rules, the latter will overwrite the former.
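The ordering effect can be sketched as follows. The entities_per_file helper and the file paths are hypothetical; the sketch only models the behavior described in this warning, where the last rule that matches a file decides which entities are annotated in it.

```python
import re


def entities_per_file(rules, paths):
    """Map each file path to the entity pattern of the last rule matching it."""
    selected = {}
    for rule in rules:  # later rules replace earlier ones for the same file
        file_pattern = re.compile(
            "/".join([rule["domains"], rule["intents"], rule["files"]])
        )
        for path in paths:
            if file_pattern.fullmatch(path):
                selected[path] = rule["entities"]
    return selected


generic = {"domains": ".*", "intents": ".*", "files": ".*", "entities": "sys_time"}
specific = {
    "domains": "salary",
    "intents": ".*",
    "files": "train.txt",
    "entities": "sys_amount-of-money",
}
paths = ["salary/get_salary/train.txt", "date/get_date/train.txt"]

generic_first = entities_per_file([generic, specific], paths)
specific_first = entities_per_file([specific, generic], paths)
print(generic_first)
print(specific_first)
```

With the generic rule first, the salary file ends up with sys_amount-of-money only. With the order reversed, the generic rule replaces the specific one and every file is annotated with sys_time only.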

Warning

By default, all files in all intents across all domains will be annotated with all supported entities. Before annotating consider including custom annotation rules in config.py.

'language' (str): Language as specified using an ISO 639-1 or ISO 639-2 code.

'locale' (str): The locale, made up of the ISO 639-1 language code and the ISO 3166 alpha-2 country code separated by an underscore character (for example, en_US).

'unannotate_supported_entities_only' (bool): Whether unannotation is limited to the entity types that the Annotator can annotate. True by default, so when the unannotate command is used only entities that the Annotator can annotate are eligible for removal.
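As a rough illustration of this setting, the sketch below filters the entity types found in a file down to those eligible for removal. The removable_entities helper and the supported set are stand-ins chosen for the example and do not reflect MindMeld's internals.

```python
import re


def removable_entities(labelled, rule_entities, supported, supported_only=True):
    """Return the labelled entity types that are eligible for unannotation."""
    pattern = re.compile(rule_entities)
    eligible = [e for e in labelled if pattern.fullmatch(e)]
    if supported_only:
        eligible = [e for e in eligible if e in supported]
    return eligible


labelled = ["function", "citizendesc", "date_compare", "sys_time"]
supported = {"sys_time", "sys_person"}  # stand-in for an Annotator's supported types

print(removable_entities(labelled, ".*", supported))
print(removable_entities(labelled, ".*", supported, supported_only=False))
```

With the default, custom entities such as function survive a ".*" rule; they are only removed once the setting is turned off.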

'unannotation_rules' (list): List of annotation rules in the same format as those used for annotation. These rules specify which entities should have their annotations removed. None by default.

'spacy_model_size' (str): lg is used by default for the best performance. Alternative options are sm and md. This parameter is optional and is specific to the use of the SpacyAnnotator and MultiLingualAnnotator. If the selected model is not in the current environment it will automatically be downloaded. Refer to Spacy's documentation to learn more about their NER models.

'translator' (str): This parameter is used by the MultiLingualAnnotator. If Google application credentials are available and have been exported, set this parameter to GoogleTranslator. Otherwise, set this parameter to NoOpTranslator.

Using the Bootstrap Annotator

The BootstrapAnnotator speeds up the annotation of new queries. When a BootstrapAnnotator is instantiated, a NaturalLanguageProcessor is built for your app and, for each intent, an entity recognizer is trained on the existing labeled data. The BootstrapAnnotator then uses the trained entity recognizer of each intent to predict and label the entities in that intent's new queries. This requires that your app already has labeled queries.

First, ensure that the files you would like to label share the same name or pattern. For example, you may name your files train_bootstrap.txt across all intents.

Update the annotator_class field in your AUTO_ANNOTATOR_CONFIG to be BootstrapAnnotator and set your annotation rules to include your desired patterns. You can optionally set the confidence_threshold for labeling in the config as shown below. For this example, we will set it to 0.95. This means that entities will only be labeled if the entity recognizer assigns a confidence score over 95% to the entity.

AUTO_ANNOTATOR_CONFIG = {
        "annotator_class": "BootstrapAnnotator",
        "confidence_threshold": 0.95,
        ...
        "annotation_rules": [
                {
                        "domains": ".*",
                        "intents": ".*",
                        "files": r".*bootstrap.*\.txt",
                        "entities": ".*",
                }
        ],
}
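The role of confidence_threshold can be sketched as a simple filter over predicted entities. The filter_by_confidence helper and the scores below are made up for illustration; the real scores come from the entity recognizer trained for each intent.

```python
def filter_by_confidence(predictions, confidence_threshold=0.95):
    """Keep only predicted entities whose score is above the threshold."""
    return [
        (text, entity_type)
        for text, entity_type, score in predictions
        if score > confidence_threshold
    ]


# (text, entity type, confidence score); the scores are invented for this sketch.
predictions = [
    ("tayana", "name", 0.99),
    ("boss", "manager", 0.97),
    ("her", "name", 0.62),
]
labelled = filter_by_confidence(predictions)
print(labelled)
```

A higher threshold leaves more entities unlabelled for manual review, while a lower one labels more entities at the cost of more mistakes.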

Check your ENTITY_RECOGNIZER_CONFIG in config.py. Make sure that you explicitly specify the regex patterns for training and testing and that these patterns do not match the files that hold your unlabeled data (e.g. train_bootstrap.txt).

ENTITY_RECOGNIZER_CONFIG = {
        ...
        'train_label_set': r'train.*\.txt',
        'test_label_set': r'test.*\.txt'
}
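One way to check for overlap is to test your file names against the label-set patterns directly. The sketch below assumes the patterns are applied with re.match to the bare file name, which may differ from how MindMeld applies them, and the matching_label_sets helper is hypothetical. Under that assumption, a pattern such as train.*\.txt also matches a file named train_bootstrap.txt, so it is worth running a check like this against your actual names and patterns.

```python
import re


def matching_label_sets(file_name, label_sets):
    """Return the names of the label-set patterns that match a file name."""
    return [
        name
        for name, pattern in label_sets.items()
        if re.match(pattern, file_name)
    ]


label_sets = {
    "train_label_set": r"train.*\.txt",
    "test_label_set": r"test.*\.txt",
}

for file_name in ["train.txt", "train_bootstrap.txt", "bootstrap_queries.txt"]:
    print(file_name, matching_label_sets(file_name, label_sets))
```

An empty list means the file is not picked up by either label set.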

To run from the command line:

mindmeld annotate --app-path "hr_assistant"

Alternatively, you can annotate by creating an instance of the BootstrapAnnotator class and running the Python code below. An optional param overwrite can be passed in here as well.

from mindmeld.auto_annotator import BootstrapAnnotator
annotation_rules = [
        {
                "domains": ".*",
                "intents": ".*",
                "files": r".*bootstrap.*\.txt",
                "entities": ".*",
        }
]
ba = BootstrapAnnotator(
        app_path="hr_assistant",
        annotation_rules=annotation_rules,
        confidence_threshold=0.95,
)
ba.annotate()

Note

The Bootstrap Annotator is different from the predict command-line function. Running python -m hr_assistant predict train_bootstrap.txt -o labeled.tsv will output a TSV file with annotated queries. Unlike the Bootstrap Annotator, the predict command only annotates a single file and does not use the entity recognizer of a specific intent. Instead, it uses the intent classified by nlp.process(query_text).

Creating a Custom Annotator

The MultiLingualAnnotator is a subclass of the abstract base class Annotator. The functionality for annotating and unannotating files is contained in Annotator itself. A developer simply needs to implement two methods to create a custom annotator.

Custom Annotator Boilerplate Code

This section includes boilerplate code for a CustomAnnotator class, which you can add to your own Python file; let's call it custom_annotator.py. There are two "TODO"s: to implement a CustomAnnotator class, a developer has to implement the parse() method and the supported_entity_types property.

from mindmeld.auto_annotator import Annotator


class CustomAnnotator(Annotator):
        """ Custom Annotator class used to generate annotations.
        """

        def __init__(
                self,
                app_path,
                annotation_rules=None,
                language=None,
                locale=None,
                overwrite=False,
                unannotate_supported_entities_only=True,
                unannotation_rules=None,
                custom_param=None,
        ):
                super().__init__(
                        app_path,
                        annotation_rules=annotation_rules,
                        language=language,
                        locale=locale,
                        overwrite=overwrite,
                        unannotate_supported_entities_only=unannotate_supported_entities_only,
                        unannotation_rules=unannotation_rules,
                )
                self.custom_param = custom_param
                # Add additional params to init if needed

        def parse(self, sentence, entity_types=None, **kwargs):
                """
                Args:
                        sentence (str): Sentence to detect entities.
                        entity_types (list): List of entity types to parse. If None, all
                                possible entity types will be parsed.
                Returns:
                        query_entities (list[QueryEntity]): List of QueryEntity objects.
                """

                # TODO: Add custom parse logic that fills query_entities
                query_entities = []

                return query_entities

        @property
        def supported_entity_types(self):
                """
                Returns:
                        supported_entity_types (list): List of supported entity types.
                """

                # TODO: Add the entities supported by CustomAnnotator to supported_entities (list)

                supported_entities = []
                return supported_entities

if __name__ == "__main__":
        annotation_rules = [
                {
                        "domains": ".*",
                        "intents": ".*",
                        "files": ".*",
                        "entities": ".*",
                }
        ]
        custom_annotator = CustomAnnotator(
                app_path="hr_assistant",
                annotation_rules=annotation_rules,
        )
        custom_annotator.annotate()

To run your custom Annotator, simply run in the command line: python custom_annotator.py. To run unannotation with your custom Annotator, change the last line in your script to custom_annotator.unannotate().

Getting Custom Parameters from the Config

spacy_model_size is an example of an optional parameter in the config that is relevant only for a specific Annotator class.

AUTO_ANNOTATOR_CONFIG = {
        ...
        "spacy_model_size": "md",
        ...
}

If a SpacyAnnotator is created using the command-line, it will use the value for spacy_model_size that exists in the config during instantiation.

A similar approach can be taken for custom Annotators.
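As a rough sketch of that approach, a custom class can accept its own keyword argument and fall back to a default when the config does not set it. How MindMeld hands config values to the constructor is not described in this section; the sketch assumes they arrive as keyword arguments, and SketchAnnotator and build_annotator are stand-ins rather than MindMeld classes or functions.

```python
class SketchAnnotator:
    """Stand-in for a custom Annotator; not a MindMeld class."""

    def __init__(self, app_path, custom_param="default", **kwargs):
        self.app_path = app_path
        self.custom_param = custom_param
        self.extra = kwargs  # remaining config values, unused in this sketch


def build_annotator(app_path, config):
    """Pass every config entry except the class name to the constructor."""
    params = {k: v for k, v in config.items() if k != "annotator_class"}
    return SketchAnnotator(app_path, **params)


config = {
    "annotator_class": "SketchAnnotator",
    "overwrite": False,
    "custom_param": "value",
}
annotator = build_annotator("hr_assistant", config)
print(annotator.custom_param)
```

Giving the parameter a default keeps the class usable when the config omits it.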