Train a new NER entity with multi-word tokens

Hello @ines

thanks a lot for your help. I really appreciate how much time you took to give such a detailed answer. It was very helpful, and I managed to solve my problem by writing a recipe that exports patterns from annotations created via ner.manual.

I will post here the notes to myself that I collected during the process in case someone else runs into the same problem:

Terms

First we have to come up with terms that represent disasters. So I create a new dataset (disaster_terms) and then use Prodigy to find the right terms. I seeded it with terms from the Triple R database, and Prodigy uses the GloVe vectors included in en_core_web_lg to suggest similar terms.

prodigy dataset disaster_terms "Seed terms for DISASTER"
prodigy terms.teach disaster_terms en_core_web_lg --seeds "blackout,flood,earthquake,tsunami,storm,fire,avalanche,thunderstorm,flooding,volcano,blizzard"

Saved 150 annotations to database SQLite
Dataset: disaster_terms
Session ID: 2018-01-17_15-22-24
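
For reference, the accepted terms end up in the dataset as simple one-word records, roughly like this (simplified, the real records carry some extra metadata):

{"text": "wildfire", "answer": "accept"}
{"text": "landslide", "answer": "accept"}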

But I also noticed that some of the volcano terms consist of multiple words, like "volcanic eruption". I will use the ner.manual recipe to collect more of those multi-word terms.

prodigy ner.manual volcano_terms en_core_web_lg "eruption" --api guardian --label "DISASTER"
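
The annotations created this way are stored differently from the terms.teach ones: the highlighted phrase lives in a "spans" list with character offsets into the text, roughly like this (simplified example I made up):

{"text": "A volcanic eruption forced thousands to evacuate.", "spans": [{"start": 2, "end": 19, "label": "DISASTER"}], "answer": "accept"}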

Export the patterns for teaching

Normally I could simply use the terms.to-patterns recipe to export the terms created by terms.teach. However, the multi-word annotations from ner.manual are stored in a different format (as spans, see above), so I had to write myself a little exporter recipe.

import json
import sys

from prodigy.core import recipe, recipe_args
from prodigy.components.db import connect
from prodigy.util import log, prints, write_jsonl

DB = connect()


@recipe('terms.manual-to-patterns',
        dataset=recipe_args['dataset'],
        output_file=recipe_args['output_file'],
        label=recipe_args['label'])
def to_patterns(dataset=None, label=None, output_file=None):
    """
    Convert manual NER annotations to a list of match patterns that can be
    used with ner.match. If no output file is specified, each pattern is
    printed so the recipe's output can be piped forward to ner.match.
    """
    def get_patterns(term):
        # One pattern per annotated span, using the span's label and the
        # exact (possibly multi-word) text it covers
        return [{'label': span['label'],
                 'pattern': term['text'][span['start']:span['end']]}
                for span in term.get('spans', [])]

    log("RECIPE: Starting recipe terms.manual-to-patterns", locals())
    if dataset is None:
        log("RECIPE: Reading input terms from sys.stdin")
        terms = (json.loads(line) for line in sys.stdin)
    else:
        terms = DB.get_dataset(dataset)
        log("RECIPE: Reading {} input terms from dataset {}"
            .format(len(terms), dataset))
    if output_file:
        pattern_lists = [get_patterns(term) for term in terms
                         if term['answer'] == 'accept']
        patterns = sum(pattern_lists, [])
        log("RECIPE: Generated {} patterns".format(len(patterns)))
        write_jsonl(output_file, patterns)
        prints("Exported {} patterns".format(len(patterns)), output_file)
    else:
        log("RECIPE: Outputting patterns")
        for term in terms:
            if term['answer'] == 'accept':
                for pattern in get_patterns(term):
                    print(json.dumps(pattern))
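
Given an accepted annotation like the volcano example above, the recipe emits one pattern per span, with the raw string the span covers as the pattern value. So the output file should contain lines like:

{"label": "DISASTER", "pattern": "volcanic eruption"}

As far as I can tell, ner.match and ner.teach accept such string patterns as exact phrase matches, alongside the token-based patterns that terms.to-patterns produces.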

So here is the final step to prepare all the patterns for teaching. Since terms.manual-to-patterns is a custom recipe, it has to be loaded from its source file with the -F flag (I saved mine as terms_to_patterns.py):

prodigy terms.to-patterns disaster_terms disaster_patterns.jsonl --label DISASTER
prodigy terms.manual-to-patterns volcano_terms volcano_patterns.jsonl -F terms_to_patterns.py
cat disaster_patterns.jsonl volcano_patterns.jsonl > all_disaster_patterns.jsonl
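
With the combined patterns file I can now bootstrap annotation for the actual entity. Something like this should work (disaster_ner is just the name I picked for the new dataset):

prodigy dataset disaster_ner "Annotations for the DISASTER entity"
prodigy ner.teach disaster_ner en_core_web_lg "disaster" --api guardian --label DISASTER --patterns all_disaster_patterns.jsonl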