Combining ner.teach with patterns file and manual correction of spans

philgooch · September 9, 2020, 12:35pm

Hi there

Prodigy newbie here after spending much time creating training data using code and labelling functions.

I'm using ner.teach along with a patterns file to annotate text for a new NER model. This is working pretty well, but I'd really like to correct some of the partial spans suggested by Prodigy. For example, Prodigy suggests a partial span for a URL, and I'd like to be able to change that span boundary within Prodigy to include the full URL.

In some cases, it might be that I need to modify the default spaCy tokenization model (rather than correct in Prodigy), but in other cases, I might want to extend the suggested span to include additional tokens in Prodigy.

Apologies if this has been covered elsewhere.

Many thanks!

Phil

ines · September 10, 2020, 12:17pm

Hi! What you describe is definitely possible – it might just take some experimentation to find the best configuration that produces the most useful results. For example, whether to update the model with complete or incomplete annotations, how to incorporate patterns (if at all), etc.

It probably makes sense to start off with a recipe like ner.correct that adds entities to the outgoing texts and lets you edit them manually, and then add the update functionality to it that updates the model in the loop. Essentially, all you need to add here is an update callback that calls nlp.update with the answers received back from the app. Here's a template recipe that shows how it works:

github.com

explosion/prodigy-recipes/blob/master/ner/ner_make_gold.py

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string, set_hashes
import spacy
import copy
from typing import List, Optional


def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    # Process the stream using spaCy's nlp.pipe, which yields doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            # Continue if predicted entity is not selected in labels
            if labels and ent.label_ not in labels:

This file has been truncated.
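
To make the "update callback" idea above a bit more concrete, here is a minimal sketch of the data-handling half only. It assumes the answers coming back from the app are dicts with "text", "spans" (each with "start", "end" and "label") and "answer" keys. The helper name `answers_to_train_data` is hypothetical and not part of Prodigy or spaCy, and the sketch deliberately uses no library calls.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]


def answers_to_train_data(answers: List[dict]) -> List[Tuple[str, Dict[str, List[Entity]]]]:
    """Hypothetical helper: turn accepted tasks into (text, annotations) tuples.

    Assumes each answer looks like:
    {"text": ..., "spans": [{"start": ..., "end": ..., "label": ...}], "answer": "accept"}
    """
    examples = []
    for eg in answers:
        # Skip rejected or ignored tasks: their spans were not confirmed.
        if eg.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        examples.append((eg["text"], {"entities": entities}))
    return examples


# Made-up answers, in the assumed shape described above.
answers = [
    {
        "text": "See https://example.com/docs for details",
        "spans": [{"start": 4, "end": 28, "label": "URL"}],
        "answer": "accept",
    },
    {"text": "Nothing here", "spans": [], "answer": "reject"},
]
train_data = answers_to_train_data(answers)
```

Inside an update callback, tuples of this shape could then be split into texts and annotation dicts and passed on to nlp.update. Whether the unannotated tokens should count as "outside an entity" is a separate decision, which the first question below is about.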

A few questions that you'll have to ask yourself and that your final workflow will depend on:

1. Update from complete gold-standard annotations or partial annotations

A text may have multiple entities of different types, so do you expect the annotation you send back to be complete? Or, phrased differently, should all unannotated tokens be considered "not part of an entity"? This changes the way you update the model.

If you expect the annotations to be complete, updating is easier because all unannotated tokens are considered O (outside an entity). But if that's the strategy, you'll need to make sure that this is always true for the annotations you collect – otherwise, you'll update the model with incorrect information and it'll learn the wrong thing.

If you don't expect the annotations to be complete, you need to make sure that the model is updated with "-" or None values for the tokens you don't have an answer for. For example (see here for examples of updating with Doc/GoldParse):

from spacy.gold import GoldParse  # spaCy v2.x

# We only know that "Facebook" is a U-ORG, but nothing about the other tokens
gold = GoldParse(doc, entities=["U-ORG", "-", "-", "-", "-"])
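
As a rough illustration of the difference between the two strategies, the sketch below builds BILUO tags from character-offset spans and fills every unannotated token with a configurable "missing" value: "-" for partial annotations, "O" for complete ones. It is plain Python, not spaCy's own conversion code. The function name `partial_biluo_tags` is hypothetical, and whitespace tokenization is an assumption made purely to keep the example self-contained.

```python
from typing import List, Tuple


def partial_biluo_tags(
    text: str, spans: List[Tuple[int, int, str]], missing: str = "-"
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: whitespace-tokenize `text` and tag it in BILUO style.

    `spans` are (start_char, end_char, label) tuples. Tokens not covered by any
    span get `missing`: "-" means "unknown", "O" means "outside an entity".
    """
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end

    tags = [missing] * len(tokens)
    for span_start, span_end, label in spans:
        # Only tokens lying fully inside the span are tagged.
        covered = [i for i, (s, e) in enumerate(offsets) if s >= span_start and e <= span_end]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label
        else:
            tags[covered[0]] = "B-" + label
            for i in covered[1:-1]:
                tags[i] = "I-" + label
            tags[covered[-1]] = "L-" + label
    return tokens, tags


text = "Facebook released a new feature"  # made-up example sentence
tokens, partial = partial_biluo_tags(text, [(0, 8, "ORG")])                # unknown elsewhere
_, complete = partial_biluo_tags(text, [(0, 8, "ORG")], missing="O")      # gold-standard
```

With missing="-", the result matches the entities list in the GoldParse example above. With missing="O", the same span is treated as a complete annotation, which is only safe if every entity in the text really has been labelled.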

2. Incorporate patterns or not

In ner.teach, patterns are used to pre-select interesting examples and to make sure your model sees enough positive examples in the beginning. If you're showing all entities predicted by the model, it's unclear how the patterns should fit in. Should they be added on top, and if so, which entities take precedence if there are overlaps? If they're presented separately, what do they "mean" in terms of updating the model? Is the model updated differently from corrected matches?

I'd probably recommend not using patterns at all in this approach: they can easily be counterproductive, because using them opens up so many other questions.

philgooch · September 11, 2020, 8:34am

Thanks Ines, this is super-helpful. I'll give your suggestions a try. I think ner.correct looks like a good option to start with.

Really grateful for your comprehensive answer! Thanks again

Phil
