Prodigy newbie here after spending much time creating training data using code and labelling functions.
I'm using ner.teach along with a patterns file to annotate text for a new NER model. This is working pretty well, but I'd really like to correct some of the partial spans suggested by Prodigy. For example, Prodigy suggests a partial span for a URL, and I'd like to be able to change that span boundary within Prodigy to include the full URL.
In some cases, it might be that I need to modify the default spaCy tokenizer (rather than correcting in Prodigy), but in other cases I might want to extend the suggested span to include additional tokens in Prodigy.
Hi! What you describe is definitely possible – it might just take some experimentation to find the configuration that produces the most useful results. For example, whether to update the model with complete or incomplete annotations, how to incorporate patterns (if at all), etc.
It probably makes sense to start off with a recipe like ner.correct that adds entities to the outgoing texts and lets you edit them manually, and then add the update functionality to it that updates the model in the loop. Essentially, all you need to add here is an update callback that calls nlp.update with the answers received back from the app. Here's a template recipe that shows how it works:
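Here's a minimal sketch of what such an update callback could look like. The Prodigy-specific recipe wiring (stream loading, tokenization, CLI arguments) is left out, and the function name make_update_callback is just illustrative. It also uses the modern spaCy v3 Example API – at the time of this thread you'd use GoldParse instead, but the idea is the same:

```python
import spacy
from spacy.training import Example


def make_update_callback(nlp):
    # Prodigy calls the "update" component of a recipe with a batch of
    # annotated task dicts; each task has a "text", its "spans" and an
    # "answer" ("accept", "reject" or "ignore").
    def update(answers):
        examples = []
        for task in answers:
            if task.get("answer") != "accept":
                continue  # skip rejected/ignored tasks for simplicity
            doc = nlp.make_doc(task["text"])
            # Treat the accepted spans as complete annotations: every
            # unannotated token is implicitly "O" (see question 1 below)
            ents = [(s["start"], s["end"], s["label"])
                    for s in task.get("spans", [])]
            examples.append(Example.from_dict(doc, {"entities": ents}))
        losses = {}
        if examples:
            nlp.update(examples, losses=losses)
        return losses
    return update


# Inside a custom recipe, you'd return the callback as part of the
# components dict, roughly like this:
# return {"dataset": dataset, "stream": stream, "view_id": "ner_manual",
#         "update": make_update_callback(nlp)}
```

The callback is called as annotated batches come back from the app, so the model in the loop keeps improving while you annotate.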
A few questions that you'll have to ask yourself and that your final workflow will depend on:
1. Update from complete gold-standard annotations or partial annotations
A text may have multiple entities of different types, so do you expect the annotation you send back to be complete? Or, phrased differently, should all unannotated tokens be considered "not part of an entity"? This changes the way you update the model.
If you expect the annotations to be complete, updating is easier because all unannotated tokens are considered O (outside an entity). But if that's the strategy, you'll need to make sure that this is always true for the annotations you collect – otherwise, you'll update the model with incorrect information and it'll learn the wrong thing.
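To make the "all unannotated tokens are O" part concrete, here's what spaCy derives from a complete annotation with a single entity (modern v3 Example API; the example sentence is made up):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("Facebook is hiring new engineers")

# Complete annotation: one ORG entity, every other token implicitly outside
example = Example.from_dict(doc, {"entities": [(0, 8, "ORG")]})
print(example.get_aligned_ner())
# → ['U-ORG', 'O', 'O', 'O', 'O']
```

Every token you didn't annotate becomes a hard "O" label, which is exactly why incomplete data collected this way would teach the model that real entities are non-entities.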
If you don't expect the annotations to be complete, you need to make sure that the model is updated with "-" or None values for the tokens you don't have an answer for. For example (see here for examples of updating with Doc/GoldParse):
from spacy.gold import GoldParse  # spaCy v2 API

# We only know that "Facebook" is a U-ORG, but nothing about the other tokens
doc = nlp("Facebook is hiring new engineers")  # assumes an nlp object; five tokens
gold = GoldParse(doc, entities=["U-ORG", "-", "-", "-", "-"])
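For reference, GoldParse was removed in spaCy v3. As far as I can tell, the same partial update can be expressed there by passing per-token BILUO tags to Example.from_dict, with "-" marking the tokens you have no answer for (the sentence is again illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("Facebook is hiring new engineers")  # five tokens

# "-" marks missing values: those tokens won't contribute to the update
example = Example.from_dict(doc, {"entities": ["U-ORG", "-", "-", "-", "-"]})
```

This way the model is only updated on the one token you actually know something about.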
2. Incorporate patterns or not
In ner.teach, patterns are used to pre-select interesting examples and to make sure your model sees enough positive examples in the beginning. If you're showing all entities predicted by the model, it's unclear how the patterns should fit in. Should they be added on top, and if so, which entities take precedence if there are overlaps? If they're presented separately, what do they "mean" in terms of updating the model? Is the model updated differently from corrected matches?
I'd probably recommend not using patterns at all in this approach: they can easily be counterproductive, and their usage just opens up so many other questions.