I have a fairly specific question about using patterns in Prodigy. Say I start annotating a dataset with 7 labels and later merge it with a different dataset that has 8 labels. Now, with 8 labels, I want to incorporate patterns, but I already have 200 annotated entries, and 30 of them contain one of the patterns from my patterns file. Because these 30 entries have already been saved to the server, the pattern will not be recognized in the annotation. I ran a mock test and found no way to add the pattern annotation unless I started over by importing the merged dataset into a new dataset with the saved annotations. That requires manually accepting all previous annotations, which can take some time after 1,000 entries (unless there is a more effective way to do this that I am not aware of!).
Assuming the above logic is correct, because I now have 30 annotations that do not have the full annotation (because they are missing the pattern label), I am wondering how important it is to catch all labels? Will there be a negative impact regarding the specific word not being labelled at all in these 30 instances?
Just to make sure I understand the problem correctly: You have some annotated examples that weren't yet annotated with label 8. You have patterns for label 8, and you want to add those pattern matches to the examples that are already in your dataset.
If you run a recipe like ner.manual with patterns and examples with pre-defined "spans", those spans will be overwritten. That's expected – otherwise, the results would be pretty confusing, and you'd constantly have to resolve overlaps between existing spans and matches, etc.
However, if you have a way to identify the examples with unannotated matches, you can re-annotate only those, or go over them again to only add label 8. One approach would be to use spaCy's Matcher or PhraseMatcher directly, add your patterns for label 8, loop over the examples in your dataset and match on each text. This lets you extract all annotated examples that contain matches for label 8. You can then queue them up again for annotation.
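As a rough sketch of that idea, the snippet below uses spaCy's Matcher to pick out examples whose text matches a label 8 pattern but that don't have a span with that label yet. The function name, the "LABEL_8" label, the token pattern and the inline examples are all made up for illustration. It also assumes English text and the `matcher.add(key, patterns)` signature from recent spaCy versions. In practice you'd load the examples from your dataset instead of hard-coding them.

```python
import spacy
from spacy.matcher import Matcher


def find_unannotated_matches(examples, patterns, label):
    """Return examples whose text matches a pattern for `label`
    but that have no annotated span with that label yet."""
    nlp = spacy.blank("en")  # tokenizer only; assumes English text
    matcher = Matcher(nlp.vocab)
    matcher.add(label, patterns)
    needs_review = []
    for eg in examples:
        spans = eg.get("spans", [])
        if any(span.get("label") == label for span in spans):
            continue  # label is already annotated in this example
        doc = nlp(eg["text"])
        if matcher(doc):  # non-empty list means at least one match
            needs_review.append(eg)
    return needs_review


# Hypothetical stand-ins for your dataset and patterns.
examples = [
    {"text": "She moved to New York last year.", "spans": []},
    {"text": "Nothing relevant here.", "spans": []},
    {"text": "New York is big.",
     "spans": [{"start": 0, "end": 8, "label": "LABEL_8"}]},
]
patterns = [[{"LOWER": "new"}, {"LOWER": "york"}]]

to_review = find_unannotated_matches(examples, patterns, "LABEL_8")
print([eg["text"] for eg in to_review])
```

The examples in `to_review` are the ones you'd want to queue up again, for instance by writing them out to a JSONL file and loading that file as the source of a manual annotation session.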
If you're training your model the "regular" way and under the assumption that all unannotated tokens are not entities, this would be a problem, yes. The impact on accuracy may not be huge for 30 examples out of thousands, but it can lead to worse results.
Essentially, the model will try to "make sense of" the fact that a word is labelled in one context and not labelled in a different, similar context. The weights you train will reflect that – but if they're based on incorrect labels, they may lead to worse results. Similarly, if some of those 30 examples end up in your evaluation set, your evaluation will be wrong, because you're validating your model against incorrect answers (e.g. penalising it if it predicts label 8 in one of those texts). This means your accuracy will be unreliable.
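To make the evaluation point concrete, here's a tiny worked example in plain Python. The text, the character offsets and the "LABEL_8" name are invented, and the scoring function is a simplified exact-match precision/recall, not the scorer Prodigy or spaCy actually use. It only illustrates how a correct prediction gets counted as an error when the gold annotation is missing a label.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical text: "Anna flew to Lisbon"
predicted = {(0, 4, "PERSON"), (13, 19, "LABEL_8")}  # model output
incomplete_gold = {(0, 4, "PERSON")}                 # label 8 never annotated
complete_gold = {(0, 4, "PERSON"), (13, 19, "LABEL_8")}

scores_incomplete = precision_recall(incomplete_gold, predicted)
scores_complete = precision_recall(complete_gold, predicted)
print(scores_incomplete)  # (0.5, 1.0): the correct LABEL_8 span is a false positive
print(scores_complete)    # (1.0, 1.0)
```

The model made the same predictions in both cases. Only the gold data changed, which is why the reported numbers stop being trustworthy when some evaluation examples are incompletely annotated.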
By "ignore an annotation", do you mean ignore/skip it during training? Or when you load in the examples?
You can always write a script that loads your examples and then either filters out the examples you want to ignore (based on whichever logic), or sets "answer": "ignore", which is equivalent to hitting the "ignore" button in the UI and means that the example won't be used for training. For example:
from prodigy.components.db import connect

ignored_labels = ["BAD_LABEL1", "BAD_LABEL2"]

db = connect()
examples = db.get_dataset("your_dataset")  # load annotations
for eg in examples:
    for span in eg.get("spans", []):  # iterate over annotated spans
        if span["label"] in ignored_labels:
            # Example contains span with bad label, so we set its answer
            # to "ignore" and move on to the next example
            eg["answer"] = "ignore"
            break
db.add_dataset("new_dataset")  # create a new dataset for filtered examples
db.add_examples(examples, ["new_dataset"])  # add modified examples to new set
Just make sure to use a new dataset, so you don't end up with conflicting answers. You can then train from the new dataset, and the ignored examples will be skipped.
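If you'd rather filter the examples out entirely instead of marking them as ignored, a minimal sketch of that variant could look like the following. It works on plain dictionaries in the same format as the examples above; the function name and the sample data are hypothetical, and you'd still save the result to a new dataset as in the script above.

```python
def drop_examples_with_labels(examples, ignored_labels):
    """Keep only examples that have no span with an ignored label."""
    ignored = set(ignored_labels)
    return [
        eg for eg in examples
        if not any(span.get("label") in ignored for span in eg.get("spans", []))
    ]


# Hypothetical stand-ins for the examples loaded from your dataset.
examples = [
    {"text": "first", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "BAD_LABEL1"}]},
    {"text": "second", "answer": "accept",
     "spans": [{"start": 0, "end": 6, "label": "GOOD_LABEL"}]},
    {"text": "third", "answer": "accept"},  # no spans at all
]

kept = drop_examples_with_labels(examples, ["BAD_LABEL1", "BAD_LABEL2"])
print([eg["text"] for eg in kept])
```

Setting the answer to "ignore" keeps a record of which examples were excluded, while filtering gives you a smaller dataset. Which one is more convenient depends on whether you want to revisit the excluded examples later.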