Two Questions on Teach recipes

Hello! I have two questions on how to use ner.teach and textcat.teach.

We have trained an NER model using ~1000 messages and 20 entities. We are hoping to use ner.teach to improve our model, but I can't quite figure out how to use that recipe. When I export the results from ner.teach, each message only has the single entity I validated, not all entities in the message. Should I then run these sentences through ner.correct, or is that not necessary? How can I best use these results?

And finally, I also built a textcat model using a similar number of messages. But when I run textcat.teach, I get a message saying there are no tasks. I trained the model outside of Prodigy, in spaCy; could that be why?


That's the concept, yes – instead of labelling all entities in the text by hand or in a semi-automated way, ner.teach uses the model to suggest entities (based on all possible analyses of the text) and asks you whether each suggestion is correct. Even if you don't know the answer for every single token (and in some cases only know that a certain analysis is not correct), you can still update the model proportionally and nudge it in a more correct direction.

Prodigy's training recipes implement a mechanism to update from this type of binary, incomplete data (ner.batch-train or the new train with --binary). So after collecting annotations with ner.teach, you'd then update your base model with those annotations, and the updated model will hopefully produce better predictions than before.
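As a rough sketch, the command-line workflow could look like this (the dataset name, model path and labels are placeholders – check prodigy ner.teach --help and prodigy train --help for the exact arguments in your version):

```shell
# Collect binary yes/no annotations with the model in the loop
prodigy ner.teach ner_binary ./base_model ./messages.jsonl --label PERSON,ORG

# Update the base model from those binary annotations
prodigy train ner ner_binary ./base_model --binary --output ./improved_model
```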

Do the labels you've set on the command line match the labels in the model? Otherwise, Prodigy can't find any suggestions. Also, are you using a new dataset? If you've already annotated those examples before and they're in the same dataset, Prodigy will skip them (so you don't get asked the same question twice) and you'll end up with an empty stream.

(Whether you train the model with spaCy directly or in Prodigy won't make a difference for things like this – in both cases, you're doing the same thing and calling nlp.update with examples.)

Thank you so much!! I used ner.batch-train, but when I evaluate my new model against our existing test set, my F1 score drops by about 10% (precision stayed the same but recall went way down). Is it only saving the one entity I marked as correct in ner.teach, and thus "training" the model to return only a single entity?

If you're using ner.batch-train or train with --binary then no – the training process here was specifically designed for incomplete and binary yes/no answers. Before training, Prodigy will merge all annotations on the same texts, and all unannotated tokens will be considered missing values. (Btw, if you're interested in how the updating process works for binary annotations, my slides here show an example.)
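To illustrate the merging step, here's a minimal pure-Python sketch (a simplified illustration, not Prodigy's actual implementation or data format): accepted annotations on the same text are merged into one view, and any token no annotation covers stays None, i.e. "unknown", rather than being treated as "not an entity":

```python
def merge_binary_annotations(examples):
    """Merge accepted binary annotations per text; unannotated tokens stay missing.

    Each example is a simplified dict:
      {"text": str, "tokens": int (token count),
       "span": (start_token, end_token, label), "answer": "accept" or "reject"}
    Returns {text: list of label-or-None per token}, where None means "unknown".
    """
    merged = {}
    for eg in examples:
        text = eg["text"]
        # Start with every token marked as missing/unknown
        if text not in merged:
            merged[text] = [None] * eg["tokens"]
        # Only accepted spans contribute gold labels; everything else stays None
        if eg["answer"] == "accept":
            start, end, label = eg["span"]
            for i in range(start, end):
                merged[text][i] = label
    return merged

examples = [
    {"text": "Apple hired Tim", "tokens": 3,
     "span": (0, 1, "ORG"), "answer": "accept"},
    {"text": "Apple hired Tim", "tokens": 3,
     "span": (2, 3, "PERSON"), "answer": "accept"},
]
print(merge_binary_annotations(examples))
# {'Apple hired Tim': ['ORG', None, 'PERSON']}
```

The key point is the middle token: it was never annotated, so it stays None and the model isn't updated as if it were a confirmed non-entity.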

Depending on the annotations and specific texts and entities, it can always happen that the binary annotations don't move the needle very much. If training the model on the binary annotations makes the model significantly worse, you might also want to double-check that your data is consistent (e.g. review a random sample using the review recipe). The dataset you're training from should only contain binary annotations, and shouldn't label any partial suggestions as accepted (see here for background on this).

Hi Ines!! After looking through our data, I definitely agree that we have some inconsistently tagged data in our training set. Do you have any advice on how to identify possibly mis-tagged data in Prodigy?

The review recipe and interface is a good way to re-annotate data. You can see the original annotation and get to "overrule" it if it's incorrect. The original example is saved with the new annotation, so you don't lose the reference. The recipe also groups annotations together if you have overlaps (e.g. the same data annotated by multiple people).

If you can identify patterns that indicate an annotation might be incorrect, that's very helpful, too, because it lets you write a script to pre-select candidates (so you don't have to go through all examples again). You can then queue those up for annotation first, and maybe periodically re-run training experiments to see if the corrected data makes a difference.
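One simple heuristic for such a script, sketched here in plain Python (the input format is a simplified list of pairs you'd extract from your exported dataset, not Prodigy's exact JSONL schema): collect every labelled span text and flag the ones that were tagged with more than one label, since those are likely candidates for review:

```python
from collections import defaultdict

def find_inconsistent_spans(annotations):
    """Flag span texts that were tagged with conflicting labels.

    annotations: iterable of (span_text, label) pairs extracted
    from your exported annotations.
    Returns {span_text: set of labels} for spans with more than one label.
    """
    labels_seen = defaultdict(set)
    for span_text, label in annotations:
        labels_seen[span_text.lower()].add(label)
    return {span: labels for span, labels in labels_seen.items()
            if len(labels) > 1}

annotations = [
    ("Amazon", "ORG"),
    ("Amazon", "ORG"),
    ("Amazon", "LOC"),   # the river vs. the company: worth reviewing
    ("Berlin", "LOC"),
]
print(find_inconsistent_spans(annotations))
# flags "amazon", which was tagged both ORG and LOC
```

A conflicting label isn't always an error (some span texts are genuinely ambiguous), but it's a cheap way to rank which examples to re-annotate first.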