ner.teach - couple of questions


I've trained a model with gold-std data and now I want to improve it by using ner.teach.
The problem is that Prodigy seems to show score 1.0 for almost all its suggestions and the vast majority of its suggestions are actually correct. It's not showing me a whole lot of the entities that I actually need to improve the score for either.

Should I run ner.teach only with the labels that I want to improve?

Another thing is that the percentage went up to 90% really fast and now, at 95%, the model doesn't seem to improve anymore.

I'm assuming that's normal?

I've also noticed that the model is making mistakes in this recipe that it doesn't make with ner.correct. I could be wrong about this but maybe it's the tokenization? I do have a custom tokenizer but that should be part of the model I'm using.
Also the UI shows me that the lang is en when it should be de (it is in my train.cfg), but I don't see a way to overwrite that.
It's worth noting that I'm also setting --unsegmented. My model doesn't have a sentencizer and whatever the recipe was doing, it was messing up my samples big time :slight_smile:

Any suggestions would be much appreciated.

Hi! One quick note about the support for binary ner.teach annotations in the current nightly: this is the one feature we're still actively working on, so it's currently expected that you may see worse results when training from your binary data compared to v1.10. We'll have a new nightly available that we'll release once spaCy v3.1 is out, which should make this more stable and possibly lead to even better results.

This is expected, because the ner.teach workflow will use the beam parse with all possible interpretations of the given example for its suggestions and by default, it uses uncertainty sampling to choose which suggestions to ask about. So you will see suggestions that aren't necessarily the model's most confident predictions (i.e. those that it currently chooses when you just run it over your text and which is what you see in a workflow like ner.correct that just shows you the final predictions).
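
To make the idea of uncertainty sampling more concrete, here is a minimal sketch in plain Python. It is not Prodigy's implementation: Prodigy's own sorter works on a stream and is more involved, while this version simply ranks a fixed batch. The `uncertainty` and `most_uncertain` helpers and the example candidates are made up for illustration.

```python
from typing import List, Tuple

# (score, description) pairs; the scores and entities are invented examples.
Scored = Tuple[float, str]


def uncertainty(score: float) -> float:
    """Map a score to [0, 1]: 1.0 at score 0.5, 0.0 at score 0.0 or 1.0."""
    return 1.0 - abs(score - 0.5) * 2.0


def most_uncertain(scored: List[Scored], n: int) -> List[Scored]:
    """Return the n candidates whose scores are closest to 0.5."""
    ranked = sorted(scored, key=lambda item: uncertainty(item[0]), reverse=True)
    return ranked[:n]


candidates = [
    (0.98, "Berlin -> LOC"),   # confident prediction, low priority
    (0.52, "Siemens -> ORG"),  # close to 0.5, high priority
    (0.10, "Mai -> PER"),      # confidently unlikely, low priority
    (0.45, "Mai -> DATE"),     # close to 0.5, high priority
]

picked = most_uncertain(candidates, 2)
for score, description in picked:
    print(f"{score:.2f}  {description}")
```

With a selection rule like this, the questions you are asked skew towards analyses the model is unsure about, which is why they can differ from the final predictions shown in ner.correct.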

One thing to keep in mind is that when you're training from only binary annotations, the model is also only evaluated on a held-back sample of those examples. So the score here reflects how well the model does on the binary evaluation examples. Depending on how many examples you have in total, this may or may not be very representative. If the score goes up, it certainly shows that the model has learned something – but ideally, you'd still want to be evaluating the resulting model properly in a separate step, and check how its predictions improve on your dedicated, gold-standard evaluation data. (This is typically best done as a separate step and directly with spaCy.)
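
As a rough illustration of what a proper evaluation measures, here is a small sketch of entity-level precision, recall and F-score over exact span matches. In practice you'd let spaCy compute this for you (for example with its `spacy evaluate` command); the `prf` helper and the example spans below are hypothetical and only show the arithmetic.

```python
from typing import Dict, List, Set, Tuple

# An entity is (start_char, end_char, label); only exact matches count.
Span = Tuple[int, int, str]


def prf(gold: List[Set[Span]], pred: List[Set[Span]]) -> Dict[str, float]:
    """Compute micro-averaged precision, recall and F-score over documents."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)  # predicted and in the gold data
        fp += len(pred_spans - gold_spans)  # predicted but not in the gold data
        fn += len(gold_spans - pred_spans)  # in the gold data but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"p": precision, "r": recall, "f": f_score}


# Two made-up documents: gold-standard entities vs. model predictions.
gold = [{(0, 6, "LOC"), (10, 17, "ORG")}, {(4, 9, "PER")}]
pred = [{(0, 6, "LOC")}, {(4, 9, "ORG")}]

scores = prf(gold, pred)
print(scores)
```

The important part is that `gold` contains complete annotations for each document, which is what a binary evaluation sample can't give you.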

I really appreciate your detailed answer, thank you!

One more question, though: Can I run ner.teach with only one label, despite there being multiple labels in most samples?
I'm a little worried that I might be making the others worse, just in case this isn't a use-case that was intended.
It's only specific labels that I want to improve this way and since they are underrepresented in my dataset, it takes a while to actually see any of those.


That's a good question. If you only train on the one label, you might indeed cause what we call "catastrophic forgetting" where the model starts adjusting too much on the new data, and "forgets" what it learned before.

The prodigy train implementation for Prodigy 1.11 has a good way to remedy this, as you can provide multiple NER datasets to train on. This means that you can mix in annotations with all labels, with data obtained from ner.teach on just the one label. To obtain the data for all the labels, what you could do is run ner.correct, quickly glance at whether each example looks OK, and hit accept if it does. You could even exclude the one label that is important to you for this part, so you don't have to judge it twice. If you're not too concerned about the other labels, this annotation should hopefully go rather quickly, and it will give the model some nice examples for all other labels to mix in with your binary annotations.

Awesome, thank you!

Just released a new nightly that includes improvements around the ner.teach workflow. Could you try re-running your experiments and see how you go? :slightly_smiling_face: You'll now also be able to mix binary with manual annotations and ner.teach will ask about sentences with no entities in order to improve performance.

I've decided to go with a different scheme for annotating my data so I'm starting my dataset from scratch. If and when I get to the ner.teach phase, I'll let you know how it goes :slight_smile:

I'm using Prodigy v1.11 and have several concerns that are similar to above.

  1. ner.teach appears only to show me examples with a score = 1.0. It doesn't sound like this is expected behavior. The underlying model has an F-score of 0.80, so there should be plenty of lower score examples.

  2. For records with multiple entities of the same label, ner.teach appears to show the record with only one entity highlighted at a time. That is, it shows me a record with 1 of 2 entities highlighted. I reject. It then shows me the same record with the other entity highlighted. I'm also rejecting this. Both times, the score was 1.0. This seems strange.

One thing to keep in mind here is that the suggestions you see are based on a number of possible interpretations of the document, not just the model's actual predictions. Even so, depending on the data, you'd expect some lower score predictions here. Does this occur from the very beginning, or after doing some annotation? If it happens gradually, this could indicate that the model ends up in a weird state and the beam parse stops

Ah, that's not the correct interpretation of the interface: ner.teach will always ask you about a single entity at a time, and the goal is to give feedback on whether that particular entity is correct. So in the first case, you'd accept if the entity span is correct, and so on. If you're rejecting an entity, the feedback you give the model is "this particular entity span is incorrect, try again". That's not what you want here.
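
To illustrate what each answer means, here is a small sketch that groups binary answers by text into "known correct" and "known incorrect" spans. The task dictionaries follow the general shape of Prodigy's annotation tasks ("text", "spans", "answer"), but the `collect_binary_feedback` helper and the example data are hypothetical, and this is not how Prodigy updates the model internally.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]


def collect_binary_feedback(answers: List[dict]) -> Dict[str, Dict[str, Set[Span]]]:
    """Group accepted and rejected spans by the text they were shown on."""
    feedback = defaultdict(lambda: {"correct": set(), "incorrect": set()})
    for task in answers:
        if task["answer"] not in ("accept", "reject"):
            continue  # ignored examples carry no information
        key = "correct" if task["answer"] == "accept" else "incorrect"
        for span in task.get("spans", []):
            feedback[task["text"]][key].add(
                (span["start"], span["end"], span["label"])
            )
    return dict(feedback)


text = "Angela Merkel besuchte Siemens."
answers = [
    # Each question highlights exactly one entity.
    {"text": text, "spans": [{"start": 0, "end": 13, "label": "PER"}], "answer": "accept"},
    {"text": text, "spans": [{"start": 23, "end": 30, "label": "ORG"}], "answer": "accept"},
    {"text": text, "spans": [{"start": 0, "end": 6, "label": "ORG"}], "answer": "reject"},
]

feedback = collect_binary_feedback(answers)
print(feedback[text])
```

Rejecting a correct span because another entity in the text isn't highlighted would put that span into the "incorrect" set, which is the opposite of the feedback you want to give.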

If you want to annotate and correct the complete, actual predictions made by the model, maybe you just want to use ner.correct instead. In Prodigy v1.11.+, you'll also be able to set the --update flag to update the model in the loop from your annotations.

Thank you, Ines! When I use my own model trained with "prodigy train", using ner.teach only shows scores of 1.00 from the very beginning.

If I use a prebuilt pipeline (en_core_web_trf), it will start with lower scores at first, but after displaying a sequence of examples with no entities (correctly), it will switch back to showing cases with entities, but all annotations will be wrong and all scores will be 1.00 going forward. I created a separate ticket about this latter case.