Thanks so much for your detailed descriptions, video and step-wise procedure to reproduce the erratic behaviour. This was extremely useful for replicating and debugging the situation on our end. We've done a detailed review of the `ner.teach` implementation and found the culprit. In a nutshell: the internal updating of the model worked well for non-transformer models, but used inappropriate optimizer settings (e.g. the learning rate) when transformer-based models were used. As a result, the model took steps that were much too big when being updated with new annotations, eventually ending up in a rubbish state.
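To illustrate the general idea (a simplified sketch, not Prodigy's actual internals, and the learning rate value is purely illustrative): transformer weights are much more sensitive to step size than CNN weights, so incremental updates need a far more conservative optimizer:

```python
import spacy
from spacy.training import Example
from thinc.api import Adam

nlp = spacy.load("en_core_web_trf")  # any transformer-based pipeline

# Transformer weights are sensitive: a learning rate that's fine for a
# CNN pipeline can push a transformer into a rubbish state within a few
# updates. The value below is illustrative, not Prodigy's setting.
optimizer = Adam(learn_rate=5e-5)

annotations = [("I moved to Berlin last year.", {"entities": [(11, 17, "GPE")]})]
examples = [
    Example.from_dict(nlp.make_doc(text), annots)
    for text, annots in annotations
]
nlp.update(examples, sgd=optimizer)
```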
I was able to replicate the original erratic behaviour on the spam text messages you linked, and I'm happy to report that after fixing the problem, it no longer appears. We'll be working towards a small bugfix release that includes this fix.
All that said - I do want to provide a little bit more context about the `ner.teach` recipe as well. By design, it focuses on cases that the model is uncertain about. It typically starts off with a few "straightforward" annotations (often with a score of 1 when using a well-trained transformer model), but then moves into the more "uncertain" space. This doesn't necessarily mean that your model is starting to do worse; the "certain" predictions simply aren't shown anymore after some time. It's good to keep that in mind when running this recipe and interpreting the scores & predictions. However, your original observation that punctuation was tagged with a 1.0 score definitely pointed towards an error (which is now fixed, as explained above). There may still be cases where punctuation is tagged, but these should hopefully have a low score so you can reject them.
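As a rough illustration of what "focusing on uncertainty" means (a simplified sketch, not Prodigy's actual sorter; all names and scores here are made up): scores near 0.5 carry the most information, so those examples surface first, while confident predictions quietly drop out of the stream:

```python
# Simplified uncertainty sampling: a score near 0.5 means the model is
# unsure, so that example is the most valuable one to annotate next.
def uncertainty(score: float) -> float:
    # 1.0 when score == 0.5, 0.0 when score is 0.0 or 1.0
    return 1.0 - abs(score - 0.5) * 2.0

candidates = [
    {"text": "Meet me at the station", "score": 0.98},   # confident accept
    {"text": "WINNER!! Claim your prize", "score": 0.52}, # uncertain
    {"text": ".", "score": 0.06},                         # confident reject
]

# The most uncertain examples come first; highly confident ones
# effectively stop appearing, even though the model isn't doing worse.
for ex in sorted(candidates, key=lambda c: uncertainty(c["score"]), reverse=True):
    print(f'{ex["text"]!r} -> uncertainty {uncertainty(ex["score"]):.2f}')
```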
Also, I want to echo Ines' recommendation to look into `ner.correct` as well. To avoid catastrophic forgetting, you could run your original model on a bunch of text and use its predictions as "silver" annotations that you then mix into the gold annotations you've created with `ner.correct`. That ensures the model doesn't "forget" its old behaviour, while still learning about the new cases as well.
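Here's a minimal sketch of that workflow (file names and the example model are placeholders): run the original model over raw text, store its predictions in the same span format as your gold data, and then mix the two before training:

```python
import spacy
import srsly  # installed alongside spaCy; convenient for JSONL

nlp = spacy.load("en_core_web_trf")  # placeholder: load your original model here

raw_texts = ["Send the report to Alice by Friday.", "The office in Berlin is closed."]

silver = []
for doc in nlp.pipe(raw_texts):
    # Record the model's own predictions as "silver" annotations, using
    # the same span format as the gold data from ner.correct.
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    silver.append({"text": doc.text, "spans": spans})

srsly.write_jsonl("silver_annotations.jsonl", silver)
```

You could then import silver_annotations.jsonl into the same dataset as your gold annotations (for instance with `prodigy db-in`), so the training data contains both: the silver examples anchor the old behaviour while the gold examples teach the new cases.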
Let us know if you have any further doubts or questions!