ner.teach starts going wacky about 30 examples in

I'm using ner.teach to fine-tune a model to detect ORG entities. It starts off great and presents reasonable examples that are fairly accurate. About 20 examples in, it (correctly) shows a sequence of examples with no entities, each with a score of 1.00. It then switches back to showing examples with entities, but this time around it's clear that something broke the model. It starts highlighting nonsense (punctuation, articles, etc.), and every one of these examples has a score of 1.00. It never recovers. Clearly a bug.

Here is the command line I'm using to kick off the process:

prodigy ner.teach data_binary en_core_web_trf data.txt --label ORG -U

Could it be that it's incrementally training the model and something about the sequence of no-entity examples causes it to forget everything?


Hi! I already replied to your other post and it sounds like one problem might be this:

If you're rejecting suggestions that are correct, the model will be updated with the information that "this span is not an entity" and it can get confused and will try to come up with interpretations that match this new information. So it's definitely possible that after a few batches, the model ends up in a weird state where it starts to suggest you completely arbitrary spans.

Also, moving the discussion from the previous thread over:

How are you evaluating your model? Do you have a dedicated evaluation set, or are you just evaluating against a held-back percentage of the binary annotations? If you're evaluating against binary annotations and you only have a small set of annotations, you can easily end up with relatively unreliable evaluation scores: you're only evaluating against the sparse binary information, so you won't know if any of the other predictions are correct or not. And if some of the evaluation examples are examples with no entities that the model gets correct very reliably, you may end up with an accuracy of 100% that's not actually very representative of the overall performance. So it's usually much better to evaluate against a stable set of gold-standard annotations.
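To make the point about sparse binary evaluation concrete, here is a toy sketch in plain Python. It does not call Prodigy or spaCy; the spans, the answers and both helper functions are made up for illustration. It scores the same hypothetical predictions once against accept/reject answers only, and once against a full set of gold spans.

```python
# Toy illustration only: the data and helper names are hypothetical.
# A span is a (start, end, label) tuple.

def binary_accuracy(predicted, binary_answers):
    """Accuracy measured only on spans that received an accept/reject answer."""
    correct = 0
    for span, accepted in binary_answers.items():
        # Correct if the model predicts accepted spans and omits rejected ones.
        if (span in predicted) == accepted:
            correct += 1
    return correct / len(binary_answers)


def gold_prf(predicted, gold):
    """Precision, recall and F-score against complete gold annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# The model predicts three ORG spans.
predicted = {(0, 1, "ORG"), (5, 6, "ORG"), (9, 10, "ORG")}

# Binary annotations only cover two spans: one accepted, one rejected.
binary_answers = {(0, 1, "ORG"): True, (3, 4, "ORG"): False}

# Gold annotations cover everything in the text.
gold = {(0, 1, "ORG"), (12, 14, "ORG")}

binary_score = binary_accuracy(predicted, binary_answers)
precision, recall, f_score = gold_prf(predicted, gold)
print(binary_score, precision, recall, f_score)
```

In this made-up case the binary score looks perfect because the two extra predictions and the missed entity never show up in the accept/reject answers, while the gold-standard comparison exposes both.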

Tnx Ines.

I reran ner.teach and accepted cases where it highlights one entity correctly, but still run into the issue of the model suddenly giving crazy results after a while. Here's what happened on the latest run:

It shows 10 examples, gets 9/10 correct. Then shows "Loading..." for about 10 seconds
Shows 12 examples... gets 12/12 correct. Displays "Loading..." then "No tasks available" for about 10 seconds before loading more examples.
Displays another 10 examples... gets 8/10 correct. Shows "Loading..." for 10 seconds.
Displays another 12... first 6/6 correct. The 6th - 12th in this set (correctly) have no entities and score 1.00. I accept.
Shows 10 more examples. First 3 have no entities and are all correct. Next 7 have entities and get 7/7 correct.
Loads 12 more examples. Gets 4/4 correct. On the 5th example, highlights only a colon (:) with a score of 1.00. Reject. Same on example 6... only a colon in the text is highlighted. Reject. Example 8, an article is highlighted with a score of 1.00. Reject. Same with example 9.
Loads more examples. From this point forward, everything is wrong and everything has a score of 1.00. It's highlighting entire sentences, articles, etc.

Here's how I reproduced the same error on public data (Spam Text Message Classification | Kaggle).

  1. Create a text file of this data with 1 row per message. Only include the text. I named the file "messages.txt".

  2. Run the following:
    prodigy ner.teach messages_binary en_core_web_trf messages.txt --label ORG -U

  3. Proceed to accept/reject records. At the 50th example, the model starts showing crazy highlights with everything having a score of 1.00. Here is what it starts showing on example 50:

(Screenshots of examples 50, 51 and 55.)

Here is a link to a 2.5 min video I made to demonstrate the issue on the above public data. https://youtu.be/G6yn3_mJK5M

The issue is apparent after the 01:57 mark.

@ines - I hope the video shows the issue clearly enough. What do you think is causing the observed behavior of ner.teach around 1:57?

It's interesting that this issue (the model producing arbitrary annotations) consistently happens after the series of examples with no entities.

Thanks for the super detailed report and resources, this is really helpful! I haven't had time to re-run and test it in detail yet but some first ideas:

It's possible that this is some interaction of the beam search / updating in combination with the transformer. Did you manage to reproduce the same results with one of the CNN models? The transformer part is a bit surprising because I would have normally expected it to converge slower because it's less sensitive to these very small batch updates. (We've normally found it to be more effective to just collect a small dataset of gold-standard annotations and just train a transformer-based pipeline from scratch, which can often give you pretty good results with pretty small datasets, depending on the use case.)

This is a very interesting observation. (And interestingly, we actually started including examples with no entities to make the process more robust.) But this is a very good starting point for debugging!

We'll definitely look into this – it might be a bug in Prodigy's NER annotation model implementation, or it could also point to a bug in spaCy.

In the meantime, if you're using a transformer-based pipeline anyway, it probably makes more sense to switch to a more manual workflow like ner.correct for now and then retrain your model from scratch on a smaller dataset (because you'll likely need fewer examples anyway).

Thank you, @ines. 2 follow-up questions:

  1. When you ask if I was able to reproduce results on one of the CNN models, what models are you referring to? I was running the ner.teach example on en_core_web_trf.

  2. I like your suggestion of building a model from scratch on data from ner.correct, but here's where I'm struggling:

  • if I train a model from scratch using 2,000 examples collected by ner.correct (run with base model en_core_web_trf), scored results are much worse vs. scoring with en_core_web_trf alone (much lower recall; probably due to en_core_web_trf's larger vocabulary vs. my 2,000 ner.correct annotations).
  • alternatively, if I take en_core_web_trf and further train it using the 2,000 annotations, I run into catastrophic forgetting.

There must be a better way to combine insights from both models. The best approach I've found so far is to score the data separately with en_core_web_trf and with the model built from scratch on the 2,000 annotations, then combine the results after scoring. But this feels sloppy (and is slower to score)... there must be a way to build one model that achieves the same.

Ah, by CNN models I meant the non-transformer pipelines like en_core_web_sm / _md / _lg.

By "from scratch", I meant just training a new transformer-based pipeline with just the transformer embeddings and your new annotations, and not trying to update the existing en_core_web_trf pipeline. You already gain a lot from the embeddings, so if you use an existing pipeline to help you annotate (e.g. by using ner.correct with a trained pipeline and the labels you're interested in), you might not need that many examples to get similar results on your specific data.

Thank you @ines. I will try both these out and let you know how it turns out. Many thanks for your wisdom!

Hi @tdauria,

Thanks so much for your detailed descriptions, video and step-wise procedure to reproduce the erratic behaviour. This was extremely useful to replicate and debug the situation on our end. We've done a detailed review of the ner.teach implementation and have found the culprit that was causing this. In a nutshell: the internal updating of the model was working well for non-transformer models, but was using inappropriate settings for the optimizer (e.g. learning rate) when transformer-based models were used. This resulted in the model taking "too big leaps" when being updated with new annotations, eventually ending up in a rubbish state.
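As a generic illustration of what "too big leaps" means, here is a minimal sketch of plain gradient descent on a one-parameter toy loss. This is not Prodigy's or spaCy's optimizer code, and the function name, the loss and the two learning rates are assumptions chosen only to show the effect.

```python
# Toy illustration only: minimise loss(w) = w ** 2, whose gradient is 2 * w.
# The best value is w = 0. Nothing here reflects the actual optimizer settings.

def descend(learning_rate, steps=20, start=1.0):
    """Run plain gradient descent and return the final parameter value."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient
    return w


# A modest learning rate shrinks w towards 0 on every step.
small_steps = descend(learning_rate=0.1)

# An oversized learning rate overshoots 0 each time and lands further away.
big_steps = descend(learning_rate=1.1)

print(small_steps, big_steps)
```

With the larger rate every update overshoots the minimum, so the parameter moves further from the optimum instead of closer, which is one way a model can end up in a state it never recovers from.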

I was able to replicate the original erratic behaviour on the spam text messages you linked, and I'm happy to report that after fixing the problem, the behaviour does not seem to appear anymore. We'll be working towards a small bug-fix release that includes this fix.

All that said, I do want to provide a little more context about the ner.teach recipe as well. By design, it focuses on cases that the model is uncertain about. It typically starts off with a few "straightforward" annotations (often with score 1 when using a well-trained transformer model) but then will go into the more "uncertain" space. This doesn't necessarily mean that your model is starting to do worse, because the "certain" predictions simply aren't shown anymore after some time. It's good to keep that in mind when running this recipe and interpreting the scores & predictions. However, your original observation that punctuation was tagged with a 1.0 score definitely pointed towards an error (which is now fixed as I explained). There may still be cases where punctuation is tagged, but hopefully these should have a low score so you can reject them.
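The idea of preferring uncertain cases can be sketched in a few lines of plain Python. This is a simplification under stated assumptions, not the recipe's actual implementation (which works on a stream of examples rather than a fixed list); the function name and the scores below are hypothetical.

```python
# Simplified sketch of uncertainty-based selection. Each candidate is a
# (score, text) pair, where score is the model's confidence between 0 and 1.

def most_uncertain(scored_examples, n):
    """Return the n candidates whose score is closest to 0.5."""
    ranked = sorted(scored_examples, key=lambda item: abs(item[0] - 0.5))
    return ranked[:n]


candidates = [
    (0.99, "A"),  # model is very sure this is an entity
    (0.52, "B"),  # close to a coin flip
    (0.10, "C"),  # model is fairly sure this is not an entity
    (0.45, "D"),  # close to a coin flip
]

to_annotate = most_uncertain(candidates, n=2)
print(to_annotate)
```

Here the confident candidates A and C are never shown, which is why the examples you see during annotation can look worse than the model's overall behaviour.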

Also, I want to echo Ines' recommendation to look into ner.correct as well. To avoid catastrophic forgetting, you could run your original model on a bunch of text and use these predictions as "silver" annotations that you would then mix into the gold annotations you've created with ner.correct. That ensures that the model doesn't "forget" its old behaviour, while still learning about the new cases as well.
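The silver/gold mix described above could be sketched roughly as follows. This is plain Python with no Prodigy or spaCy calls: `build_silver`, `mix`, `fake_predict` and the "source" field are hypothetical names for illustration, and the rule that gold wins when the same text appears in both sets is an assumption, not something stated in this thread.

```python
# Sketch under stated assumptions: examples are dicts with "text" and "spans".

def build_silver(texts, predict):
    """Run the existing model over raw text and keep its predictions."""
    return [
        {"text": text, "spans": predict(text), "source": "silver"}
        for text in texts
    ]


def mix(gold, silver):
    """Combine both sets, dropping silver examples that also exist as gold."""
    gold_texts = {example["text"] for example in gold}
    kept_silver = [ex for ex in silver if ex["text"] not in gold_texts]
    return list(gold) + kept_silver


def fake_predict(text):
    """Stand-in for the original model: tags the literal word 'Acme' as ORG."""
    start = text.find("Acme")
    if start == -1:
        return []
    return [{"start": start, "end": start + 4, "label": "ORG"}]


gold = [
    {
        "text": "Call Acme now",
        "spans": [{"start": 5, "end": 9, "label": "ORG"}],
        "source": "gold",
    }
]
raw_texts = ["Call Acme now", "Acme called", "No entities here"]

silver = build_silver(raw_texts, fake_predict)
training_data = mix(gold, silver)
print(len(training_data))
```

With a real pipeline, `predict` could be a small wrapper around `nlp(text).ents` that converts each entity's `start_char`, `end_char` and `label_` into this dict format. Keeping silver examples with no spans is deliberate here, so the mixed set still contains texts where the original model found nothing.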

Let us know if you have any further doubts or questions!