I am trying to teach the model a new entity type, and I have to say the “catastrophic forgetting” problem is really a big problem for me.
I have 2000 phrases in a .txt file.
I have another file with 83 materials (my new entity type).
I am working with the latest versions of Prodigy (1.4.1) and spaCy (2.0.10).
The Prodigy tool is clear to me now, but I don't really know how to use it to avoid this behavior.
I found a lot of posts discussing this problem, but they haven't helped me the way I need.
In particular this one: New entity model ruins other entities.
I have read it again and again, but with no result.
This blog post we’ve published has some more background on this, including strategies to prevent it. One approach is to mix in examples that the model previously got right and train on both those examples and the new examples.
This is pretty easy to do in Prodigy – after collecting annotations for your new TECH entity, run the model on the same input text, and annotate the other labels. You can add all annotations to the same dataset, and then train your model with those examples. Make sure to always use input data that's similar to what the model has to process at runtime. This might also give you a little boost in accuracy over the standard English model, because you're also improving the existing entity types on your specific data.
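As a rough sketch of that mixing step (the records, labels, and helper name here are illustrative, not from your data), the idea in Python is just to pool both sets of annotations and shuffle them before training:

```python
import random

def mix_datasets(new_examples, revision_examples, seed=0):
    """Combine annotations for the new label with revision annotations
    for the existing labels, shuffled so batches contain both kinds."""
    combined = list(new_examples) + list(revision_examples)
    random.Random(seed).shuffle(combined)
    return combined

# Illustrative records in Prodigy's span format
new_examples = [
    {"text": "Le béton armé est un matériau composite.",
     "spans": [{"start": 3, "end": 13, "label": "MATERIAL"}],
     "answer": "accept"},
]
revision_examples = [
    {"text": "Apple cherche a acheter une startup anglaise.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}],
     "answer": "accept"},
]

mixed = mix_datasets(new_examples, revision_examples)
```

Adding `mixed` to one dataset and batch-training on it is the whole trick: the model keeps seeing examples of the old labels while it learns the new one.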
May I ask you to be more precise about the process we can follow to avoid this?
I will try to give you more details (my English is not as good as yours).
I have tried a lot of things over the last four months to avoid this behavior.
First step: my latest tests consisted of annotating a revision data file with the default NER entities (PER, LOC, MISC, ORG) into a new dataset with the ner.train recipe. My revision data file contains almost 200 simple phrases, and the entities are nicely detected.
If I understand correctly, this step should be enough to avoid the “forgetting”. So I tried: I made maybe 300 annotations.
Second step: I added annotations to the same dataset with a new label, MATERIAL, applied to a text file containing 2000 phrases, again with ner.train. The loaded model is still the same: the original model (fr_core_news_sm).
Third step: I tried to use the ner.make-gold recipe to correct bad annotations (with --label “MISC,ORG,LOC,PER,MATERIAL”. Is that a good approach, or should I correct one label at a time?)
Next step: ner.batch-train. The best accuracy is around 0.703, but numerous words are tagged when they should not be.
So I am trying to optimize my data file, but I don't really think that is a good solution, and honestly, I don't know what I am doing wrong.
I hope this description helps you and that my English is understandable.
So, can no one help me?
Is my question not conventional?
Do you need more information?
Are you making a video to help people with this chronic forgetting problem?
An answer would be appreciated, because at the moment I am completely blocked. I don't know how to get past this behavior.
All of my tests have failed.
Sorry for the delay getting back to you, and for the lack of clarity on this. Also, happy Easter
The truth is that precise “just follow these steps” instructions simply don't exist for training new statistical models on new data sets. One reason for this is that every problem is different. Some entity recognition problems are very easy. It's also possible to have annotations which the model will be completely unable to learn (possibly even in principle).
This means there's no way to give clear guidance on how many examples you might need, or what might be wrong with your current data, or what you might need to do next. The only way to give that level of guidance would be to download your data and start working on your problem, which is a level of support we're not currently able to offer.
The best I can do is make a few guesses based on what you’ve said. I can also offer a few general observations. Some of these things are also a matter of opinion — it’s possible a different expert would disagree.
83 examples isn't very many. For a sense of scale, the en_core_web_sm model achieves 86% accuracy after being trained on around one million words annotated with entity types. There are 21,104 person mentions in that data set, and yet if you look at the results of the model in the web demo (via https://demos.explosion.ai ), you'll see it still makes many errors on the PERSON category, even though spaCy's entity recognizer is close to the current state-of-the-art. I'm not saying you necessarily need thousands of annotations. But that's how many are needed to give that level of performance on English, for that particular entity type.
Maybe your problem is easy to learn, and it can be learned with only twenty or thirty examples. Or maybe the problem is defined such that the model won't become accurate even with millions of examples. It's definitely possible that more annotations will help, though.
When you say “phrases in a text file”, do you mean that the file has only the phrases? The entity recogniser really assumes you’re tagging phrases in context. Otherwise it’s better to build a terminology list with terms.teach, and use the pattern matcher.
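If you do go the terminology-list route, the patterns file is just JSONL with one token-match rule per term, which is roughly what terms.to-patterns produces. A sketch, where the term list, label, and output file name are stand-ins for your own:

```python
import json

# A few of your 83 material terms would go here
materials = ["béton", "acier", "béton armé"]

def term_to_pattern(term, label):
    """One token-based match rule per term; "lower" makes it
    case-insensitive, and multi-word terms become multi-token rules."""
    return {"label": label, "pattern": [{"lower": t} for t in term.split()]}

patterns = [term_to_pattern(t, "MATERIAL") for t in materials]

with open("material_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```

You can then pass that file to a recipe with `--patterns material_patterns.jsonl`, so the matcher suggests candidate mentions in your running text.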
Out of interest, how long did it take you to make the 300 annotations?
With Prodigy I usually find the annotation to be very quick, even just using the manual mode. It depends on the entity density, but as a quick calculation: if each text is one to two sentences, I would expect each text to take around 20-30 seconds to annotate, which means you would get roughly 150 texts per hour, and around 1,000 per day.
If you use the matcher or a pre-trained model to pre-set the annotations with ner.make-gold, it’s often even faster. Finally, once you have a sufficiently accurate model (or pattern file), the ner.teach recipe can be even faster still. But while you have only a few annotations, the ner.manual mode is a good way to get started.
The entity recognition in the fr_core_news_sm model is based on “silver standard” data from Wikipedia. This may perform very poorly on your task: Wikipedia itself is quite unlike other text types, and the entity mentions are skewed by the Wikipedia editorial standards. So, the training data for the initial model may be a poor start for what you’re doing. It might be better to start from a blank model with vectors trained on your data. Possibly.
For only a few hundred annotations, 70% accuracy actually isn’t so bad!
A final possibility: even if you do everything right, sometimes the model may still fail to achieve useful accuracy. We refer to the process of training and evaluating a model as an “experiment” because we don’t know the result ahead of time. This is one of the reasons we designed Prodigy with an emphasis on rapid iteration: because some ideas simply don’t work.
First, thanks a lot for your answer.
I understand that each case is different.
I thought my case was very close to your examples, so I will try to add more details about mine.
Maybe it is just a vocabulary problem, or something else that is easier to solve than expected.
I will start with the basics: data and vocabulary.
ner.teach is not needed here because I already have this file?
That's what I understood.
2 - Here is my file with my revision data (I hope my vocabulary is good here).
It is a file with simple text phrases, and that's all, like:
Apple cherche a acheter une startup anglaise pour 1 milliard de dollard.
San Francisco envisage d'interdire les robots coursiers.
Londres est une grande ville du Royaume-Uni.
L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe.
La France ne devrait pas manquer d'électricité cet été,même en cas de canicule.
Nouvelles attaques de Donald Trump contre le maire de Londres.
Qui est le président de la France ?
As you can see, the phrases are very simple, and the results of the NER engine are good.
This is my data for avoiding the “catastrophic forgetting” problem.
Is that right too?
3 - Here is my file for learning to identify my materials (almost 2000 phrases),
extracted from wikipedia.fr:
Les Romains sont les plus anciens utilisateurs de béton connus à ce jour.
En 1849, le mariage de deux matériaux très utiles, l’acier et le béton, a donné lieu au béton armé.
Le béton est avant tout utilisé pour la construction résidentielle en Amérique du Nord, plus particulièrement pour les fondations, qui soutiennent le reste de la structure.
Les bois, la chaux, les sables, les mortiers, les pièces de fer, le plomb, les verres, les terres cuites architecturales, etc. sont de véritables « archives monumentales » que les sciences dites dures (géologie, sciences de la nature, physique, chimie …) et les analyses techniques, notamment celle des traces de production et de montage, permettent de décrypter.
Am I making mistakes up to this point?
4 - So, if I don't need the terms.teach recipe, I can start with ner.teach to add annotations in the Prodigy GUI, and ner.make-gold to correct them.
If my vocabulary is right, the “annotations” are the results in the JSONL file, right?
So, do you think I need more materials in my file? All existing materials?
Yes, it's my “learning” file.
I annotate my materials with this file, and I correct mistakes with ner.make-gold. Is that correct?
10-15 minutes max.
That's what I do, or else I really don't understand the system.
300 annotations for 200 phrases: that means some phrases contain 2 or more different entities.
Do you mean I have to invent 2000 sentences myself, because those from Wikipedia are not suitable?
That would be really strange, no? But not impossible.
It is even a good result; the big problem is the other words. My materials are well detected, but the tags on the other words make no sense.
All words except my materials are detected as PER, LOC, etc. That's the problem at the moment (and has been for 5 months).
To conclude: I tried to avoid the forgetting problem by adding some text with good tags to my dataset before running ner.teach on my “learning file” in the same dataset, and it does not work.
My new model applies the wrong entities to the wrong words. Some verbs are tagged as a person or an organisation. It makes no sense. I think it's because the model has forgotten everything,
while the process looks very simple in your videos.
What am I doing wrong?
EDIT: In fact, I tried to use the pseudo-rehearsal approach described here: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting.
So if I understand correctly, I have to add some text whose entities are well detected by the base model (fr), before adding text with its entities added manually (or semi-manually with Prodigy).
Is this the right process?
EDIT 2: Today, my data gives me 0.86 accuracy. That's a very good result for my tag. But the problem is still the same:
all the other words are annotated like hell.
Thanks, I think there’s been some terminology confusion. In particular it’s great to be clear that the materials list is a patterns file. 83 examples in that should be fine.
That's not correct – ner.teach is for training the model in context. The patterns file just finds all occurrences of those phrases, which makes it a good starting point for suggesting examples to correct in ner.teach.
I think we have a difficulty communicating about these things, that’s making the task much harder. Let’s settle on this vocabulary:
Entity type: The category of thing you want to tag
Entity mention: An example of an entity type, in context. E.g. “Apple and Amazon are companies” contains two entity mentions.
Entity term: A phrase that is often an entity mention, depending on context. E.g. “Apple” is often a mention of a company, but not always.
Patterns file: A list of match rules, to help you find entity mentions. Can be built from a list of entity terms using the terms.to-patterns recipe.
Entity recognition: The task of identifying entity mentions in text.
NER model: A function that tags entity mentions in text.
NER annotation: The task of labelling entity mentions in text.
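To make the term/mention distinction concrete, a tiny invented illustration of how a term found in context becomes a candidate mention (the sentence and label are made up):

```python
term = "Apple"  # an entity term: often, but not always, an ORG mention
text = "Apple and Amazon are companies."

# Each occurrence of the term in context is a candidate entity mention;
# whether it really is one depends on that context.
start = text.find(term)
mention = {"start": start, "end": start + len(term), "label": "ORG"}
```

Here the context makes the mention a real ORG; in “I ate an apple”, the same term would not be a mention at all.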
You start out with an NER model that identifies four entity types: PER, LOC, ORG, MISC. You want an NER model that can identify five entity types: PER, LOC, ORG, MISC, MATERIAL. The model needs examples of all 5 entity types. Prodigy supports multiple ways of annotating those examples:
You can simply feed text through ner.manual
You can feed text through ner.make-gold, and correct previous predictions
You can say yes or no to individual predictions using ner.teach
Previous predictions can be added to text using a patterns file, a statistical model, or some mix of the two.
If the previous model for PER, LOC, ORG and MISC is good enough, you might be able to assume all its annotations are correct, and just add them all to the training data. If it’s not so good, you’ll want to correct them, probably with the ner.make-gold recipe.
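For the “assume the previous model is correct” case, the revision examples end up as ordinary records in the same JSONL format as your MATERIAL annotations. A sketch of building one such record; the helper name is invented, and str.find stands in for offsets that would really come from the model's predictions:

```python
def make_revision_record(text, mentions):
    """Build a Prodigy-style record from (phrase, label) pairs that the
    base model predicted and that you have checked are correct."""
    spans = []
    for phrase, label in mentions:
        start = text.find(phrase)
        if start == -1:
            continue  # phrase not present; skip rather than guess
        spans.append({"start": start, "end": start + len(phrase),
                      "label": label})
    return {"text": text, "spans": spans, "answer": "accept"}

record = make_revision_record(
    "Londres est une grande ville du Royaume-Uni.",
    [("Londres", "LOC"), ("Royaume-Uni", "LOC")],
)
```

Records like this go into the same dataset as your MATERIAL examples, so the model keeps seeing correct PER/LOC/ORG/MISC annotations while training.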
Once you have a dataset with examples of all of your entity types, you can use ner.batch-train to train your model. You should be able to get some initial results with only 2000 sentences, but you’ll need many more sentences to produce a high quality system. Fortunately, annotating sentences with Prodigy doesn’t take very long.
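One practical detail for those ner.batch-train experiments: hold out part of the dataset for evaluation, so the accuracy numbers are comparable between runs. A minimal sketch of such a split (the 20% ratio and fixed seed are just common choices, not Prodigy defaults):

```python
import random

def train_eval_split(examples, eval_ratio=0.2, seed=0):
    """Shuffle once with a fixed seed, then hold out eval_ratio of the
    examples so every experiment is scored on the same evaluation set."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_ratio)
    return examples[n_eval:], examples[:n_eval]

examples = [{"text": f"phrase {i}"} for i in range(10)]
train, dev = train_eval_split(examples)
```

Keeping the seed fixed matters: if the evaluation set changes between runs, you can't tell whether an accuracy change came from your data or from the split.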
If you can produce a large dataset with correct annotations for all 5 entities you’re interested in tagging, you’ll then be able to train any entity recognition model on the data — using spaCy, or any other existing system.