I am trying to teach a new entity type, and I have to say the "catastrophic forgetting" problem is really a big problem for me.
I have 2000 phrases in a txt file.
I have another file with 83 materials (my new entity type).
I am working with the latest versions of Prodigy (1.4.1) and spaCy (2.0.10).
The Prodigy tool is clear to me now, but I don't really know how to use it to avoid this behavior.
I found a lot of posts about this problem, but they don't help me as much as I would like.
In particular this one: New entity model ruins other entities
I have read it again and again, but with no result.
This blog post we’ve published has some more background on this, including strategies to prevent it. One approach is to mix in examples that the model previously got right and train on both those examples and the new examples.
This is pretty easy to do in Prodigy: after collecting annotations for your new TECH entity, run the model on the same input text, and annotate the other labels. You can add all annotations to the same dataset, and then train your model with those examples. Make sure to always use input data that's similar to what the model will have to process at runtime. This might also give you a little boost in accuracy over the standard English model, because you're also improving the existing entity types on your specific data.
May I ask you to be more precise about the process we can follow to avoid that?
I will try to give you more details (my English is not as good as yours).
I have been trying a lot of things for four months to avoid this behavior.
First step: my latest tests consisted of annotating a revision data file with the default NER entities (PER, LOC, MISC, ORG) into a new dataset with the ner.train recipe. My revision data file contains almost 200 simple phrases, and the entities are nicely detected.
If I understand correctly, this step should be enough to avoid the "forgetting". So I did maybe 300 annotations.
Second step: I added annotations to the same dataset with a new label "MATERIAL", applied to a text file containing 2000 phrases, again with ner.train. The loaded model is still the same: the original model (fr_core_news_sm).
Third step: I tried the ner.make-gold recipe to correct bad annotations (with --label "MISC,ORG,LOC,PER,MATERIAL" => is this a good approach, or should I correct the labels one by one?).
Next step: ner.batch-train. The best accuracy is around 0.703, but numerous words are tagged when they should not be.
So I am trying to optimize my data file, but I don't really think that is a good solution, and honestly, I don't know what I am doing wrong.
I hope this description helps you and that my English is understandable.
So, can no one help me?
Is my question not conventional?
Do you need more information?
Are you making a video to help people with this chronic forgetting problem?
An answer would be appreciated, because at the moment I am completely blocked. I don't know how to get past this behavior.
All of my tests have failed.
Sorry for the delay getting back to you, and for the lack of clarity on this. Also, happy Easter!
The truth is that precise "just follow these steps" instructions simply don't exist for training new statistical models on new datasets. One reason for this is that every problem is different. Some entity recognition problems are very easy. It's also possible to have annotations which the model will be completely unable to learn (possibly even in principle).
This means there's no way to give clear guidance on how many examples you might need, or what might be wrong with your current data, or what you might need to do next. The only way to give that level of guidance would be to download your data and start working on your problem; which is a level of support we're not currently able to offer.
The best I can do is make a few guesses based on what you've said. I can also offer a few general observations. Some of these things are also a matter of opinion --- it's possible a different expert would disagree.
83 examples isn't very many. For a sense of scale, the en_core_web_sm model achieves 86% accuracy after being trained on around one million words annotated with entity types. There are 21,104 person mentions in that dataset, and yet if you look at the results of the model in the web demo (via https://demos.explosion.ai ), you'll see it still makes many errors on the PERSON category, even though spaCy's entity recognizer is close to the current state of the art. I'm not saying you necessarily need thousands of annotations. But that's how many are needed to give that level of performance on English, for that particular entity type.
Maybe your problem is easy to learn, and it can be learned with only twenty or thirty examples. Or maybe the problem is defined such that the model won't become accurate even with millions of examples. It's definitely possible that more annotations will help, though.
When you say "phrases in a text file", do you mean that the file has only the phrases? The entity recogniser really assumes you're tagging phrases in context. Otherwise it's better to build a terminology list with terms.teach, and use the pattern matcher.
Out of interest, how long did it take you to make the 300 annotations?
With Prodigy I usually find the annotation to be very quick, even just using the manual mode. It depends on the entity density, but as a quick calculation: If each text is one to two sentences, I would expect each text to take less than 20-30 seconds to annotate, which means you would get 150 texts per hour, and around 1,000 per day.
If you use the matcher or a pre-trained model to pre-set the annotations with ner.make-gold, it's often even faster. Finally, once you have a sufficiently accurate model (or pattern file), the ner.teach recipe can be even faster still. But while you have only a few annotations, the ner.manual mode is a good way to get started.
The entity recognition in the fr_core_news_sm model is based on "silver standard" data from Wikipedia. This may perform very poorly on your task: Wikipedia itself is quite unlike other text types, and the entity mentions are skewed by the Wikipedia editorial standards. So, the training data for the initial model may be a poor start for what you're doing. It might be better to start from a blank model with vectors trained on your data. Possibly.
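If you want to try that, a minimal sketch in spaCy 2.0 might look like the following. The output directory and the label set are just examples from this thread, and training your own vectors is a separate step not shown here:

```python
# Minimal sketch: start from a blank French pipeline instead of
# fr_core_news_sm (spaCy 2.0 API). Output path and labels are examples.
import spacy

nlp = spacy.blank("fr")              # empty pipeline, no components yet
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ("PER", "LOC", "ORG", "MISC", "MATERIAL"):
    ner.add_label(label)

nlp.begin_training()                 # initialise the weights
nlp.to_disk("blank_fr_materials")    # a path you can load as the model in Prodigy
```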
For only a few hundred annotations, 70% accuracy actually isn't so bad!
A final possibility: even if you do everything right, sometimes the model may still fail to achieve useful accuracy. We refer to the process of training and evaluating a model as an "experiment" because we don't know the result ahead of time. This is one of the reasons we designed Prodigy with an emphasis on rapid iteration: because some ideas simply don't work.
First, thanks a lot for your answer.
I understand each case is different.
I thought my case was very close to yours, so I will try to add more details about it.
Maybe it is just a problem of vocabulary or something else, and it is easier to solve than expected.
I will start with the basics: data and vocabulary.
ner.teach is not needed here because I already have this file?
That's what I understood.
2 - Here is my file with my revision data (I hope my vocabulary is right here).
It is just a file with simple text phrases, like:
Apple cherche a acheter une startup anglaise pour 1 milliard de dollard.
San Francisco envisage d'interdire les robots coursiers.
Londres est une grande ville du Royaume-Uni.
L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe.
La France ne devrait pas manquer d'électricité cet été,même en cas de canicule.
Nouvelles attaques de Donald Trump contre le maire de Londres.
Qui est le président de la France ?
As you can see, the phrases are very simple, and the results of the NER engine are good.
This is my data to avoid the "catastrophic forgetting" problem.
Is that right too?
3 - Here is my file for learning to identify my materials (almost 2000 phrases),
extracted from wikipedia.fr:
Les Romains sont les plus anciens utilisateurs de béton connus à ce jour.
En 1849, le mariage de deux matériaux très utiles, l’acier et le béton, a donné lieu au béton armé.
Le béton est avant tout utilisé pour la construction résidentielle en Amérique du Nord, plus particulièrement pour les fondations, qui soutiennent le reste de la structure.
Les bois, la chaux, les sables, les mortiers, les pièces de fer, le plomb, les verres, les terres cuites architecturales, etc. sont de véritables « archives monumentales » que les sciences dites dures (géologie, sciences de la nature, physique, chimie …) et les analyses techniques, notamment celle des traces de production et de montage, permettent de décrypter.
Am I making mistakes up to this point?
4 - So, if I don't need the terms.teach recipe, I can start with ner.teach to add annotations through the Prodigy GUI, and ner.make-gold to correct them:
If my vocabulary is right, the annotations end up in a JSONL file, right?
So, do you think I need more materials in my file? All of the existing materials?
Yes. It's my "learning" file.
I annotate my materials with this file, and I correct mistakes with ner.make-gold. Is that correct?
10-15 minutes max.
That's what I do, or else I really don't understand the system.
300 annotations for 200 phrases, which means some phrases contain 2 or more different entities.
Do you mean I have to invent 2000 sentences myself because the ones from Wikipedia are not suitable?
That would be really strange, no? But not impossible.
It is even a good result; the big problem is the other words. My materials are well detected, but the other words make no sense.
All of the words except my materials get detected as person, loc, etc. That's the problem at the moment (and has been for 5 months).
To conclude, I tried to avoid the forgetting problem by adding some text with good tags to my dataset before running ner.teach on my "learning file" into the same dataset, and it does not work.
My new model applies the wrong entities to the wrong words: some verbs become a person or an organisation. It makes no sense. I think it's because the model has forgotten everything,
while the process looks very simple in your videos.
What am I doing wrong?
EDIT: In fact, I am trying to use the pseudo-rehearsal described here: Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP · Explosion.
So if I understand correctly, I have to add some text whose entities are well detected by the base model (fr), before adding some text with its entities added manually (or semi-manually with Prodigy).
Is this the right process?
EDIT 2: Today, my data gives me 0.86 accuracy. That is a very good result for my tag, but the problem is still the same.
All the other words are annotated terribly.
Thanks, I think there's been some terminology confusion. In particular it's great to be clear that the materials list is a patterns file. 83 examples in that should be fine.
That's not correct -- ner.teach is for training the model in context. The patterns file just finds all examples of these phrases. It makes a good starting point for suggesting examples to correct in ner.teach.
I think we have a difficulty communicating about these things, that's making the task much harder. Let's settle on this vocabulary:
Entity type: The category of thing you want to tag
Entity mention: An example of an entity type, in context. E.g. "Apple and Amazon are companies" contains two entity mentions.
Entity term: A phrase that is often an entity mention, depending on context. E.g. "Apple" is often a mention of a company, but not always.
Patterns file: A list of match rules, to help you find entity mentions. Can be built from a list of entity terms using the terms.to-patterns recipe.
Entity recognition: The task of identifying entity mentions in text.
NER model: A function that tags entity mentions in text.
NER annotation: The task of labelling entity mentions in text.
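To make the difference between a term and a mention concrete, here's a tiny illustration using the example above (made-up data, with Prodigy-style character offsets):

```python
# Illustration only: an entity *term* as a pattern entry, vs. an entity
# *mention* as an annotated span in context (character offsets).
term_pattern = {"label": "ORG", "pattern": [{"lower": "apple"}]}

annotated_mention = {
    "text": "Apple and Amazon are companies",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},    # "Apple"
        {"start": 10, "end": 16, "label": "ORG"},  # "Amazon"
    ],
}
```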
You start out with an NER model that identifies four entity types: PER, LOC, ORG, MISC. You want an NER model that can identify five entity types: PER, LOC, ORG, MISC, MATERIAL. The model needs examples of all 5 entity types. Prodigy supports multiple ways of annotating those examples:
You can simply feed text through ner.manual
You can feed text through ner.make-gold, and correct previous predictions
You can say yes or no to individual predictions using ner.teach
Previous predictions can be added to text using a patterns file, a statistical model, or some mix of the two.
If the previous model for PER, LOC, ORG and MISC is good enough, you might be able to assume all its annotations are correct, and just add them all to the training data. If it's not so good, you'll want to correct them, probably with the ner.make-gold recipe.
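For the first case, a rough sketch of generating those annotations with the existing model might look like this. The file names are placeholders; the output is in the same span format Prodigy uses, so you could import it into your dataset (e.g. with the db-in command):

```python
# Rough sketch: use the existing model's own predictions as "revision"
# examples, so the old entity types aren't forgotten. File names are examples.
import json
import spacy

nlp = spacy.load("fr_core_news_sm")   # the model whose knowledge you want to keep

with open("revision_texts.txt", encoding="utf8") as f_in, \
        open("revision_annotations.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        text = line.strip()
        if not text:
            continue
        doc = nlp(text)
        spans = [{"start": ent.start_char, "end": ent.end_char,
                  "label": ent.label_} for ent in doc.ents]
        f_out.write(json.dumps({"text": text, "spans": spans,
                                "answer": "accept"}, ensure_ascii=False) + "\n")
```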
Once you have a dataset with examples of all of your entity types, you can use ner.batch-train to train your model. You should be able to get some initial results with only 2000 sentences, but you'll need many more sentences to produce a high quality system. Fortunately, annotating sentences with Prodigy doesn't take very long.
If you can produce a large dataset with correct annotations for all 5 entities you're interested in tagging, you'll then be able to train any entity recognition model on the data --- using spaCy, or any other existing system.
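For reference, a bare-bones spaCy 2.0 training loop over such data might look like the sketch below. The two training examples are made up, the hyperparameters are placeholders, and in practice you would mix your revision examples in with the MATERIAL examples (the pseudo-rehearsal idea again) and use far more data:

```python
# Bare-bones sketch of updating the French model's NER on gold annotations
# covering all five labels. TRAIN_DATA here is made up; a real run needs
# many more examples, including revision examples for PER/LOC/ORG/MISC.
import random
import spacy

TRAIN_DATA = [
    ("Le béton est utilisé pour les fondations.",
     {"entities": [(3, 8, "MATERIAL")]}),
    ("Apple cherche a acheter une startup anglaise.",
     {"entities": [(0, 5, "ORG")]}),
]

nlp = spacy.load("fr_core_news_sm")
ner = nlp.get_pipe("ner")
ner.add_label("MATERIAL")

# Create an optimizer without re-initialising the pre-trained weights.
optimizer = ner.create_optimizer()

# Only update the NER component; leave the tagger and parser untouched.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer,
                       drop=0.35, losses=losses)
        print(epoch, losses)

nlp.to_disk("fr_materials_model")
```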
Thanks a lot for your answer. I will look into this and give you some feedback as soon as possible.
Thanks for taking the time to help me.