Hello
I would like to improve the ‘ORG, PRODUCT, PERSON’ labels for the French model.
Of these, the provided model (fr_core_news_sm) only has ‘ORG’ and ‘PER’; there is no PRODUCT.
So I need to create the PRODUCT label and improve the others.
I went through the “Improve a NER” example and several tutorials.
I began by creating three JSONL pattern files (one for each label) from lists of seed words.
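For reference, here is a minimal sketch of how such pattern entries could be generated from seed lists. It assumes the token-based patterns format with `label` and `pattern` keys and case-insensitive `lower` matching, and it splits multi-word seeds on whitespace, which may not match the model's tokenization exactly (for example around apostrophes or hyphens). The seed terms, the helper names `seeds_to_patterns` and `write_jsonl`, and writing straight to all.jsonl are only illustrative.

```python
import json


def seeds_to_patterns(label, seeds):
    """Turn seed terms into token patterns for one label."""
    patterns = []
    for seed in seeds:
        # One token dict per whitespace-separated word (an assumption:
        # the real tokenizer may split some terms differently).
        tokens = [{"lower": word.lower()} for word in seed.split()]
        if tokens:  # skip empty or whitespace-only seeds
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line, keeping accented characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical seed lists, one per label.
    seeds = {
        "ORG": ["Renault", "Société Générale"],
        "PRODUCT": ["iPhone", "Clio"],
        "PERSON": ["Emmanuel Macron"],
    }
    all_patterns = []
    for label, terms in seeds.items():
        all_patterns.extend(seeds_to_patterns(label, terms))
    write_jsonl("all.jsonl", all_patterns)
```

Writing every label into a single list like this would give the combined file directly, without a separate concatenation step.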
After that, I concatenated the files into a single patterns file using:
cat *.jsonl > all.jsonl
I created a global dataset using:
prodigy dataset my_great_fr_ds "My dataset"
After that, I extracted the sentences that contain the seed words into a trainingset.jsonl file.
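The extraction step can be sketched roughly as follows, assuming the text is already split into sentences and that each output record is a JSON object with a `text` key. The function names and the whole-word, case-insensitive matching are illustrative choices, not necessarily what I ran; note that `\b` will not behave well for seeds that start or end with punctuation.

```python
import json
import re


def build_seed_regex(seeds):
    """Compile one case-insensitive, whole-word regex for all seed terms."""
    # Longest seeds first so multi-word terms are tried before shorter ones.
    escaped = sorted((re.escape(s) for s in seeds), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def select_sentences(sentences, seeds):
    """Keep only the sentences that mention at least one seed term."""
    if not seeds:
        return []
    pattern = build_seed_regex(seeds)
    return [{"text": s} for s in sentences if pattern.search(s)]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical input sentences and seed terms.
    sentences = [
        "J'ai acheté un iPhone hier.",
        "Il fait beau.",
        "Renault annonce une nouvelle Clio.",
    ]
    tasks = select_sentences(sentences, ["iPhone", "Clio", "Renault"])
    write_jsonl("trainingset.jsonl", tasks)
```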
What I plan to do:
Teach the model using:
prodigy ner.teach my_great_fr_ds fr_core_news_sm trainingset.jsonl --patterns all.jsonl
After that, train the model using:
prodigy ner.batch-train my_great_fr_ds fr_core_news_sm --output my_model --label "ORG,PRODUCT,PERSON" --eval-split 0.2 --n-iter 10 --batch-size 8
Then continue the teach phase and retrain several times to improve the model.
Is this the best way to proceed? Do I need to split the teaching part into 3 separate datasets (one for ORG, one for PRODUCT and one for PERSON)? If yes, how do I merge them for the final training?
I already trained on one label (ORG) but I’m afraid that if I don’t do PRODUCT and PERSON too before the training, the model will not be good for those 3 labels.
Do I need to teach in several batches (e.g. 1000 sentences per batch), retrain the model after each batch, and compare whether it improves?
Thanks for your help