Improve a NER on multiple labels


I would like to improve the ‘ORG’, ‘PRODUCT’ and ‘PERSON’ labels for the French model.
The provided model (fr_core_news_sm) only has ‘ORG’ and ‘PER’, and no PRODUCT.
So I need to create the PRODUCT label and improve the others.

I went through the “Improve a NER” example and several tutorials.

I began by creating three JSONL pattern files (one for each label) from lists of seed words.
After that, I concatenated the files into a single patterns file using

cat *.jsonl > all.jsonl
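
For reference, here is a minimal Python sketch of how the pattern entries could be generated from the seed lists and written directly into one file. The seed words are hypothetical, and splitting multi-word seeds on whitespace is an assumption that may not match the model's tokenization. One thing to watch with cat: the result is only valid JSONL if every input file ends with a newline, otherwise the last line of one file is fused with the first line of the next.

```python
import json


def seed_patterns(label, seeds):
    """Build one token-based match pattern per seed term."""
    patterns = []
    for seed in seeds:
        # Assumption: whitespace splitting approximates the tokenizer.
        tokens = [{"lower": word.lower()} for word in seed.split()]
        if tokens:  # skip empty or whitespace-only seeds
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def write_jsonl(path, records):
    """Write one JSON object per line, each line newline-terminated."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical seed lists, one per label.
    seeds_by_label = {
        "ORG": ["Renault", "Société Générale"],
        "PRODUCT": ["Clio"],
        "PERSON": ["Marie Curie"],
    }
    all_patterns = []
    for label, seeds in seeds_by_label.items():
        all_patterns.extend(seed_patterns(label, seeds))
    for pattern in all_patterns:
        print(json.dumps(pattern, ensure_ascii=False))
```

Calling write_jsonl("all.jsonl", all_patterns) would then produce the combined file in one step.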

I created a global dataset using:

prodigy dataset my_great_fr_ds "My dataset"

After that, I extracted the sentences that contain the seed words into a trainingset.jsonl file.
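
The extraction step could look roughly like the sketch below. It assumes the sentences are already split and that a case-insensitive whole-word match on the seed terms is good enough. The sentences and seeds are made up for illustration, and each kept sentence is stored as a {"text": ...} record, one per line.

```python
import json
import re


def build_seed_regex(seeds):
    """Compile a case-insensitive whole-word regex matching any seed term."""
    cleaned = [s.strip() for s in seeds if s.strip()]
    if not cleaned:
        return None
    # Longest first, so multi-word seeds win over their own prefixes.
    escaped = sorted((re.escape(s) for s in cleaned), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def select_sentences(sentences, seeds):
    """Return {"text": ...} records for sentences mentioning at least one seed."""
    pattern = build_seed_regex(seeds)
    if pattern is None:
        return []
    return [{"text": s} for s in sentences if pattern.search(s)]


if __name__ == "__main__":
    # Hypothetical input data.
    sentences = [
        "Renault présente la nouvelle Clio.",
        "Il fait beau aujourd'hui.",
    ]
    for record in select_sentences(sentences, ["renault", "clio"]):
        print(json.dumps(record, ensure_ascii=False))
```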

What I plan to do:

Teach the model using:

prodigy ner.teach my_great_fr_ds fr_core_news_sm trainingset.jsonl --patterns all.jsonl

After that, train the model using:

prodigy ner.batch-train my_great_fr_ds fr_core_news_sm --output my_model --label ORG,PRODUCT,PERSON --eval-split 0.2 --n-iter 10 --batch-size 8

Then continue the teach phase and retrain several times to improve the model.

Is this the best way to proceed? Do I need to split the teaching part into 3 separate datasets (one for ORG, one for PRODUCT and one for PERSON)? If yes, how do I merge them for the final training?

I already trained on one label (ORG) but I’m afraid that if I don’t do PRODUCT and PERSON too before the training, the model will not be good for those 3 labels.

Do I need to teach in several batches (e.g. 1000 sentences per batch), retrain the model after each batch and compare whether it improves?

Thanks for your help

You don't have to – you could also set --label PERSON,ORG,PRODUCT. However, we usually recommend working on the labels separately. It makes it easier to focus, since you only ever have to think about one concept at a time. It also makes it easier to run separate experiments per label, and start over if you need to. For example, maybe you realise at some point that you've been annotating PERSON slightly wrong. If you have those annotations in a different dataset, you can discard the old ones and start again. Or maybe you find out that the accuracy suddenly gets really bad once you add PRODUCT annotations. This could indicate that there's a problem with those examples – but if you have everything in one set, getting to that conclusion would be much harder. Merging several datasets into one is relatively easy – splitting one large dataset is a much more frustrating task.

Frequently training your model and comparing the results is always good, yes! You don't want to waste weeks labelling data, only to find out that your model isn't actually learning anything from it.

For each training experiment, you usually want to start with the same base model and then update with all the annotations – for example, fr_core_news_sm and your entire training set. (If you keep updating the model you've previously trained with new data constantly, you might introduce other side-effects of the frequent updating and potentially end up with confusing results.)


Thanks a lot Ines for this answer

So if I understood correctly, I need to create 3 datasets (i.e. ds_org, ds_product and ds_person).

Then I do the teaching on those 3 datasets using:

prodigy ner.teach ds_org fr_core_news_sm trainingset_org.json --label ORG
prodigy ner.teach ds_product fr_core_news_sm trainingset_prod.json --label PRODUCT
prodigy ner.teach ds_person fr_core_news_sm trainingset_person.json --label PERSON

After that, do I need to merge my datasets into one using three db-out exports and one db-in before launching ner.batch-train, or can I pass the three datasets directly?

And after the training, if I want to improve my model with new examples, I relaunch ner.teach on the same datasets using new data (e.g. the following line), then redo the merge and the ‘final’ training on fr_core_news_sm, right?

prodigy ner.teach ds_org fr_core_news_sm anothertrainingset_org.json --label ORG

Yes, that sounds like a good workflow. And if possible, you might want to keep the merged files somewhere for reference, so you can always go back to a previous state.

You can also automate the merging process in Python btw – I've posted a little snippet here:
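
Separately from that snippet, here is a minimal file-level sketch of the merge. It assumes each dataset has already been exported to JSONL with db-out. It also assumes the records carry a "_task_hash" field, and that an exact repeat of the same question can be dropped. The example records are made up. Records without a hash are kept as they are.

```python
import json


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def merge_examples(*datasets):
    """Concatenate example lists, keeping the first record per "_task_hash"."""
    merged, seen = [], set()
    for examples in datasets:
        for eg in examples:
            key = eg.get("_task_hash")
            if key is not None:
                if key in seen:
                    continue  # same question already present, skip the repeat
                seen.add(key)
            merged.append(eg)
    return merged


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical in-memory records standing in for three exported files.
    ds_org = [{"text": "Renault recrute.", "_task_hash": 1, "answer": "accept"}]
    ds_product = [{"text": "La Clio arrive.", "_task_hash": 2, "answer": "accept"}]
    ds_person = [{"text": "Renault recrute.", "_task_hash": 1, "answer": "accept"}]
    print(len(merge_examples(ds_org, ds_product, ds_person)))
```

The merged list can be written out with write_jsonl, kept as a dated file for reference, and then loaded into a fresh dataset with db-in before training from the base model.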
