Hi Matt, Ines,
I've just started using Prodigy, and it seems to be working well so far.
I'm looking at analysing English translations of Arabic texts and need to improve NER detection on many Arabic terms. For example, 'Islam' is often categorised as 'ORG' when it should be 'NORP', and many of the names tagged as 'PERSON' need improvement.
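For reference, this is the kind of quick spot-check I've been doing with the base model (the sentence below is just an illustrative example, not from my dataset):

```python
import spacy

# Load the base model and print the entities it predicts for a sample sentence
nlp = spacy.load("en_core_web_md")
doc = nlp("Islam spread rapidly through the Arabian Peninsula.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```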
When there are numerous NER categories to be taught, do you recommend working on a small number at a time with multiple passes over the dataset, or annotating all of them together?
So far, I've chosen to annotate them all together, with the following workflow:
The NER categories that require improvement are 'PERSON, NORP, FAC, ORG, GPE, PRODUCT, EVENT, WORK_OF_ART, LAW'. The dataset contains 4104 sentences.
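For context, full_text.jsonl is just one sentence per record in the {"text": ...} shape Prodigy expects. A rough sketch of how I build it (the sentences here are placeholders):

```python
import json

# Write one {"text": ...} JSONL record per sentence for Prodigy to load
sentences = ["First example sentence.", "Second example sentence."]
with open("full_text.jsonl", "w", encoding="utf-8") as f:
    for sent in sentences:
        f.write(json.dumps({"text": sent}) + "\n")
```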
The recipe I've been using is as follows:
- prodigy ner.manual text_terms en_core_web_md full_text.jsonl --label PERSON,NORP,FAC,ORG,GPE,PRODUCT,EVENT,WORK_OF_ART,LAW
- prodigy ner.teach text_terms en_core_web_md full_text.jsonl --label PERSON,NORP,FAC,ORG,GPE,PRODUCT,EVENT,WORK_OF_ART,LAW
- prodigy ner.batch-train text_terms en_core_web_md --output C:/Users/Steve/.prodigy/text_term_model/ --eval-split 0.8 --label PERSON,NORP,FAC,ORG,GPE,LOC,PRODUCT,EVENT,WORK_OF_ART,LAW
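After batch-train I load the exported model back in and eyeball whether problem cases like 'Islam' now come out as NORP (again, the test sentence is just an example):

```python
import spacy

# Load the model directory exported by ner.batch-train and re-check a problem case
nlp = spacy.load("C:/Users/Steve/.prodigy/text_term_model/")
doc = nlp("The text discusses the early history of Islam.")
print([(ent.text, ent.label_) for ent in doc.ents])
```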
This workflow gives the following results:
Correct 1557
Incorrect 403
Baseline 0.649
Accuracy 0.794
Thank you,
Steve