Improving spaCy's existing NER entities

Hi Matt, Ines,

I've just started using Prodigy, and it seems to be working well so far.

I'm looking at analysing English translations of Arabic texts and need to improve NER detection on many Arabic terms. For example, 'Islam' is often categorised as an 'ORG' when it should be 'NORP', and many of the names associated with 'PERSON' require improvement.

When there are numerous NER categories to be taught, do you recommend annotating a small number of them over numerous passes of the dataset, or all of them together?

So far, I've chosen to annotate them all together, using the following workflow:

The NER categories that require improvement are PERSON, NORP, FAC, ORG, GPE, PRODUCT, EVENT, WORK_OF_ART and LAW. The dataset contains 4,104 sentences.

The recipe I've been using is as follows:

  • prodigy ner.manual text_terms en_core_web_md full_text.jsonl --label "PERSON,NORP,FAC,ORG,GPE,PRODUCT,EVENT,WORK_OF_ART,LAW"

  • prodigy ner.teach text_terms en_core_web_md full_text.jsonl --label "PERSON,NORP,FAC,ORG,GPE,PRODUCT,EVENT,WORK_OF_ART,LAW"

  • prodigy ner.batch-train text_terms en_core_web_md --output C:/Users/Steve/.prodigy/text_term_model/ --eval-split 0.8 --label "PERSON,NORP,FAC,ORG,GPE,LOC,PRODUCT,EVENT,WORK_OF_ART,LAW"

This workflow gives the following results:
Correct 1557
Incorrect 403
Baseline 0.649
Accuracy 0.794
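(For what it's worth, the accuracy figure printed by ner.batch-train is just correct decisions over total decisions, which checks out against the numbers above:)

```shell
# Sanity check: accuracy = correct / (correct + incorrect)
awk 'BEGIN { printf "%.3f\n", 1557 / (1557 + 403) }'   # prints 0.794
```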

Thank you,

Steve

Hi Steve,

The ner.teach recipe works well to quickly improve a few categories, especially if you're not too worried about accuracy on the other categories. It's a bit of a tricky approach, though, because the model finds it hard to learn from the binary feedback. So sometimes it works well, but other times it struggles a bit, and it's hard to combine with full annotations.

If you need all of the categories, you might consider the ner.make-gold recipe. This lets you correct the model's output, so I think this might be the one you want. Remember to add the --no-missing flag to the ner.batch-train command as well, to tell the model that all of the information is there.
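As a rough sketch, reusing the paths and model from your example (I'm assuming a fresh dataset name, text_terms_gold, so the corrected annotations don't mix with the binary ner.teach ones), the workflow might look like:

```shell
# Correct the model's predictions so every entity in each example is annotated
prodigy ner.make-gold text_terms_gold en_core_web_md full_text.jsonl --label "PERSON,NORP,FAC,ORG,GPE,PRODUCT,EVENT,WORK_OF_ART,LAW"

# Train with --no-missing, so unannotated tokens are treated as "not an entity"
# rather than "unknown"
prodigy ner.batch-train text_terms_gold en_core_web_md --output C:/Users/Steve/.prodigy/text_term_model/ --no-missing --label "PERSON,NORP,FAC,ORG,GPE,PRODUCT,EVENT,WORK_OF_ART,LAW"
```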

That's really useful, thank you Matt. I'll let you know how I get on!