Improving on spaCy's existing NER entities

Hi all, @honnibal, @ines,

I am currently trying to train an NER model to recognise a "PERSON" name entity. I understand that the current spaCy language model (e.g. en_core_web_lg) already has a trained "PERSON" entity.

However, the pre-trained NER alone is not good enough. Because of how my notes are structured, it misses certain Asian names and sometimes tags irrelevant keywords as names, among other errors.

I would like to improve the existing model, and my approach is as follows. Please correct me if I am going in the wrong direction.

  1. I created an EntityRuler to detect certain full-name keywords and inserted it into the pipeline before the 'ner' component

  2. I exported the model, and ran the ner.make-gold recipe using the exported model as the baseline
    a) Does the statistical model pick up anything from the EntityRuler?

  3. After correcting some wrong tags made by the model, I saved my annotations and batch-trained on them.
    a) Should I train my annotations with a fresh blank model? Or should I train it on the existing exported model?

Afterwards, I realised that the trained model still tags some entities wrongly, so I decided to run the ner.make-gold recipe again. However, when I run the recipe with the trained model, I have to start annotating from the beginning of my dataset again. Why is that happening? This does not happen when I go back to the exported model (from step 2): there, the feed continues from where it stopped.

Kindly advise. Thanks!


I think your approach sounds pretty good. To answer your question about the EntityRuler, once the annotations are in the dataset after ner.make-gold, it won't matter whether they were initially predicted by the model or the ruler --- they'll still be in there, and the model will learn from them.
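
To make that concrete, here is a minimal sketch of how exported annotations could be turned into training examples. It assumes each exported record is a JSON line with "text", "spans" (each with "start", "end", "label") and "answer" keys; the "source" key on the spans and the sample sentences are hypothetical, added only to show that the origin of a span is not used anywhere in the conversion.

```python
import json


def prodigy_to_spacy(jsonl_lines):
    """Convert annotation records into (text, {"entities": [...]}) tuples.

    Assumes one JSON object per line with "text", "spans" and "answer" keys.
    """
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only accepted examples; this sketch skips everything else.
        if record.get("answer") != "accept":
            continue
        # Only the offsets and the label are kept. Whatever produced the
        # span (ruler or statistical model) plays no role from here on.
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples


# Hypothetical records: "source" is an illustrative field, not a required one.
sample = [
    '{"text": "Tan Wei Ming saw Dr Lee.", "spans": ['
    '{"start": 0, "end": 12, "label": "PERSON", "source": "entity_ruler"}, '
    '{"start": 20, "end": 23, "label": "PERSON", "source": "en_core_web_lg"}], '
    '"answer": "accept"}',
    '{"text": "Follow up in May.", "spans": [], "answer": "reject"}',
]
train_data = prodigy_to_spacy(sample)
```

Both spans end up as plain (start, end, label) tuples in `train_data`, so a model trained on them sees the ruler's match and the model's own prediction in exactly the same way.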

To answer your second question, you can try either: usually I recommend training from a blank model, but in your case, since you're using an existing entity type, maybe resuming training will work for you --- give it a try and see.
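
If you do try both, it helps to score the two models on the same held-out examples. Below is a rough sketch of an exact-match scorer in plain Python; the function name `span_prf`, the tuple layout `(doc_id, start, end, label)` and the toy spans are all assumptions for illustration, not measurements and not part of any library.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over entity spans.

    Each span is a (doc_id, start, end, label) tuple; a prediction only
    counts as correct if all four fields match a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up spans purely to show the call; they say nothing about
# which training strategy actually works better on real data.
gold = {(0, 0, 12, "PERSON"), (0, 20, 23, "PERSON"), (1, 5, 9, "PERSON")}
model_a = {(0, 0, 12, "PERSON"), (1, 5, 9, "PERSON")}
model_b = gold | {(1, 15, 20, "PERSON")}

scores_a = span_prf(gold, model_a)
scores_b = span_prf(gold, model_b)
```

The important part is that `gold` is the same set for both runs, so the blank-model and resumed-model scores are directly comparable.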

Finally, I think @ines's answer in this thread will explain the situation with the feed starting from the beginning: "Duplicates in ner.manual"
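
In the meantime, if you just want to avoid seeing texts you have already annotated, one possible workaround is to filter the input file yourself before starting the recipe. This is only a sketch: it assumes both the source file and the exported annotations are JSONL with a "text" key, and `skip_annotated` is a hypothetical helper, not a Prodigy function.

```python
import json


def skip_annotated(source_lines, annotated_lines):
    """Yield source lines whose "text" is not in the annotated export."""
    seen = {
        json.loads(line)["text"]
        for line in annotated_lines
        if line.strip()
    }
    for line in source_lines:
        if not line.strip():
            continue
        # Compare on the raw text only, so differing model suggestions
        # for the same text do not matter here.
        if json.loads(line)["text"] not in seen:
            yield line


# Hypothetical in-memory data standing in for the two files.
source = ['{"text": "A"}', '{"text": "B"}', '{"text": "C"}']
annotated = ['{"text": "A", "answer": "accept"}']
remaining = list(skip_annotated(source, annotated))
```

You could then write `remaining` to a new file and use that file as the source for the next annotation session.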