I am currently trying to train an NER model to recognise the "PERSON" entity. I understand that the current spaCy language models (e.g. en_core_web_lg) already come with a trained "PERSON" entity. However, simply using the pre-trained NER is not good enough: because of how my notes are structured, it may miss certain Asian names, erroneously tag irrelevant keywords, etc.
I would like to improve the existing model, and my approach is as follows. Please correct me if I am going in the wrong direction.
1. I created an EntityRuler to detect certain full-name keywords and inserted it into the pipeline before the 'ner' component.
2. I exported the model and ran the ner.make-gold recipe using the exported model as the baseline.
   a) Does the statistical model pick up anything from the EntityRuler?
3. After correcting some wrong tags made by the model, I saved my annotations and batch-trained on them.
   a) Should I train my annotations on a fresh blank model, or on the existing exported model?
4. Afterwards, I realised some annotations were still tagged wrongly by the trained model, so I decided to use the ner.make-gold recipe again. However, when I run the recipe against the trained model, I have to start annotating from the beginning of my dataset. Why is that happening? This is not the case when I use the exported model from step 2, which continues from where it stopped.
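For context, here is roughly what I did for step 1. The name patterns below are made-up placeholders (my real keyword list is longer), and I'm showing the spaCy v3 `entity_ruler` string API; under spaCy v2 the equivalent is constructing `EntityRuler(nlp)` and calling `nlp.add_pipe(ruler, before="ner")`:

```python
import spacy

# In my real pipeline I load the pretrained model and insert the ruler
# before the statistical NER component:
#     nlp = spacy.load("en_core_web_lg")
#     ruler = nlp.add_pipe("entity_ruler", before="ner")
# A blank pipeline is used here just to keep the sketch self-contained.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Placeholder full-name patterns (phrase patterns matched on token text)
patterns = [
    {"label": "PERSON", "pattern": "Tan Wei Ming"},
    {"label": "PERSON", "pattern": "Nguyen Thi Lan"},
]
ruler.add_patterns(patterns)

doc = nlp("Follow-up call with Tan Wei Ming next week.")
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Tan Wei Ming', 'PERSON')]

# Then export the combined pipeline so Prodigy can load it as the baseline:
#     nlp.to_disk("./model_with_ruler")
```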
Kindly advise. Thanks!