Hi @tomw!
Likely a base model will help by providing word vectors, but you'll want to turn off the ner component since your ner model will be trained from scratch. If you don't turn off the ner component, you will be adding your new entity to the existing ner entities.
This post shows how to do it (fyi some syntax has changed but the general idea is the same):
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("ner")  # can remove other unused components too
nlp.to_disk("en_core_web_sm_without_ner")
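If you want to double-check the saved pipeline, a quick sketch like this (assuming the path above) should show that ner is no longer in the component list:

import spacy

# load the saved pipeline and inspect its remaining components
nlp = spacy.load("en_core_web_sm_without_ner")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']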
This also solves your second question because by using to_disk(), you can now call this model like any other model for train or ner.correct:
prodigy ner.correct gold_ner en_core_web_sm_without_ner ./news_headlines.jsonl --label PERSON,ORG
As @koaning mentioned, you do have a choice of base model (word vectors), where speed vs. accuracy may come into play.
If you use en_core_web_sm, this will be the fastest and most compact option, but it will have lower performance. Alternatively, en_core_web_trf will give you the greatest accuracy, but you should be cautious because putting transformers into production can be challenging (e.g., it needs a GPU).
One compromise could be to use the en_core_web_lg model like in the post above. Like the small model, it is very fast, but it has better performance than the small model due to a larger set of word vectors. Given the larger set of word vectors, the large model is also larger in size (382 MB). Ideally, you can look at our experiments with different NER models and do your own experiments to determine which base model performs best on your data.
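For example, here's a minimal sketch (assuming you have en_core_web_lg downloaded) applying the same trick to the large model and then pointing ner.correct at it:

import spacy

# strip the existing ner component from the large model and save it
nlp = spacy.load("en_core_web_lg")
nlp.remove_pipe("ner")
nlp.to_disk("en_core_web_lg_without_ner")

prodigy ner.correct gold_ner en_core_web_lg_without_ner ./news_headlines.jsonl --label PERSON,ORG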
Let me know if this helps or if you have any other questions!