I am currently working on a NER model to improve the "ORG" and "PRODUCT" entities. My end goal is to create a production-ready model. To achieve this I created a gold dataset using ner.make-gold, where I annotated only these two entity types. This dataset is currently small, just to prove the concept and test the pipeline.
I saw in the Prodigy documentation that you can train one or more entity labels separately, for example:
prodigy ner.batch-train ner_product en_core_web_sm --output /tmp/model --eval-split 0.5 --label PRODUCT
But I did not see the same functionality for spacy train. Does spaCy offer something similar?
Furthermore, in order to avoid the "catastrophic forgetting" problem in my current setup with spacy train, would I need to annotate the other entities (i.e. PERSON, LOC, etc.) in my gold data as well, or can they be left out?
Thanks a lot in advance!
We don't have the ability to train only one label in spacy train currently, no. You could make the dataset so that it only includes those labels in the annotations, though.
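For instance, if you export your annotations with prodigy db-out, a small stdlib-only script could strip out every span except the labels you care about before converting the data for spacy train. This is just a sketch: the KEEP set and the example record are placeholders, and it assumes Prodigy-style JSONL records with a "spans" list of {"start", "end", "label"} dicts.

```python
import json

KEEP = {"ORG", "PRODUCT"}  # the only labels we want to train on

def filter_spans(jsonl_lines, keep=KEEP):
    """Yield JSONL lines with every span dropped whose label isn't in `keep`.

    Assumes Prodigy-style records: each line is a JSON object with a
    "text" key and a "spans" list of {"start", "end", "label"} dicts.
    """
    for line in jsonl_lines:
        record = json.loads(line)
        record["spans"] = [s for s in record.get("spans", []) if s["label"] in keep]
        yield json.dumps(record)

# One record as `prodigy db-out` might export it (invented example text):
raw = ['{"text": "Tim Cook of Apple announced the iPhone.", '
       '"spans": [{"start": 0, "end": 8, "label": "PERSON"}, '
       '{"start": 12, "end": 17, "label": "ORG"}, '
       '{"start": 32, "end": 38, "label": "PRODUCT"}]}']

filtered = [json.loads(line) for line in filter_spans(raw)]
print([s["label"] for s in filtered[0]["spans"]])  # ['ORG', 'PRODUCT']
```

Whether leaving the other labels out entirely is safe depends on the forgetting issue discussed below, so filtering like this is best combined with one of the mitigation strategies.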
You could consider annotating the other entities; that's probably useful. Hopefully, if you're getting good accuracy on them, approving the annotations with make-gold won't be too time-consuming.
Another experimental approach you could try in spacy train is the --raw-text argument. spaCy has a feature that's a bit speculative and that I haven't experimented with thoroughly: if you have an initial model, you can call nlp.resume_training(), and afterwards call the nlp.rehearse() method with a batch of documents. This will run an original copy of the model over the documents, and train the current model to replicate the original model's predictions. This should mitigate the catastrophic forgetting problem.
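Roughly, a rehearsal loop could look like the sketch below. Treat this as illustrative only: the exact rehearse() signature has varied between spaCy versions, the raw_texts list is placeholder data, and you'd interleave your normal nlp.update() calls on the new annotations where indicated.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # the initial model whose knowledge we want to keep
optimizer = nlp.resume_training()    # keeps a frozen copy of the model for rehearsal

# Placeholder: in practice, load your own unlabelled text here
raw_texts = ["Some unlabelled text.", "More raw text from your domain."]

for epoch in range(10):
    random.shuffle(raw_texts)
    losses = {}
    # ... your normal nlp.update() calls on the new ORG/PRODUCT annotations ...
    # Rehearse on raw text: the current model is trained to reproduce the
    # original model's predictions, counteracting catastrophic forgetting.
    docs = [nlp.make_doc(text) for text in raw_texts]
    nlp.rehearse(docs, sgd=optimizer, losses=losses)
```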
The --raw-text argument takes a JSONL file, where each line should have a key "text" with some text content. You could also try adding calls to nlp.rehearse() to a Prodigy recipe if you wanted to try out the approach within Prodigy.
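The raw-text JSONL file would simply look like this (example sentences invented):

```
{"text": "Apple is reportedly looking at buying a U.K. startup."}
{"text": "The new MacBook Pro ships next month."}
```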