Hello there,
I am currently working on an NER model to improve the "ORG" and "PRODUCT" entities. My end goal is a production-ready model. To get there, I created a gold dataset using ner.make-gold, annotating only these two entities. The dataset is currently small, just enough to prove the concept and test the pipeline.
I saw in the Prodigy documentation that you can train one or more entity labels separately, for example:
But I did not see the same functionality for spacy train. Does spaCy have similar functionality?
Furthermore, to avoid the "catastrophic forgetting" problem in my current setup with spacy train, would I need to annotate the other entities (e.g. PERSON, LOC) in my gold data as well, or can they be left out?
No, we don't currently have the ability to train only one label in spacy train. You could build the dataset so that its annotations include only those labels, though.
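As a rough illustration of restricting a dataset to certain labels, here is a minimal sketch. It assumes Prodigy-style records with a "text" field and a "spans" list of start/end/label dicts; the helper name keep_only_labels and the sample sentence are made up for the example, and converting the result into the format spacy train expects is a separate step not shown here.

```python
KEEP_LABELS = {"ORG", "PRODUCT"}  # the two labels from the question


def keep_only_labels(example, labels=KEEP_LABELS):
    """Return a copy of one annotated example containing only the wanted spans.

    Hypothetical helper: assumes each span is a dict with a "label" key.
    """
    filtered = dict(example)  # shallow copy so the input record is not modified
    filtered["spans"] = [
        span for span in example.get("spans", []) if span["label"] in labels
    ]
    return filtered


# Made-up sample record in the assumed format.
examples = [
    {
        "text": "Tim Cook presented the iPhone at Apple.",
        "spans": [
            {"start": 0, "end": 8, "label": "PERSON"},
            {"start": 23, "end": 29, "label": "PRODUCT"},
            {"start": 33, "end": 38, "label": "ORG"},
        ],
    }
]

filtered = [keep_only_labels(eg) for eg in examples]
for eg in filtered:
    for span in eg["spans"]:
        print(span["label"], eg["text"][span["start"]:span["end"]])
```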
Annotating the other entities as well is probably useful. Hopefully, if the model is already getting good accuracy on them, approving its suggestions with make-gold won't be too time-consuming.
Another experimental approach you could try in spacy train is the --raw-text argument. It relies on a spaCy feature that's a bit speculative and that I haven't experimented with thoroughly: if you have an initial model, you can call nlp.resume_training() and afterwards call the nlp.rehearse() method with a batch of documents. This runs an original copy of the model over the documents and trains the current model to replicate the original model's predictions, which should mitigate the catastrophic forgetting problem.
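For anyone who wants to try this in their own training loop, here is a minimal sketch of how the calls could be interleaved. It assumes the spaCy v2.1+ signatures (nlp.resume_training() returning an optimizer, nlp.update(texts, annotations, sgd=..., losses=...) and nlp.rehearse(docs, sgd=..., losses=...)); the function name, batch size and iteration count are illustrative choices, not anything spaCy prescribes.

```python
import random


def update_with_rehearsal(nlp, train_data, raw_texts, n_iter=5, batch_size=4):
    """Alternate normal updates on gold data with rehearsal on unlabelled text.

    Hypothetical helper. Assumes:
      - nlp is a loaded spaCy v2.1+ pipeline (e.g. from spacy.load),
      - train_data is a list of (text, {"entities": [(start, end, label)]}),
      - raw_texts is a list of plain strings with no annotations.
    """
    # Keeps the existing weights and stores a copy of the original model
    # that rehearse() will later try to imitate.
    optimizer = nlp.resume_training()
    losses = {}
    data = list(train_data)  # copy so the caller's list is not shuffled
    for _ in range(n_iter):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            texts, annotations = zip(*data[i:i + batch_size])
            # Normal supervised update on the new ORG / PRODUCT annotations.
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            # Rehearsal step: nudge the current model back towards the
            # original model's predictions on unlabelled text.
            raw_batch = random.sample(raw_texts, min(batch_size, len(raw_texts)))
            docs = [nlp.make_doc(text) for text in raw_batch]
            nlp.rehearse(docs, sgd=optimizer, losses=losses)
    return losses
```

You would call it with something like update_with_rehearsal(nlp, train_data, raw_texts) after loading your base model. Since the feature is experimental, it's worth evaluating on the other entity types before and after to check that it actually helps.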
The --raw-text argument takes a JSONL file, where each line is a JSON object with a "text" key holding some text content. You could also add calls to nlp.rehearse() to a Prodigy recipe if you wanted to try out the approach within Prodigy.
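To make the file format concrete, here is a small sketch that writes such a JSONL file using only the standard library. The function names, the sample texts and the output path raw_text.jsonl are placeholders for illustration.

```python
import json


def raw_text_lines(texts):
    """Yield one JSON line per non-empty text, each with a single "text" key."""
    for text in texts:
        text = text.strip()
        if text:  # skip blank entries so every line carries some content
            yield json.dumps({"text": text}) + "\n"


def write_raw_text_jsonl(texts, path):
    """Write the texts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        f.writelines(raw_text_lines(texts))


# Made-up unlabelled sentences; in practice use text similar to your real data.
sample_texts = [
    "The meeting was moved to Thursday.",
    'She said "hello" and left for Berlin.',
]

if __name__ == "__main__":
    write_raw_text_jsonl(sample_texts, "raw_text.jsonl")  # placeholder path
```

The resulting file path is what you would pass to --raw-text.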