I am currently working on a NER model to improve the "ORG" and "PRODUCT" entities. My end goal is to create a production-ready model. To achieve this I created a gold dataset using ner.make-gold, where I annotated only these two entity types. This dataset is currently small, just to prove the concept and test the pipeline.
I saw in the Prodigy documentation that you can train one or more entity labels separately, for example:
prodigy ner.batch-train ner_product en_core_web_sm --output /tmp/model --eval-split 0.5 --label PRODUCT
But I did not see the same functionality for spacy train. Does spaCy offer something similar?
Furthermore, in order to avoid the "catastrophic forgetting" problem in my current setup with spacy train, would I need to annotate the other entities (i.e. PERSON, LOC, etc.) in my gold data as well, or can they be left out?
Thanks a lot in advance!
We don't have the ability to train only one label in spacy train currently, no. You could make the dataset so that it only includes those labels in the annotations, though.
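For instance, if you export your annotations with prodigy db-out, a small stdlib-only script could strip out every span except the labels you care about before converting the data for spacy train. This is just a sketch: the KEEP set and the example record are placeholders, and it assumes Prodigy-style JSONL records with a "spans" list of {"start", "end", "label"} dicts.

```python
import json

KEEP = {"ORG", "PRODUCT"}  # the only labels we want to train on

def filter_spans(jsonl_lines, keep=KEEP):
    """Yield JSONL lines with every span dropped whose label isn't in `keep`.

    Assumes Prodigy-style records: each line is a JSON object with a
    "text" key and a "spans" list of {"start", "end", "label"} dicts.
    """
    for line in jsonl_lines:
        record = json.loads(line)
        record["spans"] = [s for s in record.get("spans", []) if s["label"] in keep]
        yield json.dumps(record)

# One record as `prodigy db-out` might export it (invented example text):
raw = ['{"text": "Tim Cook of Apple announced the iPhone.", '
       '"spans": [{"start": 0, "end": 8, "label": "PERSON"}, '
       '{"start": 12, "end": 17, "label": "ORG"}, '
       '{"start": 32, "end": 38, "label": "PRODUCT"}]}']

filtered = [json.loads(line) for line in filter_spans(raw)]
print([s["label"] for s in filtered[0]["spans"]])  # ['ORG', 'PRODUCT']
```

Whether leaving the other labels out entirely is safe depends on the forgetting issue discussed below, so filtering like this is best combined with one of the mitigation strategies.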
You could consider annotating the other entities; that's probably useful. Hopefully, if you're getting good accuracy on them, approving the annotations with make-gold won't be too time-consuming.
Another experimental approach you could try in spacy train is the --raw-text argument. spaCy has a feature that's a bit speculative and that I haven't experimented with thoroughly: if you have an initial model, you can call nlp.resume_training(), and afterwards call the nlp.rehearse() method with a batch of documents. This will run an original copy of the model over the documents, and train the current model to replicate the original model's predictions. This should mitigate the catastrophic forgetting problem.
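Roughly, a rehearsal loop could look like the sketch below. Treat this as illustrative only: the exact rehearse() signature has varied between spaCy versions, the raw_texts list is placeholder data, and you'd interleave your normal nlp.update() calls on the new annotations where indicated.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # the initial model whose knowledge we want to keep
optimizer = nlp.resume_training()    # keeps a frozen copy of the model for rehearsal

# Placeholder: in practice, load your own unlabelled text here
raw_texts = ["Some unlabelled text.", "More raw text from your domain."]

for epoch in range(10):
    random.shuffle(raw_texts)
    losses = {}
    # ... your normal nlp.update() calls on the new ORG/PRODUCT annotations ...
    # Rehearse on raw text: the current model is trained to reproduce the
    # original model's predictions, counteracting catastrophic forgetting.
    docs = [nlp.make_doc(text) for text in raw_texts]
    nlp.rehearse(docs, sgd=optimizer, losses=losses)
```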
The --raw-text argument takes a JSONL file, where each line should have a key "text" with some text content. You could also try adding calls to nlp.rehearse() to a Prodigy recipe if you wanted to try out the approach within Prodigy.
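The raw-text JSONL file would simply look like this (example sentences invented):

```
{"text": "Apple is reportedly looking at buying a U.K. startup."}
{"text": "The new MacBook Pro ships next month."}
```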