Sorry about the late reply! I think what you're experiencing might be what's often referred to as the "catastrophic forgetting" problem: as your model learns the new entity type, it "forgets" what it previously learned. In your example, the effect is pretty significant, but that's likely because you've trained a completely new entity type, so the only data the model is updating on is examples labelled TECH and none of the other entity types. Because the model is never "reminded" of the other types, it overfits on the new data.
This blog post we've published has some more background on this, including strategies to prevent it. One approach is to mix in examples the model previously got right and train on both those and the new examples.
This is pretty easy to do in Prodigy: after collecting annotations for your new TECH entity, run the model over the same input text and annotate the other labels. You can add all annotations to the same dataset and then train your model on those examples. Make sure to always use input data that's similar to what the model will have to process at runtime. This might also give you a small boost in accuracy over the standard English model, because you're also improving the existing entity types on your specific data.
Alternatively, you can also generate those examples yourself by running spaCy over a bunch of text and selecting the entities you care about most. (See the PRODIGY_README.html for details on the JSONL format – all you need to do is convert the entity annotations to that format and then import them to your dataset using the db-in command.)
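If it helps, here's a minimal sketch of what that conversion could look like. The helper name `to_prodigy_format` and the output file name are just for illustration – the key part is the `{"text": ..., "spans": [...]}` structure with character offsets and labels:

```python
import json

def to_prodigy_format(text, entities):
    # entities: list of (start_char, end_char, label) tuples, e.g. what
    # you'd get from spaCy via (ent.start_char, ent.end_char, ent.label_)
    # for ent in doc.ents
    return {
        "text": text,
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in entities
        ],
    }

# One example task with an ORG entity, as spaCy's pretrained model
# would predict it
example = to_prodigy_format(
    "Apple is looking at buying a U.K. startup.",
    [(0, 5, "ORG")],
)

# Write one JSON object per line, so the file can be imported with db-in
with open("silver_annotations.jsonl", "w", encoding="utf8") as f:
    f.write(json.dumps(example) + "\n")
```

You'd then import the file with something like `prodigy db-in your_dataset silver_annotations.jsonl`.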