I have a question about how best to incorporate original training data to avoid catastrophic forgetting. I’ve been fine-tuning spaCy NER’s GPE label to better pick up multi-word or hyphenated place names (which come up a lot in Spanish and Arabic names), and to handle some kinds of short text. After a thousand or so examples, quality on those improves, but it really falls apart on other place names. I’ve licensed the OntoNotes corpus and would like to use its annotations to remind spaCy what other GPEs look like. I can think of two ways to do this:
- convert OntoNotes sentences into Prodigy’s format and load some of them into the annotations DB, or
- export Prodigy’s annotations to the spaCy training format, intermingle with OntoNotes in spaCy format, and train using spaCy.
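For the first option, here’s roughly what I have in mind: a hand-rolled converter from OntoNotes-style character-offset annotations into Prodigy-style JSONL task dicts (`text` plus `spans`). This is my own sketch, not anything from the Prodigy codebase, so I may have the exact schema wrong:

```python
import json

def to_prodigy_task(text, entities):
    """Convert one (text, [(start, end, label), ...]) pair into a
    Prodigy-style NER task dict with character-offset spans.
    Hand-rolled sketch; field names are my best guess at the format."""
    return {
        "text": text,
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in entities
        ],
        "answer": "accept",  # treat gold OntoNotes spans as accepted answers
    }

# Example: one OntoNotes-style sentence with a GPE annotation
task = to_prodigy_task("She flew to New York.", [(12, 20, "GPE")])
print(json.dumps(task))
```

I’d then dump one JSON object per line and import the file into the DB, if that’s a supported path.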
Do you have advice on which one makes more sense? Can spaCy’s training work on incomplete (non-gold) sentence annotations like the ones Prodigy produces? Is there a Prodigy format --> spaCy format converter?
I think the question applies to pseudo-rehearsal as well, since you’d need some way of intermingling the data.
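For the intermingling step itself, I’m picturing something like the sketch below: interleave the new annotations with “rehearsal” examples (either real OntoNotes sentences or, for pseudo-rehearsal, the old model’s own predictions on raw text) at a fixed ratio, then shuffle. All the names here are my own placeholders, not spaCy or Prodigy API:

```python
import random

def mix_with_rehearsal(new_examples, rehearsal_examples, ratio=2, seed=0):
    """Mix new annotations with rehearsal examples, keeping roughly
    `ratio` rehearsal examples per new one, then shuffle.
    Placeholder sketch; examples can be dicts in any training format."""
    rng = random.Random(seed)
    n_rehearsal = min(len(rehearsal_examples), ratio * len(new_examples))
    mixed = list(new_examples) + rng.sample(rehearsal_examples, n_rehearsal)
    rng.shuffle(mixed)
    return mixed
```

Whether the rehearsal side should be gold OntoNotes annotations or the old model’s predictions (and what ratio works) is exactly what I’m unsure about.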
Thanks for any advice you have!