How do I switch the NER to the Wikipedia scheme?


I am interested in using the less granular Wikipedia scheme to identify entities in spaCy.

Is there a way to switch the NER model quickly?


If you’re working with a model that’s pre-trained using a different label scheme, changing the entire scheme is pretty difficult. After all, the model’s current scheme is what all its weights are based on.

If you only want to convert the existing predicted entities to the simpler scheme, you could write a wrapper around doc.ents that outputs PER when an entity label is PERSON, LOC for LOC and GPE, keeps ORG as ORG, and outputs MISC for everything else. If you want to do this more elegantly, you could also write a function that outputs an iterable of Span objects, just like doc.ents, and even make it available as a custom extension attribute like doc._.wiki_ents or something like that.
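As a rough illustration of that wrapper idea, here is a minimal sketch. The names WIKI_LABELS, to_wiki_label, wiki_ents and register_wiki_ents are made up for this example, and the mapping is only my reading of the suggestion above, so adjust it to taste. It assumes spaCy v2.1 or later, where Span accepts a string label; the spaCy imports sit inside the functions so the plain label mapping can be used on its own.

```python
# Hypothetical mapping from the OntoNotes-style labels to the Wikipedia scheme.
# Anything not listed here falls back to MISC.
WIKI_LABELS = {"PERSON": "PER", "GPE": "LOC", "LOC": "LOC", "ORG": "ORG"}


def to_wiki_label(label):
    """Translate a single entity label into the simpler scheme."""
    return WIKI_LABELS.get(label, "MISC")


def wiki_ents(doc):
    """Getter that returns new Span objects mirroring doc.ents, relabelled.

    The original doc.ents are left untouched.
    """
    from spacy.tokens import Span  # assumes spaCy v2.1+ (string labels)

    return tuple(
        Span(doc, ent.start, ent.end, label=to_wiki_label(ent.label_))
        for ent in doc.ents
    )


def register_wiki_ents():
    """Expose the getter as doc._.wiki_ents (registers only once)."""
    from spacy.tokens import Doc

    if not Doc.has_extension("wiki_ents"):
        Doc.set_extension("wiki_ents", getter=wiki_ents)
```

After calling register_wiki_ents() once, every Doc produced by your pipeline should expose doc._.wiki_ents, so something like [(ent.text, ent.label_) for ent in doc._.wiki_ents] gives you the converted labels while the model itself keeps predicting its original scheme.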

However, if you actually want to update the model using the new labels, it’d be very difficult to teach the model that all of these other entity types it’s learned to predict and has weights for are suddenly now MISC. Even just renaming the label PERSON to PER is not straightforward. So if that’s what you want to do, you’re probably much better off just training a new model from scratch. If you have access to the OntoNotes 5 corpus, which the English models were trained on (requires a commercial license), this would be easier, because you could run a search and replace over the corpus and swap out the existing labels for the simpler ones. Otherwise, you’d have to create a new corpus from scratch, which you can certainly do with Prodigy – but it’s definitely going to take some work, and I’m not sure it’ll be worth it.
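To give an idea of what that search and replace could look like, here is a small sketch. It assumes the annotations have already been converted to spaCy's simple training format of (text, {"entities": [(start, end, label)]}) tuples, which is not how OntoNotes 5 is distributed, so a conversion step would come first. The names WIKI_LABELS, remap_example and remap_corpus are hypothetical, and the mapping is just one possible choice.

```python
# Hypothetical mapping to the Wikipedia scheme; unlisted labels become MISC.
WIKI_LABELS = {"PERSON": "PER", "GPE": "LOC", "LOC": "LOC", "ORG": "ORG"}


def remap_example(text, annotations):
    """Return a copy of one training example with relabelled entities."""
    entities = [
        (start, end, WIKI_LABELS.get(label, "MISC"))
        for start, end, label in annotations.get("entities", [])
    ]
    # Keep any other annotation keys as they are and swap in the new entities.
    return text, {**annotations, "entities": entities}


def remap_corpus(examples):
    """Relabel every (text, annotations) pair without modifying the input."""
    return [remap_example(text, annotations) for text, annotations in examples]
```

For example, remap_corpus([("Ines lives in Berlin", {"entities": [(0, 4, "PERSON"), (14, 20, "GPE")]})]) would give you the same text with the labels PER and LOC. You may also want to drop purely numeric types like DATE or CARDINAL instead of turning them into MISC, depending on what you want the new model to learn.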

Thank you for the prompt response, Ines. In that case, I will leave the model as it is for now.