NER with Gazetteer

Often, the accuracy of NER is greatly increased by using a gazetteer as an input to the model. Are there any plans on the roadmap to add this to Prodigy?

The ner.teach command supports using a patterns file alongside the statistical model. The patterns file can contain basic literal matches, as well as more advanced token patterns that use quantifiers, POS tags, dependency labels, etc.
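To make the patterns file more concrete, here is a minimal sketch that writes one as JSONL, with one pattern per line. The labels, terms, and the `write_patterns` helper are placeholders I made up for illustration; the token patterns follow spaCy's Matcher format.

```python
import json

# Hypothetical gazetteer entries; the labels and terms are placeholders.
patterns = [
    # A plain string is matched literally.
    {"label": "GPE", "pattern": "Berlin"},
    # A list of token descriptions: case-insensitive "new york".
    {"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]},
    # One or more proper nouns followed by "inc" (POS tag plus quantifier).
    {"label": "ORG", "pattern": [{"POS": "PROPN", "OP": "+"}, {"LOWER": "inc"}]},
]


def write_patterns(path, entries):
    """Write one JSON object per line and return the number of lines written."""
    with open(path, "w", encoding="utf8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    return len(entries)


n_written = write_patterns("patterns.jsonl", patterns)
```

The resulting file can then be passed to ner.teach with the --patterns argument.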

The interaction between the patterns file and the statistical model isn’t exactly the same as a gazetteer, though. The patterns file is used to suggest entities, which you then accept or reject. Your answers to these pattern-matched entities are used to train the statistical model. They also affect the match score of each pattern, which allows Prodigy to assign low scores to patterns that are usually rejected, and high scores to patterns that are usually accepted.
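As a rough illustration of how accept/reject answers can move a pattern's score, here is a small sketch that uses a smoothed acceptance ratio. This is not Prodigy's actual scoring code: the `PatternScores` class, its method names, and the prior values are assumptions made up for the example.

```python
from collections import defaultdict


class PatternScores:
    """Illustrative only: track answers per pattern and score by smoothed ratio."""

    def __init__(self, prior_accept=2.0, prior_reject=2.0):
        # Hypothetical priors, so a pattern with no answers yet scores 0.5.
        self.prior_accept = prior_accept
        self.prior_reject = prior_reject
        # pattern id -> [number accepted, number rejected]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, pattern_id, answer):
        if answer == "accept":
            self.counts[pattern_id][0] += 1
        elif answer == "reject":
            self.counts[pattern_id][1] += 1
        # Any other answer (e.g. "ignore") leaves the counts unchanged.

    def score(self, pattern_id):
        accepted, rejected = self.counts[pattern_id]
        total = accepted + rejected + self.prior_accept + self.prior_reject
        return (accepted + self.prior_accept) / total


scores = PatternScores()
for answer in ["accept", "accept", "accept", "reject"]:
    scores.update("new-york", answer)
for answer in ["reject", "reject", "reject", "reject"]:
    scores.update("apple", answer)

print(scores.score("new-york"))  # mostly accepted, so above 0.5
print(scores.score("apple"))     # always rejected, so below 0.5
```

The point of the smoothing is that a single answer doesn't push a pattern straight to 0 or 1; the score only drifts as evidence accumulates.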

The overall idea here is to use the gazetteer to train the statistical model, instead of using it inside the entity recognizer. If you want a pure gazetteer entity recognition component, you can use spaCy’s Matcher or PhraseMatcher classes: https://spacy.io/usage/linguistic-features#section-rule-based-matching. You could add a matcher to your pipeline, before the entity recognizer, like this:


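import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

def gazetteer(doc):
    # The matcher returns a list of matches, but a component must return the Doc
    matcher(doc)
    return doc

nlp.add_pipe(gazetteer, before='ner')  # spaCy v2-style add_pipe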

You would then add patterns to the matcher, together with an on_match callback that adds each match to the Doc as a named entity. The subsequent statistical NER is constrained by these existing entities: it can’t propose any entities that overlap or overwrite the ones that are already set.
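Here is a minimal sketch of such a callback. It assumes spaCy 2.2.2 or later, where Matcher.add accepts a list of patterns plus an on_match keyword; older versions take the callback as the second positional argument. It uses a blank English pipeline so it runs without a model download. The "GPE" label, the "GAZETTEER_GPE" key, and the callback name are placeholders.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank pipeline (tokenizer only) keeps the sketch self-contained.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)


def add_gazetteer_entity(matcher, doc, i, matches):
    """on_match callback: add match i to doc.ents as a GPE entity."""
    match_id, start, end = matches[i]
    span = Span(doc, start, end, label="GPE")  # illustrative label
    # doc.ents can't hold overlapping spans, so skip matches that would clash.
    if all(span.end <= ent.start or span.start >= ent.end for ent in doc.ents):
        doc.ents = list(doc.ents) + [span]


pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
matcher.add("GAZETTEER_GPE", [pattern], on_match=add_gazetteer_entity)

doc = nlp("She moved to New York last year.")
matcher(doc)  # runs the callback for every match
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a full pipeline, the matcher(doc) call would live inside the component that runs before 'ner', so the entities are already set by the time the statistical model sees the Doc.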