Based on the sample PhraseMatcher code in the (link), I have extended the GPE entity category to include some of the Australian Suburbs & States.
After running nlp.add_pipe(entity_matcher, after='ner') to add these location names to the pipeline, does the saved model (nlp.to_disk()) capture the new locations?
An additional question: From the best practice post, I found that
PhraseMatcher before NER would transform matched words/phrases into an entity and NER will not touch it afterwards.
So what would be the difference if I add the entity_matcher after ner? E.g. nlp.add_pipe(entity_matcher, after='ner')
The model you're saving out here will preserve the original pipeline (e.g. ['ner', 'entity_matcher']), but it won't actually include your custom code. This is important, because the to_disk/from_disk methods should never quietly save and eval (!) arbitrary code. They only save out data.
So when you're loading your model back in, you'll need to make sure to add an entry to the Language.factories (see here) that tells spaCy how to initialize the pipeline component entity_matcher.
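Here's a minimal sketch of what that registration amounts to, assuming spaCy v2.x, where Language.factories is a plain dict mapping a component name to a callable that receives nlp plus config keyword arguments. The helper register_factory and the stand-in classes FakeLanguage and EntityMatcher are hypothetical and only there so the snippet runs without spaCy installed.

```python
def register_factory(language_cls, name, component_cls):
    """Tell language_cls how to build the pipeline component called `name`."""
    # Assumption: a factory is called as factory(nlp, **cfg), as in spaCy v2.x.
    language_cls.factories[name] = lambda nlp, **cfg: component_cls(nlp, **cfg)


# Stand-in for spacy.language.Language (illustration only, not a spaCy class).
class FakeLanguage:
    factories = {}


# Stand-in for your custom component (hypothetical; a real one would
# run a PhraseMatcher and set doc.ents).
class EntityMatcher:
    name = "entity_matcher"

    def __init__(self, nlp, **cfg):
        self.nlp = nlp
        self.cfg = cfg

    def __call__(self, doc):
        return doc


register_factory(FakeLanguage, "entity_matcher", EntityMatcher)

# On load, the name stored in the model's pipeline is looked up in the
# factories and called to create the component.
component = FakeLanguage.factories["entity_matcher"](nlp=None)
```

With a real model you'd pass spacy.language.Language instead of FakeLanguage, and do the registration before calling spacy.load(), so the name 'entity_matcher' can be resolved when the pipeline is rebuilt.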
Alternatively, you can also include your custom component with your model, by turning it into a Python package (that can ship code). My comment on this thread explains this in more detail:
In that case (adding the entity_matcher after ner), it'd still add the entities to the doc.ents. However, you'd have to take care of reconciling duplicates and overlapping matches. By definition, one token can only be part of one entity, and you'd have to decide which entities should take precedence (the ones set by the statistical NER or the ones set by your custom component). This depends on your use case, the specific entities etc.
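To make the precedence decision concrete, here's a rough sketch in plain Python. It assumes entities are represented as (start, end, label) tuples over token offsets with an exclusive end, and that each input list has no overlaps of its own. The function name reconcile and the prefer_custom flag are made up for illustration and aren't part of spaCy.

```python
def reconcile(ner_ents, custom_ents, prefer_custom=True):
    """Merge two span lists so that no token ends up in two entities.

    Spans are (start, end, label) tuples with an exclusive end offset.
    The preferred list is kept whole; spans from the other list are only
    added if they don't touch a token that's already taken.
    """
    if prefer_custom:
        primary, secondary = custom_ents, ner_ents
    else:
        primary, secondary = ner_ents, custom_ents

    merged = list(primary)
    taken = set()
    for start, end, _ in primary:
        taken.update(range(start, end))

    for start, end, label in secondary:
        if taken.isdisjoint(range(start, end)):
            merged.append((start, end, label))
            taken.update(range(start, end))
    return sorted(merged)


ner = [(0, 1, "PERSON"), (3, 4, "GPE")]   # e.g. from the statistical NER
custom = [(3, 5, "GPE")]                  # e.g. from the entity_matcher

custom_wins = reconcile(ner, custom, prefer_custom=True)
ner_wins = reconcile(ner, custom, prefer_custom=False)
```

In a real component you'd build the tuples from doc.ents and your matcher's results, then convert the merged list back into Span objects before assigning to doc.ents.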
Since rule-based NER is something people are really interested in, I've written a simple built-in component for spaCy v2.1.x (currently available for testing as spacy-nightly). You can check out the code and see what it does in addition to just adding the entities (in order to make it interoperate with the statistical entity recognizer):
You can check out the EntityRuler code here. The way it works is very similar, but its API is closer to the other matcher APIs, and it has a few more methods (checking if a pattern exists, exporting/importing patterns to and from JSONL etc.). It also lets you set an argument to overwrite existing entities (if the component is added after the regular entity recognizer, for instance) and handles overlapping matches by only selecting the largest span, since a token can only be part of one entity and entity spans can't overlap.
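The "largest span wins" idea can be sketched in a few lines of plain Python. This is only an illustration of the technique and not the EntityRuler's actual implementation. It assumes matches are (start, end, label) tuples over token offsets with an exclusive end, and the name select_longest is hypothetical.

```python
def select_longest(spans):
    """Keep the longest spans and drop any span that overlaps a kept one."""
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    seen_tokens = set()
    for start, end, label in ordered:
        if seen_tokens.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            seen_tokens.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(kept)


# Overlapping matches: the three-token match covers the two single-token ones.
matches = [(0, 1, "GPE"), (0, 3, "GPE"), (2, 3, "GPE"), (5, 6, "GPE")]
filtered = select_longest(matches)
```

Here the shorter matches inside tokens 0 to 3 are dropped in favour of the three-token span, and the unrelated match at token 5 is kept.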