(Re)using labels in patterns

Hi, I want to use some patterns to kickstart my NER project. I have a lot of terms that I know will pretty much always be entities, and it's been very easy to write patterns to catch them. However, I can do even better and use some grammatical information to expand those. I'm working with German text and, for example, if I know "Entsorgung" is an entity, I know that "Ver- und Entsorgung" will be as well. So in spaCy, what I would do is first run a pattern matcher to catch the single words and then write something along the lines of the pattern below to expand on these.

[{"IS_ALPHA": true}, {"IS_PUNCT": true}, {"LOWER": {"IN": ["u.", "und"]}}, {"ENT_TYPE": "MyEnt"}]

In Prodigy, however, this doesn't work. Firstly, one cannot run two consecutive pattern matchers, as the labels from the first one will be lost. Of course, I can append the patterns at the end of the patterns.jsonl, but I'm not sure whether the order in which the patterns are applied is guaranteed.

Secondly, and more importantly, I don't know how I can access a previously set label. I've tried ENT_TYPE as in the example above, but this doesn't work. At this point I'm not even sure whether the label is accessible to the matcher. I thought about moving all pattern matching to my spaCy model, or is there a better solution?

Hi! I think once you're getting to the point where you want to implement multiple stacked matcher rules or even add custom logic in Python to determine the final matches, it probably makes sense to add your own function that calls into spaCy's Matcher directly.
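As a rough sketch of what calling the Matcher directly could look like for the "Ver- und Entsorgung" case: a first pass labels the known single-word terms, the labels are written to doc.ents, and a second pass uses ENT_TYPE to expand them. This assumes spaCy v3 (the matcher.add signature differs in older versions). The find_entities helper, the one-word term list and the hand-built Doc are made up for illustration. The tokens are supplied by hand so the example doesn't depend on how the German tokenizer splits "Ver-", which is worth checking on real text.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

nlp = spacy.blank("de")  # blank pipeline, no trained model needed

# Pass 1: single-word terms that are known to be entities (illustrative list).
term_matcher = Matcher(nlp.vocab)
term_matcher.add("MyEnt", [[{"LOWER": {"IN": ["entsorgung"]}}]])

# Pass 2: expand "X - und <MyEnt>" using the labels written by pass 1.
expand_matcher = Matcher(nlp.vocab)
expand_matcher.add("MyEnt", [[
    {"IS_ALPHA": True},
    {"IS_PUNCT": True},
    {"LOWER": {"IN": ["u.", "und"]}},
    {"ENT_TYPE": "MyEnt"},
]])


def find_entities(doc):
    """Hypothetical helper: run both passes and return non-overlapping spans."""
    first = [Span(doc, start, end, label="MyEnt")
             for _, start, end in term_matcher(doc)]
    # Writing the first-pass matches to doc.ents makes ENT_TYPE visible to pass 2.
    doc.ents = filter_spans(first)
    second = [Span(doc, start, end, label="MyEnt")
              for _, start, end in expand_matcher(doc)]
    # filter_spans prefers the longest span when candidates overlap.
    return filter_spans(first + second)


# Hand-built Doc: assumes "Ver" and "-" are separate tokens.
doc = Doc(
    nlp.vocab,
    words=["Die", "Ver", "-", "und", "Entsorgung", "ist", "teuer"],
    spaces=[True, False, True, True, True, True, False],
)
spans = find_entities(doc)
for span in spans:
    print(span.text, span.start_char, span.end_char, span.label_)
```

Because the second pass runs on the same Doc after doc.ents has been set, the order of the two passes is explicit in the code rather than depending on the order of lines in a patterns file.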

Ultimately, all your custom matcher function needs to produce is a dictionary in Prodigy's JSON format with "spans" containing the start/end/label of the match: https://prodi.gy/docs/api-interfaces#ner_manual

You'll also need to ensure that you don't end up with overlapping spans, e.g. by only choosing the longest match or the "best" match, based on your custom logic.

You could base your custom recipe off the ner.manual script (ner_manual.py in the explosion/prodigy-recipes repository on GitHub) and add a wrapper around the stream that adds the "spans" based on the Matcher. If you run add_tokens last, after the matches, it'll take care of adding all the token information so you won't have to do this manually.

In general, using ENT_TYPE in a pattern should work if your model predicts the given entity type. Prodigy's built-in pattern matcher should be able to use that, since it'll process the texts with the spaCy model you load in.
