Bootstrapping terms with pattern file

Hi all,

I am interested in teaching Prodigy to tag all tokens (up to the next punctuation mark) that follow certain “critical” words.

Eg.
John is suffering from a bad stomachache after having the food at ABC restaurant.

I would like to tag the terms “having the food at ABC restaurant” as a particular entity, for example the NE “Cause of Illness”.

I did the following in my pattern file:

{"label": "CAUSES", "pattern":[{"TEXT":{"REGEX": "(?<=after ).*$"}}]}

I believe this is the correct REGEX format. However, when I feed it to Prodigy ner.teach, it does not work well: it only highlights “after” in all my documents, instead of the intended terms that come AFTER the word “after”.

What seems to be the problem? Thanks!

Hi @jsnleong,

Sounds like you’re on the right path. It’s hard to say what’s wrong from the limited example, but a wild guess would be that Prodigy might not support the regular expression you’re using. In particular, I think support for lookarounds in regex implementations is spotty.

Can you provide a snippet of code that reproduces your failing example in a minimal way? Also, what versions of Prodigy and spaCy are you running?

Yes, the syntax looks okay, so there are two things to check here:

  • Under the hood, the patterns are matched using spaCy’s Matcher. Does it work as expected if you match your pattern directly in spaCy?
  • Are you using the latest version of Prodigy with spaCy v2.1+? The REGEX option was only introduced in that version and is not supported if you’re running spaCy v2.0.

Hi @justindujardin and @ines,

Thanks for the reply.

I tried running the basic Matcher with the RegEx pattern indicated above, this is the result I received…


As expected, the match was similar to that of Prodigy: it only matched the literal word “after”. What seems to be the problem?

In case you are wondering, this is the RegEx that matches perfectly on a RegEx Checker Tool!

Kindly advise, thanks!

Thanks for sharing more details – I think I understand it now. Under the hood, Prodigy uses spaCy's Matcher, so if the matcher doesn't match, Prodigy won't produce a match either.

spaCy's rule-based Matcher works on tokens, not the entire text. Each dictionary in the patterns represents one token. So the regex in your first pattern entry will look for one token that matches (?<=after).*$, which correctly matches the token with the text "after": the lookbehind is satisfied at the end of that token, and .* then matches the empty string. If you're looking for a token "after" and one or more tokens following it, you could for instance represent it like this on the token level:

[{"TEXT": "after"}, {"OP": "+"}]
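
To illustrate the point above, here is a rough sketch in plain Python (not spaCy itself) of what happens when a regex is checked against one token at a time. Splitting on whitespace is only a stand-in for the real tokenizer, and the per-token search call is an assumption about how the REGEX check behaves, so treat it as an approximation rather than the Matcher's actual code.

```python
import re

# The regex as quoted in the reply above (no space after "after").
PATTERN = re.compile(r"(?<=after).*$")

text = "John is suffering from a bad stomachache after having the food at ABC restaurant."

# Assumption: whitespace splitting stands in for spaCy's tokenizer.
tokens = text.split()

# The regex is searched against each token's text separately,
# never against the whole sentence.
token_hits = [tok for tok in tokens if PATTERN.search(tok)]
print(token_hits)

# Searched against the whole string, the same regex behaves the way
# it does in a regex checker tool.
whole = PATTERN.search(text)
print(repr(whole.group(0)))
```

Only the token "after" satisfies the regex on its own, while the whole-string search returns everything after "after". That difference is why a regex that looks right in a checker tool still highlights a single word here.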

If you do want to match over the whole text using one regular expression, you could also just write your own little regex matcher that matches on the incoming text and yields examples with the matched "spans". See my comment here for an example:
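
As a rough illustration of the whole-text approach (this is not the code from the linked comment), a minimal sketch could look like the following. The function name regex_match_examples, the CAUSES label and the exact regular expression are all hypothetical choices for this example. The output uses a "text" plus "spans" dictionary with character offsets, which is the general shape of an annotation task.

```python
import re

# Hypothetical label and expression: capture everything after "after "
# up to (but not including) the next sentence-ending punctuation mark.
LABEL = "CAUSES"
CAUSE_RE = re.compile(r"(?<=after )[^.!?]+")


def regex_match_examples(texts, expression=CAUSE_RE, label=LABEL):
    """Yield one task dict for each text with at least one regex match."""
    for text in texts:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in expression.finditer(text)
        ]
        if spans:
            yield {"text": text, "spans": spans}


texts = [
    "John is suffering from a bad stomachache after having the food at ABC restaurant.",
    "Mary feels fine.",
]
examples = list(regex_match_examples(texts))
for eg in examples:
    for span in eg["spans"]:
        print(span["label"], "->", eg["text"][span["start"]:span["end"]])
```

Because the offsets here are character-based, it is probably worth checking that they line up with token boundaries before feeding such examples into a recipe.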


Hi @ines,

Awesome, big thanks!! I see why things didn’t work out previously… :)

Now that the pattern file is fixed, what is the difference between ner.teach and ner.match? Both seem to perform the same matching, don't they?

Also, with regard to my previous post (Domain-specific NER project), do you have any suggestions for pre-trained healthcare models that can be imported into spaCy? How about the ones in spaCy’s universe?

Thanks once again!

Both recipes do matching, but in addition to that, ner.teach will also show you the model's predictions and update the model in the loop with your annotations. The patterns can help you get over the "cold start problem" and ensure you get to annotate enough positive examples in the beginning. Once the model has seen enough of them, it'll start suggesting entities as well, based on the previous annotations.

ner.match only performs the matching itself and doesn't do any active learning or use the model to suggest examples. It's useful if you want to go through a list of patterns and label their matches in context.

Hi @ines, thanks!

I tried your token matching pattern, but it didn't work for me. Instead, I made a small change and got it to work.
[{"TEXT": "after"}, {"TEXT": {"REGEX": "[a-zA-Z0-9]*$"}, "OP": "+"}]

However, what happens now is that in the Prodigy annotation server, I get matches that grow by one token at a time. Here is an illustration, with the highlighted text shown in square brackets.

John feels sick [after eating] prawns at Restaurant ABC.
John feels sick [after eating prawns] at Restaurant ABC.
John feels sick [after eating prawns at] Restaurant ABC.
John feels sick [after eating prawns at Restaurant] ABC.
John feels sick [after eating prawns at Restaurant ABC].

So what I did was reject the first 4 matches and accept only the final one. Is this the right approach? Am I erroneously teaching the model by rejecting so many examples?

Also, this is pretty time-consuming. I just want the machine to highlight the chunk of text "after eating prawns at Restaurant ABC" so that I can simply reject or accept it.

Thanks!
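
For what it's worth, the one-token-at-a-time behaviour described above is consistent with a "one or more" operator reporting every possible match length. Below is a minimal plain-Python sketch of keeping only the longest candidate. The token list, the candidate spans and the helper name keep_longest are all made up for illustration and are not taken from Prodigy or spaCy.

```python
# Hypothetical tokens and token-index spans (start, end) that an
# "after" + one-or-more-tokens pattern might produce.
tokens = ["John", "feels", "sick", "after", "eating", "prawns",
          "at", "Restaurant", "ABC", "."]
candidate_spans = [(3, 5), (3, 6), (3, 7), (3, 8), (3, 9)]


def keep_longest(spans):
    """Prefer longer spans and drop any span overlapping one already kept."""
    kept = []
    taken = set()
    # Longest first; ties are broken in favour of the earlier start.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


longest = keep_longest(candidate_spans)
for start, end in longest:
    print(" ".join(tokens[start:end]))
```

With the spans filtered this way, only the full chunk "after eating prawns at Restaurant ABC" would be left to accept or reject.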