REGEX operator in the patterns file

Hi there! First of all, thank you for the support!

I would like to use the new REGEX operator in a pattern in my patterns.jsonl file.

Say I have a lot of examples where a token or a sequence of tokens should get a specific label, but only when it comes after a particular token (that token designates where the bank transaction occurred).

The pattern is therefore a simple one: it uses a positive lookbehind and captures everything after it.
{"label": "MERCHANT", "pattern": [{"REGEX": "(?<=IL\\s).*"}]}

P.S.: I have added an escaping backslash because of the JSON decoder.
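To illustrate the escaping point, here is a minimal sketch using Python's standard `json` module. It assumes the patterns file is parsed as one JSON object per line; the variable names are only for illustration. A single backslash before `s` is not a valid JSON escape, so it has to be doubled in the file:

```python
import json

# One line of a patterns.jsonl file; the regex backslash is doubled for JSON.
valid_line = r'{"label": "MERCHANT", "pattern": [{"REGEX": "(?<=IL\\s).*"}]}'
# The same line with a single backslash, which is not a valid JSON escape.
invalid_line = r'{"label": "MERCHANT", "pattern": [{"REGEX": "(?<=IL\s).*"}]}'

entry = json.loads(valid_line)
regex = entry["pattern"][0]["REGEX"]  # decoded value holds a single backslash

try:
    json.loads(invalid_line)
    invalid_ok = True
except json.JSONDecodeError:
    # The JSON decoder rejects "\s" as an invalid escape sequence.
    invalid_ok = False
```

After decoding, `regex` is the regular expression with one backslash, which is what the regex engine should see.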

However, when I run the ner.match recipe, every token is labelled as MERCHANT, with the pattern ID being 0 (the one I have provided).

What am I doing wrong?

Sorry if this was confusing – I assume you’re referring to the REGEX attribute proposal in this GitHub thread? This thread is still only the spec and proposal, i.e. the planned implementation. The changes will hopefully ship with spaCy v2.1.0 (since some of the changes to the Matcher internals are not fully backwards compatible). But they’re not yet available in the stable release and not implemented in the current nightly build.

Thanks! I implemented the custom recipe and adjusted it to accept the various regular expressions in order to speed up the gathering of annotations.

Have a nice day!

Hi @ines, I am interested in using the REGEX attribute now that it is available in spaCy. But every token in every text is still being labeled by that pattern (as described by @mmeasic).

When can we expect REGEX to be supported in Prodigy? Or am I doing something wrong?

The matching is all done via spaCy, so if you're using a recent version of spaCy that supports the REGEX operator (v2.1+), it should work as expected and described here.
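One thing to keep in mind, as far as I understand spaCy's token-based patterns: each dict in a pattern describes a single token, and `REGEX` is tested against that token's text on its own. A lookbehind such as `(?<=IL\s)` needs the preceding `IL` and whitespace to be in the same string, so it can behave differently per token than on the whole text. The sketch below uses only Python's `re` module, with a whitespace split standing in for tokenization. Both the sample text and the split are assumptions for illustration, not what spaCy's tokenizer does exactly:

```python
import re

pattern = re.compile(r"(?<=IL\s).*")
text = "CARD PAYMENT IL ACME STORE"  # made-up example text

# On the full string, the lookbehind finds "IL " and captures what follows.
full_match = pattern.search(text)
after_il = full_match.group(0) if full_match else None

# Per token, no token contains "IL" plus whitespace before a match position,
# so the lookbehind never succeeds on any single token.
tokens = text.split()  # crude stand-in for a tokenizer
token_hits = [tok for tok in tokens if pattern.search(tok)]
```

If that is the behaviour you see, one option is to describe the `IL` token with its own dict in the pattern, or to run the regular expression over the full text in a custom recipe, as mentioned earlier in this thread.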

Ah great, a recent version of spaCy worked!
