REGEX operator in the patterns file

Hi there! First of all, thank you for the support!

In order to use the new REGEX operator in the patterns file, I would like to provide a pattern in the patterns.jsonl file.

Let's say I have many examples where I expect a token or a sequence of tokens to get a specific label, but only when it comes after a particular token (that token designates where the bank transaction occurred).

The pattern is therefore a simple one: it uses a positive lookbehind and captures everything after it.
{"label": "MERCHANT", "pattern": [{"REGEX": "(?<=IL\\s).*"}]}

P.S.: I added an escaping backslash because of the JSON decoder.

However, when I run the ner.match recipe, every token is labelled as MERCHANT, with the pattern ID being 0 (the one I provided).

What am I doing wrong?

Sorry if this was confusing – I assume you’re referring to the REGEX attribute proposal in this GitHub thread? This thread is still only the spec and proposal, i.e. the planned implementation. The changes will hopefully ship with spaCy v2.1.0 (since some of the changes to the Matcher internals are not fully backwards compatible). But they’re not yet available in the stable release and not implemented in the current nightly build.

Thanks! I implemented the custom recipe and adjusted it to accept the various regular expressions in order to speed up the gathering of annotations.

Have a nice day!
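For readers wondering what the regex part of such a custom recipe might look like, here is a minimal sketch using only Python's `re` module. It is not the recipe from the post above. The sample text, the `regex_spans` helper and the `start`/`end`/`label` dict shape are assumptions made for illustration. The idea is to run the expression over the whole text rather than over single tokens, because a lookbehind such as `(?<=IL\s)` needs to see the preceding token and the whitespace after it.

```python
import re

def regex_spans(text, pattern, label):
    """Return character-offset spans for every non-empty regex match in text."""
    spans = []
    for match in re.finditer(pattern, text):
        start, end = match.span()
        if start < end:  # ignore empty matches, which ".*" can produce
            spans.append({"start": start, "end": end, "label": label})
    return spans

# Hypothetical bank-transaction line; "IL" marks where the transaction occurred.
text = "CARD PAYMENT IL ACME STORE"
pattern = r"(?<=IL\s).*"

# Matching against the full text finds everything after "IL ".
spans = regex_spans(text, pattern, "MERCHANT")
print([text[s["start"]:s["end"]] for s in spans])

# Matching each whitespace-separated token on its own finds nothing,
# because no single token contains "IL" followed by whitespace.
per_token_hits = [tok for tok in text.split() if re.search(pattern, tok)]
print(per_token_hits)
```

As far as I understand spaCy's documentation, the REGEX operator in a token pattern is applied to one token at a time, so the second, empty result is what a token-level pattern with this lookbehind would be expected to give. In a real recipe the character offsets would still need to be aligned with token boundaries.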


Hi @ines, I am interested in using the REGEX attribute now that it is available in spaCy, but every token in every text is still being labeled by that pattern (as described by @mmeasic).

When can we expect REGEX to be supported in Prodigy? Or am I doing something wrong?

The matching is all done via spaCy so if you're using a recent version of spaCy that supports the REGEX operator (v2.1+), it should work as expected and described here.

Ah great, a recent version of spaCy worked!


Hello,

first of all, thanks for the great tool.

I'm facing the same issue as @dancsalo: all tokens are labeled by the pattern, which should not be the case.

However, @dancsalo's fix does not apply to me, since I'm already using spaCy 2.2.3.

I have another question related to regex in patterns. How can I use a regex that contains characters that are special from a JSON point of view (e.g. {[)? From what I see, it produces a JSON format error. Does that mean it is not supported, or that the syntax should be adapted? I did not find anything in the docs on that point, but I might have missed it. If this is not the place for this question, let me know.

Cheers!

Thanks! From what you describe, it sounds like your pattern contains properties that aren't interpreted or that apply to all tokens, so a token pattern is either interpreted as a wildcard ({}), or it just always matches all tokens.

If you're providing them in a JSON string, { and [ are fine as they are inside the quotes. If you need to escape them for the regex, keep in mind that every \ in your expression has to be written as \\ in JSON, so a literal { would be "\\{".
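To avoid getting the escaping wrong by hand, one option is to let Python's `json` module write the line. This is only a sketch: the regex below (a literal `{[`) is a made-up example, and the TEXT/REGEX nesting follows the form used later in this thread in uppercase.

```python
import json

# Hypothetical regex: match a literal "{[" (both are regex metacharacters).
regex = r"\{\["

# Label and nesting are illustrative, modelled on the patterns in this thread.
pattern = {"label": "MERCHANT", "pattern": [{"TEXT": {"REGEX": regex}}]}

# json.dumps doubles each backslash so the line is valid JSON.
line = json.dumps(pattern)
print(line)

# Reading the line back recovers the original expression unchanged.
roundtrip = json.loads(line)["pattern"][0]["TEXT"]["REGEX"]
print(roundtrip == regex)
```

The printed line can be pasted into patterns.jsonl as is. The braces and brackets themselves need no JSON escaping; only the backslashes get doubled.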

Hello @ines ,

thanks a lot for your swift reply.

"From what you describe, it sounds like your pattern contains properties that aren't interpreted or that apply to all tokens, so a token pattern is either interpreted as a wildcard ({}), or it just always matches all tokens."

I don't think that's the issue.

I have tested with a patterns.jsonl containing only the following pattern:

{"label":"AC","pattern":[{"text":{"regex":"^abc$"}}]}

Then, running:

prodigy ner.manual tmp en_core_web_sm ./data.json --label AC --patterns ./patterns.jsonl

I get all tokens highlighted, which should not be the case, right?

My set-up is the following:

  • Prodigy 1.9
  • spaCy 2.2.3
  • Python 3.7
  • OSX

Any idea what is happening?

Thanks in advance, no rush of course.

Cheers!
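One thing that may be worth checking here, offered as an assumption and not as something confirmed in this thread: spaCy's documentation spells the operator key in uppercase (REGEX), while the pattern above uses lowercase "regex". If a lowercase key is not recognised, that would be consistent with the earlier description of a token pattern whose properties "aren't interpreted" and which therefore behaves like a wildcard. The small helper below is hypothetical and uses only the standard library. It flags operator keys that look like known operators but are not uppercase.

```python
import json

# Operator keys as spelled in spaCy's rule-based matching docs (a subset).
KNOWN_OPERATORS = {"REGEX", "IN", "NOT_IN"}

def find_suspect_keys(jsonl_line):
    """Return operator keys that match a known operator only after uppercasing."""
    entry = json.loads(jsonl_line)
    token_patterns = entry.get("pattern")
    if not isinstance(token_patterns, list):
        return []  # plain string patterns have no token dicts to inspect
    suspects = []
    for token in token_patterns:
        for value in token.values():
            if isinstance(value, dict):
                for key in value:
                    if key not in KNOWN_OPERATORS and key.upper() in KNOWN_OPERATORS:
                        suspects.append(key)
    return suspects

original = '{"label":"AC","pattern":[{"text":{"regex":"^abc$"}}]}'
uppercased = '{"label":"AC","pattern":[{"TEXT":{"REGEX":"^abc$"}}]}'

print(find_suspect_keys(original))
print(find_suspect_keys(uppercased))
```

If the first call reports the lowercase key, trying the uppercase variant of the pattern would be a cheap experiment. Whether that resolves the behaviour with spaCy 2.2.3 and Prodigy 1.9 is not something this sketch can establish.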

Hello @ines,

I'd still be curious to hear your thoughts on the issue above if you have any time for that. Am I making a mistake somewhere?

Also, I'm very excited about the new features that the team has prepared for v1.10. Thanks for making such a cool tool, it's crazy how much it can boost productivity.

Best,

Cyril