Detecting Quotes using REGEX in Patterns File for NER

Hi Ines,

Is it possible to use REGEX in a patterns file to detect quotes in a text that will be used with prodigy ner.manual?

Thank you.

Hi!

You can definitely use REGEX patterns with ner.manual. For instance, the following pattern will highlight all London and Londen tokens:

{"pattern": [{"TEXT": {"REGEX": "Lond[eo]n"}}], "label": "CITY"}

I'm not sure from your post what exactly you want to accomplish, but there are a few caveats to take into consideration. The approach used above applies "token matching", which means that each {"TEXT":...} part in the pattern should match exactly one token. Quotes are often split off from the neighbouring tokens, so you'd have to specify them separately. Let's say you want to recognize 'Londen' and 'London', then you'd need to spell those three tokens out:

{"pattern": [{"TEXT": "'"}, {"TEXT": {"REGEX": "Lond[eo]n"}}, {"TEXT": "'"}], "label": "CITY"}

If you want to do string-based matching, you don't have to worry about tokenization, but you won't be able to use the REGEX operator then.

{"pattern": "'London'", "label": "CITY"}

In this case you'd need to spell out each potential literal variation, which is probably not what you want to do.

You can find more information on match patterns in Prodigy here: https://prodi.gy/docs/api-loaders#input-patterns

Hi SofieVL,

Thank you for your response.

I tried to use NER to annotate direct quotes in a document with the QUOTE label using pattern.jsonl. I created the rule like this:

{"label": "QUOTE", "pattern": [{"text": {"REGEX": ""(.*?)""}}]}

The problem is that when I run prodigy ner.manual with the patterns file, the direct quotes in my document are not automatically labelled as QUOTE in the application interface.

Or maybe there is a problem with my regex because it uses the quotation mark (")?

Thank you.

If I understand your regex correctly, you're trying to directly match things like "London". In my first post, I tried explaining that this often won't work, because "London" is typically split into 3 tokens by most tokenizers, with the quotes split off into separate tokens. With Prodigy's token match patterns, you need to specify a match criterion for each token separately.
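To make that concrete, here is a rough sketch in plain Python. The `toy_tokenize` helper is a hypothetical stand-in for a real tokenizer (it only separates words from punctuation, which is much simpler than what spaCy does), but it shows the effect: no single token ever contains both the opening and the closing quote, so a whole-quote regex tested token by token finds nothing.

```python
import re

def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: runs of word characters
    # stay together, every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

text = 'She said "New York" twice.'
tokens = toy_tokenize(text)

# A whole-quote regex, tested against each token on its own
quote_re = re.compile(r'"(.*?)"')
matching_tokens = [t for t in tokens if quote_re.search(t)]

# The same regex applied to the raw string does find the quoted text
raw_matches = quote_re.findall(text)

print(tokens)
print(matching_tokens)
print(raw_matches)
```

The regex itself is fine on the raw string; it just never gets to see more than one token at a time in a token-based pattern.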

For your use-case, this would become something like:

{"pattern": [{"TEXT": "\"", "SPACY": false}, {"TEXT": {"REGEX": "[^\"]"}, "OP": "+"}, {"TEXT": "\""}], "label": "QUOTE"}

The first part,

{"TEXT": "\"", "SPACY": false}

specifies your quotation mark and says that there shouldn't be a space after it, just to avoid some obvious false positives.

The second part

{"TEXT": {"REGEX": "[^\"]"}, "OP": "+"}

matches any token that is not itself a quotation mark. And the OP part says that there can be one or more of such tokens. Otherwise, you would only match single-token words in quotes like "London" and not "New York".

The last part

{"TEXT": "\""}

just says that there's another quote token at the end of your entity.
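One detail about the middle part is worth checking: as far as I know, spaCy evaluates REGEX with search semantics, meaning the expression may match anywhere inside the token text. Under that assumption, [^\"] accepts any token that contains at least one character other than a quotation mark. The small sketch below uses Python's re module on made-up token strings to compare it with an anchored variant; the anchored expression is an assumption added for illustration and is not part of the pattern above.

```python
import re

loose = re.compile(r'[^"]')      # the expression used in the pattern
strict = re.compile(r'^[^"]+$')  # anchored variant (assumption, for comparison only)

# Made-up token texts: a plain word, a lone quote, a word with an embedded quote
candidates = ['New', '"', 'say"ing']

loose_hits = [t for t in candidates if loose.search(t)]
strict_hits = [t for t in candidates if strict.search(t)]

print(loose_hits)
print(strict_hits)
```

Since quotation marks are normally split off into their own tokens, both variants should behave the same on typical text; the anchored form only matters if a token can contain a quote in the middle.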

Notice that in the .jsonl files, you'll need to use \" to represent the double quotation mark.
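If escaping by hand gets error-prone, one option is to build the pattern as a Python dict and let the standard json module produce the line. This is a minimal sketch, not something Prodigy requires; it only shows that the escaping comes out the same as in the pattern above.

```python
import json

# Same three-part pattern as above, written as a Python dict
pattern = {
    "pattern": [
        {"TEXT": '"', "SPACY": False},
        {"TEXT": {"REGEX": '[^"]'}, "OP": "+"},
        {"TEXT": '"'},
    ],
    "label": "QUOTE",
}

# json.dumps escapes the double quotes as \" and writes False as false
line = json.dumps(pattern)
print(line)
```

Each printed line can then be written to the patterns file, one JSON object per line.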

For more details & background docs on token-based matching, see also here: https://spacy.io/usage/rule-based-matching#matcher

Hi SofieVL,

Thank you for your suggestions. I will try it and hope that it will solve my problem.