A question about regular expressions

Hi

I am using regular expressions in a patterns file to match some strings in Prodigy. Some of the regular expressions work well, but others do not.

For example, the following pattern does not match the string "4 cm":

{"label": "magnitude", "pattern":[{"TEXT":{"REGEX": "\d{1,4}\s?(mm|cm|ml|cc)"}}]}

Could you please give me some suggestions?

Thanks.

Hi! The problem here is that your pattern describes one token that matches your regular expression – however, this will never be true, since the string "4 cm" will be split into two tokens, ["4", "cm"].
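
To make the difference concrete, here is a minimal sketch using only Python's re module. The hard-coded token list is a stand-in for the tokenizer output mentioned above (it is not produced by spaCy here), so treat it as an illustration rather than the matcher's actual implementation.

```python
import re

PATTERN = r"\d{1,4}\s?(mm|cm|ml|cc)"

text = "4 cm"
# Stand-in for the tokenizer output described above (assumption, not a spaCy call).
tokens = ["4", "cm"]

# A token pattern is tested against one token's text at a time,
# so neither "4" nor "cm" satisfies the full expression on its own.
token_hits = [tok for tok in tokens if re.search(PATTERN, tok)]

# Applied to the whole string, the same expression does match.
text_hit = re.search(PATTERN, text)

print(token_hits)
print(text_hit.group(0))
```

The expression only succeeds when it sees the number and the unit together, which never happens at the level of a single token.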

One option would be to rewrite your pattern to reflect the two tokens you're looking for, for instance:

[{"TEXT": {"REGEX": "\d{1,4}"}}, {"TEXT": {"IN": ["mm", "cm", "ml", "cc"]}}]

You could also use other token attributes here, e.g. spaCy's LIKE_NUM, which would return True for tokens resembling a number, so your pattern would match "4 cm" or "seven mm".
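
If it helps, here is a rough sketch of how the two variants could be generated as lines for a patterns file. Building them in Python and serializing with json.dumps takes care of escaping the backslash in the regular expression. The "magnitude" label comes from the question; the ^...$ anchors and the use of LOWER in the second pattern are my own additions, on the assumption that you want the whole token to be digits and the units matched case-insensitively.

```python
import json

units = ["mm", "cm", "ml", "cc"]

patterns = [
    # Variant 1: a token of 1-4 digits followed by a unit token.
    # The anchors (an addition, not in the original pattern) require
    # the entire token to consist of digits.
    {
        "label": "magnitude",
        "pattern": [{"TEXT": {"REGEX": r"^\d{1,4}$"}}, {"TEXT": {"IN": units}}],
    },
    # Variant 2: any number-like token ("4", "seven") followed by a unit.
    {
        "label": "magnitude",
        "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": units}}],
    },
]

# One JSON object per line, as expected in a .jsonl file.
lines = [json.dumps(p) for p in patterns]
print("\n".join(lines))
```

Note that the serialized regular expression contains a doubled backslash, which is what JSON requires for a literal backslash.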

If you want to match regular expressions over your whole text, another approach is to use your own function to set the "spans" on the incoming examples in a custom recipe based on the matches produced by your regular expressions. In this case, you'd just need to make sure that your matches don't contain overlaps and that they refer to valid token boundaries, e.g. using spaCy's Doc.char_span (if your goal is to train a named entity recognizer).

import re

def match_regex_pattern(stream):
    for eg in stream:
        spans = []
        # Search the full text of each example, not individual tokens
        for match in re.finditer(YOUR_EXPRESSION, eg["text"]):
            start, end = match.span()
            spans.append({"start": start, "end": end, "label": YOUR_LABEL})
        # TODO if relevant: filter overlaps and check if spans map to valid token boundaries
        eg["spans"] = spans
        yield eg
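
For the TODO in the function above, one possible approach is sketched below in plain Python. The helper name filter_spans is hypothetical, and the sketch assumes each example carries a list of tokens with "start" and "end" character offsets. It keeps only spans that begin and end on token boundaries and, where spans overlap, prefers the longer one.

```python
def filter_spans(spans, tokens):
    """Keep spans that align with token boundaries and don't overlap.

    spans:  dicts with "start"/"end" character offsets.
    tokens: dicts with "start"/"end" character offsets (assumed format).
    """
    token_starts = {t["start"] for t in tokens}
    token_ends = {t["end"] for t in tokens}
    aligned = [
        s for s in spans
        if s["start"] in token_starts and s["end"] in token_ends
    ]
    # Longest spans first, then earliest, so longer matches win overlaps.
    aligned.sort(key=lambda s: (s["start"] - s["end"], s["start"]))
    kept = []
    for span in aligned:
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


tokens = [
    {"text": "4", "start": 0, "end": 1},
    {"text": "cm", "start": 2, "end": 4},
    {"text": "wide", "start": 5, "end": 9},
]
spans = [
    {"start": 0, "end": 4, "label": "magnitude"},  # "4 cm", aligned
    {"start": 2, "end": 4, "label": "magnitude"},  # "cm", overlaps the span above
    {"start": 0, "end": 3, "label": "magnitude"},  # "4 c", ends mid-token
]
result = filter_spans(spans, tokens)
print(result)
```

Preferring the longest span is just one policy; depending on your labels you may want to resolve overlaps differently.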

I would also like to use the ner.manual recipe and pre-select several tokens. However, I don't know exactly how to integrate the pre-selection into the recipe.

https://github.com/explosion/prodigy-recipes/blob/master/ner/ner_manual.py

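    import re

    def match_regex_pattern(stream, nlp):
        for eg in stream:
            # eg is a dict, so tokenize its text to get a Doc for char_span
            doc = nlp.make_doc(eg["text"])
            spans = []
            for match in re.finditer(YOUR_EXPRESSION, eg["text"]):
                start, end = match.span()
                # char_span returns None if the match doesn't map to token boundaries
                span = doc.char_span(start, end)
                if span is not None:
                    spans.append({"start": start, "end": end, "label": "Test"})
            eg["spans"] = spans
            yield eg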

How can I insert the code snippet so that I get certain "spans" highlighted beforehand?

One way to make pre-selection easier is to do all the preprocessing upfront, so that your .jsonl file contains only the items of interest.
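
As a rough sketch of what that upfront preprocessing could look like, the snippet below runs a regular expression over raw texts, keeps only the texts with at least one match, and produces JSONL lines that already carry "spans". The expression, the label, the helper name add_regex_spans and the example sentences are placeholders. It also assumes that the annotation interface pre-highlights spans already present in the input, which is worth verifying for your setup.

```python
import json
import re

PATTERN = re.compile(r"\d{1,4}\s?(?:mm|cm|ml|cc)")  # placeholder expression
LABEL = "magnitude"  # placeholder label


def add_regex_spans(texts):
    """Yield examples with precomputed spans, skipping texts without matches."""
    for text in texts:
        spans = [
            {"start": m.start(), "end": m.end(), "label": LABEL}
            for m in PATTERN.finditer(text)
        ]
        if spans:  # keep only the items of interest
            yield {"text": text, "spans": spans}


texts = ["The lesion is 4 cm wide.", "No measurement here."]
jsonl_lines = [json.dumps(eg) for eg in add_regex_spans(texts)]
print("\n".join(jsonl_lines))
```

The character offsets still need to line up with token boundaries once the text is tokenized, so a check like Doc.char_span remains useful before annotating.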

That said, if you want items to be highlighted, then using the --patterns flag in ner.manual is your best bet. You can pass it a patterns file, as described in the documentation, which should pre-highlight the items of interest.


Hi! What if I want to use regexes only to suggest spans, but I want to label them myself? Right now, patterns must contain a label, and it is not very suitable for me. Thank you.

Hi Maria.

Just so I understand your problem a bit better: are you interested in highlighting a substring and then having a Prodigy interface attach a label to it? If so, you might want to use a classification interface. I wouldn't mind making a demo for that, but I'd like to confirm that I've understood the problem correctly before I do. Could you share the regex that you had in mind? Also, what task are you hoping to solve?