A question about regular expressions


I am using regular expressions in the pattern file to match some strings in Prodigy. Some of them work well, but others do not work.

For example, the following pattern does not match the string "4 cm":

{"label": "magnitude", "pattern":[{"TEXT":{"REGEX": "\d{1,4}\s?(mm|cm|ml|cc)"}}]}

Could you please give me some suggestions?


Hi! The problem here is that your pattern describes one token that matches your regular expression – however, this will never be true, since the string "4 cm" will be split into two tokens, ["4", "cm"].
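
To make the difference concrete, here is a minimal sketch using only Python's re module. The token list is hard-coded to the split described above rather than produced by a real tokenizer, and matches_single_token is a hypothetical helper name used only for illustration.

```python
import re

# The expression from the question, written as a Python raw string.
EXPRESSION = r"\d{1,4}\s?(mm|cm|ml|cc)"


def matches_single_token(expression, tokens):
    """Return the tokens that match the expression on their own."""
    return [token for token in tokens if re.search(expression, token)]


text = "4 cm"
tokens = ["4", "cm"]  # assumption: hard-coded split, not real tokenizer output

# Applied to the whole string, the expression finds a match.
whole_text_match = re.search(EXPRESSION, text) is not None

# Applied token by token, which is what a one-token pattern does, nothing matches.
token_matches = matches_single_token(EXPRESSION, tokens)

print(whole_text_match)  # True
print(token_matches)     # []
```

Neither "4" nor "cm" satisfies the full expression by itself, which is why the one-token pattern never fires.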

One option would be to rewrite your pattern to reflect the two tokens you're looking for, for instance:

[{"TEXT": {"REGEX": "\d{1,4}"}}, {"TEXT": {"IN": ["mm", "cm", "ml", "cc"]}}]

You could also use other token attributes here, e.g. spaCy's LIKE_NUM, which would return True for tokens resembling a number, so your pattern would match "4 cm" or "seven mm".
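
If it helps, the two-token patterns can also be generated from Python instead of being typed by hand. The sketch below is only one way to do it: make_magnitude_patterns is a hypothetical helper, the "magnitude" label is taken from the question, and the ^ and $ anchors are an addition so that the whole token has to consist of digits (as far as I know, spaCy's REGEX operator searches within the token text rather than matching all of it). json.dumps also takes care of escaping the backslash, which JSON requires to be written as \\.

```python
import json

UNITS = ["mm", "cm", "ml", "cc"]


def make_magnitude_patterns(label="magnitude"):
    """Build two patterns-file entries: number-like + unit, and digits + unit."""
    number_like = {"LIKE_NUM": True}                  # "4", "seven", ...
    digits_only = {"TEXT": {"REGEX": r"^\d{1,4}$"}}   # anchored: whole token is 1-4 digits
    unit = {"TEXT": {"IN": UNITS}}
    return [
        {"label": label, "pattern": [number_like, unit]},
        {"label": label, "pattern": [digits_only, unit]},
    ]


# One JSON object per line, as in the pattern file from the question.
lines = [json.dumps(entry) for entry in make_magnitude_patterns()]
print("\n".join(lines))
```

Each printed line can be pasted into the pattern file as one entry.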

If you want to match regular expressions over your whole text, another approach is to write a custom recipe with your own function that sets the "spans" on the incoming examples based on the matches produced by your regular expressions. In this case, you'd just need to make sure that your matches don't overlap and that they refer to valid token boundaries (if your goal is to train a named entity recognizer). One way to check the boundaries is spaCy's Doc.char_span, which returns None if the character offsets don't map to token boundaries.

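import re

# YOUR_EXPRESSION and YOUR_LABEL are placeholders: define them before using this.
def match_regex_pattern(stream):
    """Add regex matches on the full text to each example as "spans"."""
    for eg in stream:
        spans = []
        for match in re.finditer(YOUR_EXPRESSION, eg["text"]):
            start, end = match.span()  # character offsets into eg["text"]
            spans.append({"start": start, "end": end, "label": YOUR_LABEL})
        # TODO if relevant: filter overlaps and check if spans map to valid token boundaries
        eg["spans"] = spans
        yield eg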
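
For the TODO above, here is one possible sketch in plain Python. It assumes each example has already been tokenized and carries a "tokens" list whose entries store character offsets under "start" and "end"; that shape is an assumption here, and filter_spans is a hypothetical helper, not part of Prodigy or spaCy. If you have a spaCy Doc at hand, checking whether Doc.char_span returns None gives you the same boundary test.

```python
def filter_spans(spans, tokens):
    """Keep spans that align with token boundaries and do not overlap.

    Longer spans win over shorter ones; ties go to the earlier span.
    """
    token_starts = {token["start"] for token in tokens}
    token_ends = {token["end"] for token in tokens}
    # Drop spans that start or end in the middle of a token.
    aligned = [
        span for span in spans
        if span["start"] in token_starts and span["end"] in token_ends
    ]
    # Visit longer spans first, then earlier ones.
    aligned.sort(key=lambda s: (s["start"] - s["end"], s["start"]))
    kept = []
    for span in aligned:
        overlaps = any(
            span["start"] < other["end"] and other["start"] < span["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(span)
    # Return the surviving spans in text order.
    return sorted(kept, key=lambda s: s["start"])


# Assumed token format for the text "4 cm wide".
tokens = [
    {"text": "4", "start": 0, "end": 1},
    {"text": "cm", "start": 2, "end": 4},
    {"text": "wide", "start": 5, "end": 9},
]
spans = [
    {"start": 0, "end": 4, "label": "magnitude"},  # "4 cm": aligned, kept
    {"start": 2, "end": 4, "label": "magnitude"},  # "cm": overlaps the longer span
    {"start": 0, "end": 3, "label": "magnitude"},  # "4 c": ends inside a token
]
print(filter_spans(spans, tokens))
# [{'start': 0, 'end': 4, 'label': 'magnitude'}]
```

A single call to re.finditer never yields overlapping matches, so the overlap step mainly matters once you combine the matches of several expressions.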