"Negative" pattern matching (RegEx)

Hi,

I am currently working on my patterns file and have run into a problem. I would like to avoid matching a specific pattern. Can this be done?

{"label": "CONDITION", "pattern": [{"TEXT": {"REGEX": "^(?!no)"}}, {"LOWER": "back"}, {"LOWER": "pain"}]}

What I am essentially trying to achieve is:
Match spans of text that correspond to "XXXXX back pain", where XXXXX is NOT the word "no". So, in essence, XXXXX could be anything like "has", "got", "have", etc.

I would like Prodigy to filter these spans to present only such patterns. Is this form of "negation" regex possible in Prodigy?

Thanks!

spaCy’s rule-based Matcher now supports a NOT_IN attribute – so instead of your regex, you could do something like "TEXT": {"NOT_IN": ["no", "not"]}. Or even "LOWER" instead of "TEXT", to make it case-insensitive.
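
To illustrate, here is a minimal sketch of that pattern run directly through the Matcher. It assumes spaCy v3 (for the `matcher.add` signature) and uses a blank English pipeline, since `LOWER` only needs the tokenizer. The example sentence is made up.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, which is enough for LOWER.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# First token: anything except "no" / "not" (case-insensitive via LOWER).
pattern = [
    {"LOWER": {"NOT_IN": ["no", "not"]}},
    {"LOWER": "back"},
    {"LOWER": "pain"},
]
matcher.add("CONDITION", [pattern])  # assumes the spaCy v3 add() signature

doc = nlp("She has back pain but he has no back pain")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Only the "has back pain" span should come back here; the "no back pain" span is skipped because its first token is in the `NOT_IN` list.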

If you do end up finding that you need more complex regular expressions and match logic, you could also consider implementing your own regex matcher that extracts spans from your text and presents them for annotation (see the post I linked here). So basically, don’t use spaCy’s token-based Matcher and use re.finditer instead. The code should be pretty straightforward, because all you need are the start and end character offsets of the match – and you’ll be able to get those easily from your regex matches.
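
A rough sketch of what the re.finditer approach could look like is below. The function name `find_spans`, the regex, and the example text are only illustrations, not part of Prodigy. The returned dicts follow the usual `"start"` / `"end"` / `"label"` span layout, and how you feed them into a recipe or stream is left out here.

```python
import re

# Any word except "no" / "not", followed by "back pain" (case-insensitive).
# The negative lookahead rejects only the whole words, not e.g. "nobody".
NEGATED_BACK_PAIN = re.compile(
    r"\b(?!(?:no|not)\b)\w+\s+back\s+pain\b", re.IGNORECASE
)

def find_spans(text, label="CONDITION"):
    """Return character-offset spans for every regex match in the text."""
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in NEGATED_BACK_PAIN.finditer(text)
    ]

example_text = "She has back pain. He reports no back pain. They Got Back Pain."
spans = find_spans(example_text)
for span in spans:
    print(span, example_text[span["start"]:span["end"]])
```

Because the offsets are plain character positions, this works on the raw text and doesn't depend on how spaCy tokenizes it.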

Hi @ines,

On a similar topic, I'm trying to find truncated passives: passive verbs that don't have a "by"/agent phrase (as in "the painting was drawn", and not "the painting was drawn by her").

So I created the following rule:

    passive_rule_1 = [
        {"POS":"AUX", "DEP": "auxpass", "OP":"+"},
        {"POS":"VERB", "TAG":"VBN"},
        {"LEMMA":{"NOT_IN" : ['by']}},
    ]

This works and detects spans of passive voice; however, it always requires some token to come after the verb (the second token pattern). For example, if I process the sentence "He was pushed.", the matcher finds a match; however, if I give it "He was pushed" (without the period at the end), the matcher doesn't find anything, which is incorrect. I believe the matcher should recognize when a rule has NOT_IN or OP: ! in its last token pattern and still include matches up to that last pattern.

I know this might sound trivial, but it actually affects the accuracy of matches we make while processing corpora.

Thanks.