"Negative" pattern matching (RegEx)

jsnleong · July 10, 2019, 2:30am

Hi,

I am trying to work out how to construct my pattern file. My intention is this: I would like to avoid matching a specific pattern. Can this be done?

{"label": "CONDITION", "pattern": [{"TEXT": {"REGEX": "^(?!no)"}}, {"LOWER": "back"}, {"LOWER": "pain"}]}

What I am essentially trying to achieve is:
Match spans of text that correspond to "XXXXX back pain", where XXXXX is NOT the word "no". So, in essence, XXXXX could be anything else, like "has", "got", "have", etc.

I would like Prodigy to filter these spans to present only such patterns. Is this form of "negation" regex possible in Prodigy?

Thanks!

ines · July 10, 2019, 8:35am

spaCy’s rule-based Matcher now supports a NOT_IN attribute – so instead of your regex, you could do something like "TEXT": {"NOT_IN": ["no", "not"]}. Or even "LOWER" instead of "TEXT", to make it case-insensitive.
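As a rough sketch of what that could look like in a pattern file: the snippet below builds one entry with "NOT_IN" on the first token and serializes it as a single JSONL line. The label and the "back"/"pain" tokens are taken from the question; the excluded word list is only an example, and the variable names are made up for illustration.

```python
import json

# One pattern-file entry: a label plus a list of token descriptions.
# The first token may be anything except "no" or "not" (compared in
# lowercase because the attribute is LOWER rather than TEXT).
entry = {
    "label": "CONDITION",
    "pattern": [
        {"LOWER": {"NOT_IN": ["no", "not"]}},
        {"LOWER": "back"},
        {"LOWER": "pain"},
    ],
}

# Pattern files are JSONL, so each entry goes on its own line.
line = json.dumps(entry)
print(line)
```

Note that the first token description still has to match a real token, so "back pain" at the very start of a text would not be matched by this pattern.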

If you do end up finding that you need more complex regular expressions and match logic, you could also consider implementing your own regex matcher that extracts spans from your text and presents them for annotation (see the post I linked here). So basically, don’t use spaCy’s token-based Matcher and use re.finditer instead. The code should be pretty straightforward, because all you need are the start and end character offsets of the match – and you’ll be able to get that easily from your regex matches.
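A minimal sketch of the re.finditer approach, under a few assumptions: the regex itself is only an example (any word other than "no" or "not", followed by "back pain"), the function names are hypothetical, and the output uses the usual "text" plus "spans" layout with "start", "end" and "label" keys.

```python
import re

# Assumed example regex: one word that is not exactly "no" or "not",
# followed by "back pain". The \b inside the lookahead keeps words that
# merely start with "no" (e.g. "notable") from being excluded.
PATTERN = re.compile(r"\b(?!(?:no|not)\b)\w+\s+back\s+pain\b", re.IGNORECASE)


def find_spans(text, label="CONDITION"):
    """Return one span dict with character offsets per regex match."""
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in PATTERN.finditer(text)
    ]


def make_task(text):
    """Wrap a text and its matched spans in a single task dict."""
    return {"text": text, "spans": find_spans(text)}


print(make_task("She has back pain"))
print(make_task("He reports no back pain"))
```

Because the offsets come straight from the match objects, no tokenization is involved here; if the spans need to line up with token boundaries later, that would have to be checked separately.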

mitramir55 · November 5, 2021, 8:52am

Hi @ines,

On a similar topic, I'm trying to find truncated passives: passive verbs that don't have a "by" agent after them (as in "the painting was drawn", and not "the painting was drawn by her").

So I created the following rule:

    passive_rule_1 = [
        {"POS":"AUX", "DEP": "auxpass", "OP":"+"},
        {"POS":"VERB", "TAG":"VBN"},
        {"LEMMA":{"NOT_IN" : ['by']}},
    ]

I can say that this works and detects spans of passive voice; however, it always requires some text to be after the verb (the second pattern). For example, if I process this sentence: "He was pushed.", the matcher will find a match; however, if I give it "He was pushed" (without a dot at the end), the matcher won't find anything, which is incorrect. I believe the matcher should realize if a rule has NOT_IN:... or OP: ! in its last pattern and include matches up until the last pattern.

I know this might sound trivial, but it actually affects the accuracy of matches we make while processing corpora.

Thanks.
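One possible workaround, offered as an assumption rather than something confirmed in this thread: drop the third token description from the rule, match only the auxiliary plus participle, and then check the following token in plain Python. The sketch below shows just that filtering step. It uses a list of lemma strings and (start, end) token index pairs as stand-ins for a spaCy Doc and the Matcher's results (which also carry a match ID), and the function name is hypothetical.

```python
def drop_followed_by(lemmas, matches, blocked=("by",)):
    """Keep token spans whose next token is missing or not in `blocked`."""
    kept = []
    for start, end in matches:
        # `end` is exclusive, so lemmas[end] is the token right after the span.
        has_next = end < len(lemmas)
        if has_next and lemmas[end].lower() in blocked:
            continue
        kept.append((start, end))
    return kept


# "He was pushed" with no trailing token: the span is kept.
print(drop_followed_by(["he", "be", "push"], [(1, 3)]))
# "The painting was drawn by her": the span is dropped.
print(drop_followed_by(["the", "painting", "be", "draw", "by", "she"], [(2, 4)]))
```

With this split, a span at the very end of the text is kept, because a missing next token is treated as "not by".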
