spaCy’s rule-based Matcher now supports a NOT_IN attribute – so instead of your regex, you could do something like "TEXT": {"NOT_IN": ["no", "not"]}. Or even "LOWER" instead of "TEXT", to make it case-insensitive.
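Here's a minimal sketch of what that could look like as a full pattern. It assumes spaCy v3 (where matcher.add takes a list of patterns) and uses a blank English pipeline so no model download is needed. The second token ("good"), the example sentence and the pattern name are all made up for illustration, so swap in whatever your regex was actually looking for.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: any token whose lowercase form is NOT "no" or "not",
# followed by the (made-up) target word "good".
pattern = [
    {"LOWER": {"NOT_IN": ["no", "not"]}},
    {"LOWER": "good"},
]
matcher.add("NOT_NEGATED_GOOD", [pattern])

doc = nlp("This is very good but that is not good")

# Each match is a (match_id, start_token, end_token) triple.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

Note that the first token in the pattern still has to exist, so a sentence that starts with "good" wouldn't match here.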
If you do end up finding that you need more complex regular expressions and match logic, you could also consider implementing your own regex matcher that extracts spans from your text and presents them for annotation (see the post I linked here). So basically, don’t use spaCy’s token-based Matcher and use re.finditer instead. The code should be pretty straightforward, because all you need are the start and end character offsets of each match – and you’ll be able to get those easily from your regex matches.
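As a rough sketch of the regex route: the function name, the phone-number pattern, the label and the dict layout with "start", "end" and "label" keys are all assumptions for illustration, so adjust them to whatever your annotation setup expects.

```python
import re


def regex_spans(text, pattern, label):
    """Return one dict per regex match with its character offsets."""
    spans = []
    for match in re.finditer(pattern, text):
        # match.span() gives the (start, end) character offsets of the match.
        start, end = match.span()
        spans.append({"start": start, "end": end, "label": label})
    return spans


# Hypothetical example text and pattern.
text = "Call 555-1234 or 555-9876 today"
spans = regex_spans(text, r"\d{3}-\d{4}", "PHONE")

for span in spans:
    print(span, text[span["start"]:span["end"]])
```

Because these are character offsets, they don't depend on tokenization at all. If you need token-aligned spans later, you'd have to check that the offsets line up with token boundaries.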