Bootstrapping terms with pattern file

Thanks for sharing more details – I think I understand it now. Under the hood, Prodigy uses spaCy's Matcher, so if the matcher doesn't match, Prodigy won't produce a match either.

spaCy's rule-based Matcher works on tokens, not the entire text. Each dictionary in the patterns represents one token. So the regex in your first pattern entry will look for one token that matches (?<=after).*$, which correctly matches the token with the text "after". If you're looking for a token "after" and one or more tokens following it, you could for instance represent it like this one the token level:

[{"TEXT": "after"}, {"OP": "+"}]

If you do want to match over the whole text using one regular expression, you could also just write your own little regex matcher that matches on the incoming text and yields examples with the matched "spans". See my comment here for an example:

1 Like