Bootstrapping terms with pattern file

ines · July 8, 2019, 7:58am

Thanks for sharing more details – I think I understand it now. Under the hood, Prodigy uses spaCy's Matcher, so if the matcher doesn't match, Prodigy won't produce a match either.

spaCy's rule-based Matcher works on tokens, not the entire text. Each dictionary in the patterns represents one token. So the regex in your first pattern entry will look for one token that matches (?<=after).*$, which correctly matches the token with the text "after". If you're looking for a token "after" followed by one or more tokens, you could, for instance, represent it like this at the token level:

[{"TEXT": "after"}, {"OP": "+"}]
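
To make the difference concrete, here is a minimal sketch that runs both patterns through spaCy's Matcher directly. It assumes spaCy v3 (where matcher.add takes a list of patterns) and a blank English pipeline; the sample sentence and the match keys are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("We left after the show ended")  # hypothetical sample text

# Pattern 1: a single token whose text matches the regex.
# The regex is applied to each token's text, never to the whole sentence.
regex_matcher = Matcher(nlp.vocab)
regex_matcher.add("REGEX_TOKEN", [[{"TEXT": {"REGEX": r"(?<=after).*$"}}]])
regex_hits = [doc[start:end].text for _, start, end in regex_matcher(doc)]

# Pattern 2: the token "after" followed by one or more tokens of any kind.
token_matcher = Matcher(nlp.vocab)
token_matcher.add("AFTER_PLUS", [[{"TEXT": "after"}, {"OP": "+"}]])
spans = [doc[start:end] for _, start, end in token_matcher(doc)]

# "+" produces one match per possible length, so pick the longest span.
longest = max(spans, key=len)

print(regex_hits)
print(longest.text)
```

The first pattern can only ever return single tokens, while the second returns several overlapping spans that all start at "after", so you would typically keep just the longest one.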

If you do want to match over the whole text using one regular expression, you could also just write your own little regex matcher that matches on the incoming text and yields examples with the matched "spans". See my comment here for an example.
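
Separately from the linked comment, the general idea could be sketched roughly like this. It's plain Python using only the re module; the function name regex_span_stream, the label "AFTER" and the sample texts are hypothetical, and the output assumes the usual task layout of a "text" plus a list of "spans" with character offsets and a label.

```python
import re


def regex_span_stream(texts, expression, label):
    """Yield one example per text that has at least one regex match."""
    pattern = re.compile(expression)
    for text in texts:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in pattern.finditer(text)
            if m.end() > m.start()  # skip zero-length matches
        ]
        if spans:
            yield {"text": text, "spans": spans}


# Hypothetical input texts, just for illustration.
texts = ["We left after the show ended", "Nothing to see here"]
examples = list(regex_span_stream(texts, r"(?<=after).*$", "AFTER"))

for eg in examples:
    for span in eg["spans"]:
        print(repr(eg["text"][span["start"]:span["end"]]), span["label"])
```

Note that with this particular expression the matched span starts right after "after", so it includes the leading space. If the spans need to line up with token boundaries, you'd probably want to adjust the expression or trim the offsets.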
