Thanks for the analysis – this is consistent with what I suspected above: since the pattern matches also receive a score, they're filtered out if they're not considered "relevant" enough. That behaviour makes sense if there's a lot of incoming data – but not so much if you're starting from scratch. So that's definitely something we want to optimise and provide more settings for.
No, this thread describes a solution for an old version of Prodigy that didn't yet support the full highlighting for textcat recipes and only highlighted the terms for ner. As I mention in my comment here, this update was shipped in v1.4.0.
If you just want to find matches in your data to pre-train the model, I would suggest repurposing the ner.match recipe, which does exactly that: it takes patterns, finds the matches and asks you for feedback.
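For example, a call could look something like this – the dataset, model and file names are placeholders, and the patterns file uses the usual JSONL match patterns format:

prodigy ner.match neg_matches en_core_web_sm your_data.jsonl --patterns neg_patterns.jsonl

An entry in neg_patterns.jsonl would then look like {"label": "NEG", "pattern": [{"lower": "terrible"}]}.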
Sorry if my description was unclear. I meant editing the data you collect afterwards to add a "label", so you can use the data in textcat.batch-train. For example, once you're done with ner.match, you can export the data:
prodigy db-out your_match_dataset > data.jsonl
Then run a quick search and replace to add "label": "NEG" to each entry in the JSONL, and create a new dataset for the converted annotations.
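If you'd rather script the conversion than edit the file by hand, a minimal Python sketch could look like this (the file names are placeholders):

import json

# Add a "label" to every annotation exported from ner.match, so the
# data can be used to train a text classifier.
with open("data.jsonl", encoding="utf8") as f_in, \
        open("data_converted.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        task = json.loads(line)
        task["label"] = "NEG"  # the category label for textcat
        f_out.write(json.dumps(task) + "\n")

You can then pre-train your text classification model from that: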
prodigy dataset textcat_match_dataset "Converted dataset with added labels"
prodigy db-in textcat_match_dataset data_converted.jsonl
prodigy textcat.batch-train textcat_match_dataset ... # etc
Once you have a model that's learned a bit more about your "NEG" label, you can load it into textcat.teach and start improving the model, without the immediate need to use patterns for bootstrapping.
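For instance, assuming you've saved the pre-trained model from textcat.batch-train to a directory like /tmp/textcat-model (the paths and dataset name here are placeholders):

prodigy textcat.teach textcat_final /tmp/textcat-model your_data.jsonl --label NEG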