Textcat.teach not using the pattern file

Thanks for the analysis – this is consistent with what I suspected above: since the pattern matches also receive a score, they're filtered out if they're not considered "relevant" enough. That behaviour makes sense if there's a lot of incoming data – but not so much if you're starting from scratch. So that's definitely something we want to optimise and provide more settings for.

No, this thread describes a solution for an old version of Prodigy that didn't yet support the full highlighting for textcat recipes and only highlighted the terms for ner. As I mention in my comment here, this update was shipped in v1.4.0.

If you just want to find matches in your data to pre-train the model, I would suggest repurposing the ner.match recipe, which does exactly that: it takes your patterns, finds the matches and asks you for feedback.
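For reference, ner.match reads its patterns from a JSONL file with one pattern per line, either as a list of token attributes or as an exact string. The "NEG" label and the example terms below are purely illustrative:

```jsonl
{"label": "NEG", "pattern": [{"lower": "terrible"}]}
{"label": "NEG", "pattern": [{"lower": "waste"}, {"lower": "of"}, {"lower": "money"}]}
{"label": "NEG", "pattern": "really disappointed"}
```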

Sorry if my description was unclear. I meant editing the data you collect afterwards to add a "label", so you can use the data in textcat.batch-train. For example, once you're done with ner.match, you can export the data:

prodigy db-out your_match_dataset > data.jsonl

Then run a quick search-and-replace to add "label": "NEG" to each entry in the JSONL, and create a new dataset for the converted annotations. You can then pre-train your text classification model from that:

prodigy dataset textcat_match_dataset "Converted dataset with added labels"
prodigy db-in textcat_match_dataset data_converted.jsonl
prodigy textcat.batch-train textcat_match_dataset ... # etc
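If you'd rather not do the search-and-replace by hand, the conversion step can be sketched in a few lines of Python. The function name and file paths below are just placeholders, not part of Prodigy's API:

```python
import json

def add_label(in_path, out_path, label="NEG"):
    """Read annotations exported with db-out and write a copy with a
    top-level "label" added to every entry, so the data can be used
    with textcat.batch-train."""
    with open(in_path, encoding="utf8") as in_file, \
         open(out_path, "w", encoding="utf8") as out_file:
        for line in in_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines in the JSONL
            entry = json.loads(line)
            entry["label"] = label
            out_file.write(json.dumps(entry) + "\n")

# e.g. add_label("data.jsonl", "data_converted.jsonl", label="NEG")
```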

Once you have a model that's learned a bit more about your "NEG" label, you can load it into textcat.teach and start improving the model, without the immediate need to use patterns for bootstrapping.