Textcat.teach not using the pattern file

Hi! I think the behaviour you're seeing might be related to what I describe in this thread:

In short, if you're starting from scratch with very unbalanced classes, or with a very large corpus that contains only few match candidates, it can happen that not enough initial matches are produced and that matches found later on are skipped because of their score.

When you run a teach recipe with patterns, Prodigy will combine the pattern matches and the model's suggestions. If no matches are found in a batch of examples, Prodigy will only yield the model's suggestions, which can be very random if the model hasn't learned anything yet.
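To make the fallback behaviour concrete, here's a minimal plain-Python sketch of that merging logic. It's an illustration of the behaviour described above, not Prodigy's actual implementation — the `pattern_matches` and `model_suggestions` functions are stand-ins for the real matcher and model:

```python
import random

random.seed(0)

def pattern_matches(batch):
    # Stand-in matcher: only examples containing a seed keyword match
    return [(0.9, eg) for eg in batch if "refund" in eg["text"]]

def model_suggestions(batch):
    # Stand-in for an untrained model: scores are essentially arbitrary
    return [(random.random(), eg) for eg in batch]

def combined_stream(batches):
    # If a batch yields no pattern matches, only the model's
    # (random) suggestions are left to annotate
    for batch in batches:
        matches = pattern_matches(batch)
        if matches:
            yield from matches
        else:
            yield from model_suggestions(batch)
```

With a sparse corpus, most batches fall through to the second branch, which is why the questions can feel random early on.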

The pattern matcher also assigns scores to the matches, based on how reliably each pattern produces a correct match. This makes sense when you're working with lots of patterns and matches, because you still want to focus on the most important examples – but in cases with low match density, it can cause the active learning algorithm to skip the few matches that do exist.
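Here's a tiny sketch of why that happens. The `prefer_uncertain` function below is a simplified stand-in for an uncertainty sorter (not Prodigy's actual one): it keeps examples scored close to 0.5, so a rare but confidently scored pattern match gets filtered out:

```python
def prefer_uncertain(scored_examples, threshold=0.2):
    # Keep only examples whose score is near 0.5 (most uncertain).
    # Simplified stand-in, not the real sorter.
    return [eg for score, eg in scored_examples if abs(score - 0.5) <= threshold]

stream = [
    (0.95, {"text": "clear pattern match"}),   # high score -> filtered out
    (0.48, {"text": "random model guess"}),    # near 0.5 -> kept
]
print(prefer_uncertain(stream))  # only the model's uncertain guess survives
```

With many matches this trade-off is fine, but when matches are scarce, losing even a few of them hurts.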

I've discussed some solutions in the thread linked above – for example, for the next release, we'll be updating the logic used to sort and merge the matcher's and the model's suggestions, to prevent matches from being skipped.

In the meantime, you could try using a separate step to bootstrap the model. The main problem the patterns are trying to solve is the cold start: you need enough initial training examples for the model to make meaningful suggestions. So you could first find the matches to bootstrap an initial training set, pre-train the model with that data, and then use textcat.teach to improve it.

One idea could be to repurpose the ner.match recipe and add a "label": "NEG" to the selected examples. You could also check out the recipe source and write your own, or implement a different matching logic with regular expressions etc.
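As a starting point for the regular-expressions route, here's a minimal sketch of that bootstrapping step. The seed patterns, the "NEG" label and the example texts are all placeholders — substitute your own terms and label scheme:

```python
import re

# Hypothetical seed patterns for a "NEG" label — replace with your own
PATTERNS = [re.compile(r"\b(awful|terrible|worst)\b", re.I)]

def bootstrap_examples(texts, label="NEG"):
    """Collect texts matching any seed pattern as pre-labelled examples."""
    examples = []
    for text in texts:
        if any(p.search(text) for p in PATTERNS):
            examples.append({"text": text, "label": label, "answer": "accept"})
    return examples

corpus = [
    "This product is awful and broke after a day.",
    "Delivery was quick, thanks!",
    "Worst customer service I've ever dealt with.",
]
seed_data = bootstrap_examples(corpus)
# seed_data now holds two pre-labelled examples you could pre-train with
```

You could review the collected examples manually (e.g. with prodigy mark), export them and use them to pre-train the model before switching to textcat.teach.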