Pattern files for textcat.teach

I just ran your example with only one small modification: instead of the vectors, I used the en_core_web_sm model. I also only used a single instance of each of the words, because Prodigy will filter out duplicates anyway. The predictions were obviously random, because the model doesn't know testlabel yet – but I saw all terms, followed by a pattern match of "sky" :thinking:

I spent a lot of time reproducing this and trying to get to the bottom of what could be happening here – it's kinda tricky, because as you can see from the code, the implementation is pretty much identical to the one in ner.teach.

The most likely explanation imo is that depending on the model state and the exponential moving average of the score in the prefer_uncertain sorter, the pattern matches are filtered out. This would also explain why this behaviour has been difficult to reproduce and only occurs sometimes in certain situations.
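To make this more concrete, here's a rough sketch of the kind of filtering I mean. The function name matches the sorter, but the actual maths, parameter names and thresholds here are made up for illustration – this is not Prodigy's implementation:

```python
def prefer_uncertain(stream, alpha=0.1, band=0.2):
    """Yield examples whose uncertainty keeps up with an exponential
    moving average of recent uncertainties; examples that fall too far
    below the average are dropped (illustrative sketch only)."""
    ema = 0.5  # moving average of uncertainty, seeded at a neutral value
    for score, example in stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2  # 1.0 at score 0.5, 0.0 at 0 or 1
        ema = alpha * uncertainty + (1 - alpha) * ema
        if uncertainty >= ema - band:
            yield score, example
        # otherwise the example is skipped – even if it's a pattern match

stream = [(0.50, "model suggestion"), (0.98, "pattern match"), (0.55, "model suggestion 2")]
kept = [text for _, text in prefer_uncertain(iter(stream))]
# the very confident pattern match (score 0.98) gets filtered out
```

The point is that a pattern match with a very confident score looks "certain" to the sorter, so nothing special protects it from being dropped.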

So for cases like that, we could offer an option to only partially apply the sorter to the stream, or, more generally, come up with an API that would allow examples in the stream to not be sorted or filtered, regardless of their score.
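One way such an API could look – purely a sketch, `partial_apply` and the `is_exempt` callback are hypothetical names, not part of Prodigy:

```python
def partial_apply(sorter, stream, is_exempt):
    """Route exempt examples (e.g. pattern matches) around the sorter,
    so they can never be reordered or filtered out by it."""
    buffered = []  # exempt examples seen while the sorter was consuming

    def sortable():
        for eg in stream:
            if is_exempt(eg):
                buffered.append(eg)  # bypass the sorter entirely
            else:
                yield eg

    for eg in sorter(sortable()):
        while buffered:  # flush pattern matches seen since the last yield
            yield buffered.pop(0)
        yield eg
    while buffered:  # flush anything left at the end of the stream
        yield buffered.pop(0)

# Toy sorter that keeps only uncertain scores, like prefer_uncertain would
drop_confident = lambda s: (eg for eg in s if 0.4 <= eg["score"] <= 0.6)
stream = [
    {"text": "sky", "score": 0.98, "pattern": True},  # match, very confident
    {"text": "a", "score": 0.5},
    {"text": "b", "score": 0.9},
]
kept = list(partial_apply(drop_confident, iter(stream), lambda eg: eg.get("pattern")))
# the pattern match survives, even though its score would have been filtered
```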

By default, Prodigy will show you the model's predictions and the pattern matches (and won't prioritise one or the other). So it's possible that you first see the model's suggestion and then the same text again, because a pattern was matched on the same text.

If you're dealing with rare labels and a large corpus, it might make sense to divide the bootstrapping process into two steps: use a simple matcher recipe with no model in the loop first to select enough positive examples for the label (or pre-select all matches from your stream, export them to a file and then load it into Prodigy). This way, you'll only see the matches and can work through them quickly. You can then use that data to pre-train the model, and use textcat.teach to improve it in the loop. This also makes the process more predictable: if the textcat.teach session doesn't produce good results, you can go back to the previous step, add more initial training examples via patterns and then repeat the process.
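For the pre-selection step, even a plain keyword filter over your corpus can work. Here's a minimal sketch – the seed terms, corpus and file name are made up – that writes the matches as JSONL with a `"text"` key, which Prodigy's JSONL loader accepts:

```python
import json

SEED_TERMS = ["sky", "cloud"]  # hypothetical seed terms for the label

def select_matches(texts, terms):
    """Yield only the texts containing a seed term, as simple tasks."""
    for text in texts:
        lowered = text.lower()
        matched = [t for t in terms if t in lowered]
        if matched:
            yield {"text": text, "meta": {"pattern": matched[0]}}

corpus = ["The sky is blue.", "Totally unrelated.", "A cloud passed by."]
selected = list(select_matches(corpus, SEED_TERMS))

# Export as JSONL so the matches can be loaded back into Prodigy
with open("matches.jsonl", "w", encoding="utf8") as f:
    for eg in selected:
        f.write(json.dumps(eg) + "\n")
```

Once you've annotated those matches, you have a clean set of positive examples to pre-train with before bringing the model into the loop.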