Pattern files for textcat.teach

ines · July 3, 2018, 3:19pm

I just ran your example with only one small modification: instead of the vectors, I used the en_core_web_sm model. I also only used a single instance of each of the words, because Prodigy will filter out duplicates anyway. The predictions were obviously random, because the model doesn't know testlabel yet – but I saw all terms, followed by a pattern match of "sky".

I spent a lot of time reproducing this and trying to get to the bottom of what could be happening here – it's kinda tricky, because as you can see from the code, the implementation is pretty much identical to the one in ner.teach.

The most likely explanation imo is that depending on the model state and the exponential moving average of the score in the prefer_uncertain sorter, the pattern matches are filtered out. This would also explain why this behaviour has been difficult to reproduce and only occurs sometimes in certain situations.
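To make that hypothesis a bit more concrete, here's a minimal sketch of how a filter based on an exponential moving average can drop an example. This is not Prodigy's actual prefer_uncertain implementation: the function name, the `alpha` value and the uncertainty formula are all assumptions for illustration, and I'm assuming the pattern match arrives with a confident score.

```python
from typing import Iterable, Iterator, Tuple


def filter_by_uncertainty(
    scored_stream: Iterable[Tuple[float, str]], alpha: float = 0.5
) -> Iterator[str]:
    """Hypothetical sorter: keep examples at least as uncertain as the running average.

    This only illustrates the idea, it is NOT the code behind prefer_uncertain.
    """
    ema = 0.0  # exponential moving average of the uncertainty seen so far
    for score, example in scored_stream:
        # Uncertainty is 1.0 for a score of 0.5 and 0.0 for a score of 0.0 or 1.0
        uncertainty = 1.0 - 2.0 * abs(score - 0.5)
        if uncertainty >= ema:
            yield example
        # Update the average after the decision, so the threshold drifts over time
        ema = alpha * uncertainty + (1.0 - alpha) * ema


stream = [
    (0.5, "random prediction"),
    (0.95, "pattern match: sky"),  # assumed to come in with a confident score
    (0.5, "another random prediction"),
]
kept = list(filter_by_uncertainty(stream))
print(kept)
```

Whether the "sky" example survives depends entirely on what the filter saw before it, which would fit with the behaviour only showing up sometimes.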

So for cases like that, we could offer an option to only partially apply the sorter to the stream, or, more generally, come up with an API that would allow examples in the stream to not be sorted or filtered, regardless of their score.
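Such an option doesn't exist yet, so the following is only a rough sketch of what "partially applying" a sorter could look like. The `from_pattern` flag, the function name and the filtering logic are hypothetical and not part of Prodigy's API.

```python
from typing import Dict, Iterable, Iterator, Tuple


def prefer_uncertain_except_matches(
    scored_stream: Iterable[Tuple[float, Dict]], alpha: float = 0.5
) -> Iterator[Dict]:
    """Hypothetical wrapper: pattern matches bypass the score-based filter."""
    ema = 0.0  # running average of the uncertainty, for model predictions only
    for score, example in scored_stream:
        if example.get("from_pattern"):  # assumed flag set by the pattern matcher
            yield example  # always shown, regardless of score
            continue
        uncertainty = 1.0 - 2.0 * abs(score - 0.5)
        if uncertainty >= ema:
            yield example
        ema = alpha * uncertainty + (1.0 - alpha) * ema


stream = [
    (0.5, {"text": "a"}),
    (0.95, {"text": "blue sky", "from_pattern": True}),
    (0.95, {"text": "c"}),
]
shown = [eg["text"] for eg in prefer_uncertain_except_matches(stream)]
print(shown)
```

In this sketch the pattern matches also don't update the moving average, so they can't push the threshold up or down for the model's own predictions.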

By default, Prodigy will show you the model's predictions and the pattern matches (and won't prioritise one or the other). So it's possible that you first see the model's suggestion and then the same text again, because a pattern was matched on the same text.

If you're dealing with rare labels and a large corpus, it might make sense to divide the bootstrapping process into two steps: use a simple matcher recipe with no model in the loop first to select enough positive examples for the label (or pre-select all matches from your stream, export them to a file and then load it into Prodigy). This way, you'll only see the matches and can work through them quickly. You can then use that data to pre-train the model, and use textcat.teach to improve it in the loop. This also makes the process more predictable: if the textcat.teach session doesn't produce good results, you can go back to the previous step, add more initial training examples via patterns and then repeat the process.
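For the "pre-select all matches and export them" variant, a small standalone script could be enough. Here's a sketch that uses only the standard library and a naive lowercase token match. The function names, the example texts and the output path are placeholders, and a real setup would probably use spaCy's matcher with your actual patterns instead.

```python
import json
import re
from typing import Dict, Iterable, Iterator, List


def select_matches(stream: Iterable[Dict], terms: Iterable[str]) -> Iterator[Dict]:
    """Yield only the examples whose text contains one of the terms as a token."""
    wanted = {term.lower() for term in terms}
    for example in stream:
        tokens = set(re.findall(r"\w+", example["text"].lower()))
        if wanted & tokens:
            yield example


def to_jsonl_lines(examples: Iterable[Dict]) -> List[str]:
    """Serialize each example as one JSON object per line (JSONL)."""
    return [json.dumps(example) for example in examples]


def export_matches(stream: Iterable[Dict], terms: Iterable[str], path: str) -> None:
    """Write all matching examples to a JSONL file at the given (placeholder) path."""
    with open(path, "w", encoding="utf8") as f:
        for line in to_jsonl_lines(select_matches(stream, terms)):
            f.write(line + "\n")


examples = [
    {"text": "The sky is blue."},
    {"text": "Nothing relevant here"},
    {"text": "Sky!"},
]
matched = list(select_matches(examples, ["sky"]))
print([eg["text"] for eg in matched])
```

Calling `export_matches(examples, ["sky"], "matches.jsonl")` would then give you a file that only contains the matches, which you can annotate without a model in the loop.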
