textcat.teach showing same text twice (and not using active learning?)

Okay, so this confirms that what you're seeing in Prodigy is consistent with the model. When you load in the data, Prodigy will use your model to score the examples – for NER, this is a little more complex, since there are so many possible analyses to consider. But for text classification, all we need to do is check doc.cats for the respective label. That score is the same value displayed with the annotation task.
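
If you want to double-check this outside of Prodigy, you can load the model and inspect the scores directly. Here's a rough sketch – the model path and example text are just placeholders, so swap in your own:

```python
import spacy

# Sketch: inspect the raw textcat scores outside of Prodigy.
# "./my_textcat_model" and the example sentence are placeholders.
nlp = spacy.load("./my_textcat_model")
doc = nlp("Politie ontdekt wietplantage in woonhuis.")

# doc.cats maps each label to a score between 0.0 and 1.0 – the same
# value Prodigy displays with the annotation task
for label, score in sorted(doc.cats.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {score:.2f}")
```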

So in this example, the model predicts 0.22 for zorg and around 0.06 or lower for everything else... which seems a bit strange? But it really depends on the training data. Maybe your model just hasn't seen many similar texts? Or maybe you had examples of medical cannabis (zorg = care as in health care, right?) but none about illegal weed plantations? Or maybe something did go wrong, and the predictions make no sense at all.

But this definitely explains what you're experiencing in textcat.teach: all predictions are low, so Prodigy has very little to go on and starts by suggesting whatever it can, to see where it leads. Have you tried annotating a few batches (like 20-30 examples)? Do you notice any changes in the scores? Maybe the model has just been updated with too few examples so far, and the scores will adjust after a few more updates. Maybe not, and in that case, the solution might lie in the model training and architecture (as discussed in the other thread).
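
One quick way to sanity-check whether the model can move at all on texts like these is to simulate a few updates outside of Prodigy and re-score. Something like the sketch below – the model path and example texts are made up, and it assumes a recent spaCy version with the Example API:

```python
import spacy
from spacy.training import Example

# Sketch: feed the model a couple of hypothetical annotations and see
# whether the zorg score moves at all.
nlp = spacy.load("./my_textcat_model")
optimizer = nlp.resume_training()

annotations = [
    # a health-care text accepted for zorg, a weed plantation text rejected
    ("Patiënt krijgt medicinale cannabis voorgeschreven.", {"cats": {"zorg": 1.0}}),
    ("Politie rolt wietplantage op in een loods.", {"cats": {"zorg": 0.0}}),
]

for text, annots in annotations:
    example = Example.from_dict(nlp.make_doc(text), annots)
    nlp.update([example], sgd=optimizer)

# re-score a similar text and compare the zorg score to before the updates
print(nlp("Politie ontdekt wietplantage in woonhuis.").cats)
```

If the scores barely move even after a handful of targeted updates like that, it points back to the training setup rather than the annotation workflow.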

Yeah, we went back and forth on that decision and it wasn't an easy one to make. I definitely see your point. In the end, we went with the more conceptual view that it'd be dangerous for Prodigy to make those kinds of assumptions quietly, behind the scenes. Even now, a "duplicate question" is actually kind of difficult to define outside of the active learning-powered recipes with binary feedback. There's a related discussion in this thread where I talk about some of the problems and potential solutions for manual annotation recipes like ner.manual.