Okay, so this confirms that what you're seeing in Prodigy is consistent with the model. When you load in the data, Prodigy will use your model to score the examples – for NER, this is a little more complex, since there are so many options. But for text classification, all we need to do is check the `doc.cats` for the respective label. That score is the same value displayed with the annotation task.
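If you want to double-check this outside of Prodigy, you can load the model yourself and look at `doc.cats` directly. Here's a minimal sketch – the model path, the example text and the label name are just placeholders for your setup:

```python
import spacy

# Load the same model you pass to the recipe (path is a placeholder)
nlp = spacy.load("/path/to/your/textcat_model")

# Any text from your stream
doc = nlp("Een voorbeeldtekst uit je data")

# All label scores assigned by the text classifier
print(doc.cats)          # e.g. {"zorg": 0.22, ...}

# The score for a single label, i.e. what's shown with the annotation task
print(doc.cats["zorg"])
```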
So in this example, the model predicts 0.22 for `zorg`, and around 0.06 or lower for everything else... which seems a bit strange? But it really depends on the training data. Maybe your model just hasn't seen many similar texts? Or maybe you had examples of medical cannabis (`zorg` = care, as in health care, right?), but none of illegal weed plantations? Or maybe something did go wrong, and the predictions make no sense at all.
But this definitely explains what you're experiencing in `textcat.teach`: all predictions are low, so Prodigy starts by suggesting something, to see where it leads. Have you tried annotating a few batches (like 20-30 examples)? Do you notice any changes in the scores? Maybe the model has just seen too few examples so far, and the scores will adjust after a few more updates. Maybe not, and in that case the solution might be in the model training and architecture (as discussed in the other thread).
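Just to make the "starts by suggesting something" part more concrete: the stream is sorted by uncertainty, i.e. scores closest to 0.5 are preferred. Prodigy's actual `prefer_uncertain` sorter is a bit smarter than this (and works on a generator over the stream, not a sorted list), but here's a rough sketch of the idea with made-up scores:

```python
def prefer_uncertain_sketch(scored_stream):
    """Very simplified stand-in for an uncertainty sorter:
    yield examples whose score is closest to 0.5 first."""
    ranked = sorted(scored_stream, key=lambda item: abs(item[0] - 0.5))
    for score, example in ranked:
        yield example

# Made-up scores, roughly like the situation described above:
# everything is low, but 0.22 is still the "most uncertain" one,
# so that's what gets suggested first.
stream = [
    (0.22, {"text": "example A", "label": "zorg"}),
    (0.06, {"text": "example B", "label": "zorg"}),
    (0.03, {"text": "example C", "label": "zorg"}),
]

for task in prefer_uncertain_sketch(stream):
    print(task["text"])
# example A, example B, example C
```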
Yeah, we went back and forth on that decision and it wasn't an easy one to make. I definitely see your point. In the end, we went with the more conceptual view that it'd be dangerous for Prodigy to make those kinds of assumptions quietly and behind the scenes. Even now, a "duplicate question" is actually kind of difficult to define outside of the active learning-powered recipes with binary feedback. There's a related discussion in this thread where I talk about some of the problems and potential solutions for manual annotation recipes like `ner.manual`.
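To illustrate why "duplicate" is hard to pin down: it depends on whether two tasks count as duplicates because they have the same input text, or because they ask the same question about that text (same text *and* same label). Prodigy's `_input_hash` / `_task_hash` distinction goes along these lines, but the snippet below is only a rough, hypothetical illustration with made-up texts and labels, not the actual implementation:

```python
import hashlib
import json

def input_hash(task):
    # Duplicate by input: same text counts as the same question,
    # no matter which label is being asked about.
    return hashlib.md5(task["text"].encode("utf8")).hexdigest()

def task_hash(task):
    # Duplicate by task: same text *and* same label/question.
    key = json.dumps({"text": task["text"], "label": task.get("label")},
                     sort_keys=True)
    return hashlib.md5(key.encode("utf8")).hexdigest()

a = {"text": "Politie vindt hennepkwekerij", "label": "zorg"}
b = {"text": "Politie vindt hennepkwekerij", "label": "overig"}

print(input_hash(a) == input_hash(b))  # True: same input text
print(task_hash(a) == task_hash(b))    # False: different questions
```

With binary feedback, the same text can legitimately come up several times with different labels, so which of those two definitions you actually want really depends on the recipe.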