Hi Ines! Thanks a lot for your help. I managed to follow your suggestion step by step until the end: I converted the data, trained a multilabel model with spaCy, saved the model to disk, and used it as a pre-trained model to help me annotate with `textcat.teach` for one of the labels.
I have two follow-up questions:

- If I generate binary annotations with `textcat.teach` and then run `textcat.batch-train` to train the multilabel model on those binary annotations, what exactly is the model being trained on? Does Prodigy assume that all other labels are False?
- While doing binary annotation for LABEL_ONE, all the documents that `textcat.teach` suggests are negative examples (none of them are LABEL_ONE). How can I make `textcat.teach` suggest the examples with the highest probability of belonging to LABEL_ONE, so that I can generate more positive annotations, instead of the examples with the most uncertain scores? (I know there must be a lot of value in choosing the uncertain ones, but it doesn't seem ideal when you have an imbalanced multilabel dataset and you're just getting started.)
Thanks!