My team is currently building a multi-label classifier. We first tagged examples using textcat.manual with a multiple-choice interface. After collecting a batch of annotations, we trained a model based on a pretrained transformer from simpletransformers. To do so, we turned the accepted answers into a vector of 0s and 1s. The model trained well and did a great deal better than the model trained with Prodigy directly.
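For context, this is roughly how we turn the choice annotations into those vectors (a minimal sketch; the label names and file name are placeholders for our own setup, not anything Prodigy-specific):

```python
# Build multi-hot label vectors from exported textcat.manual (choice) annotations.
# LABELS and the input filename are placeholders for our own setup.
import json

LABELS = ["LABEL_A", "LABEL_B", "LABEL_C"]  # our label set, in a fixed order

examples = []
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        selected = set(eg.get("accept", []))  # option ids ticked by the annotator
        vector = [1.0 if label in selected else 0.0 for label in LABELS]
        examples.append({"text": eg["text"], "labels": vector})
```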
Now we want to annotate more examples and use a binary approach for this. Since we have a great deal of imbalance between the labels in our existing training set, we use the trained model to get predictions and then sort the examples so that the labels which are underrepresented in our dataset appear first. We then use the mark recipe with the source data sorted outside of Prodigy.
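The sorting step looks roughly like this (a sketch under our own assumptions: the label names, the rare-label set and the output filename are placeholders, and the probabilities come from whatever predict call your model exposes):

```python
# Sort the stream so that examples whose highest score belongs to an
# underrepresented label come first, then write it out for the mark recipe.
import json

LABELS = ["LABEL_A", "LABEL_B", "LABEL_C"]   # full label set, in a fixed order
RARE_LABELS = {"LABEL_B", "LABEL_C"}         # underrepresented labels

def rare_label_priority(probs):
    # Highest predicted probability among the rare labels for one example.
    return max(p for label, p in zip(LABELS, probs) if label in RARE_LABELS)

def write_sorted_stream(examples, probs_per_example, path="sorted_for_mark.jsonl"):
    ranked = sorted(
        zip(examples, probs_per_example),
        key=lambda pair: rare_label_priority(pair[1]),
        reverse=True,
    )
    with open(path, "w", encoding="utf8") as f:
        for eg, _ in ranked:
            f.write(json.dumps(eg) + "\n")
```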
I do have a conceptual question now:
To train with simpletransformers, I need to present the labels as a vector of floats. When I accept an answer for one label now, I can set that label's value in the vector to 1; when I reject it, I set it to 0. But what should I do with the other labels in this example?
I am thinking about two options:
1. Keep them at the predicted probability from the model.
2. Set them to zero.
My preference would be 1) because otherwise I might tell the model a label does not apply when in fact it could. I would love to hear your opinions on that.
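To make the two options concrete, this is roughly what I mean for a single binary annotation (just a sketch; the function and argument names are mine, not Prodigy's):

```python
# Build a target vector for one binary annotation.
# eg: a Prodigy task with a "label" and an accept/reject "answer".
# model_probs: the previous model's predicted probabilities, one per label.
def make_target(eg, model_probs, labels, keep_model_scores=True):
    # Option 1: start from the model's probabilities; option 2: start from zeros.
    target = list(model_probs) if keep_model_scores else [0.0] * len(labels)
    idx = labels.index(eg["label"])
    target[idx] = 1.0 if eg["answer"] == "accept" else 0.0
    return target
```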
Thanks for the kind words, and I'm glad you've been able to set everything up successfully! Your question makes a lot of sense.
I thought a lot about this question when designing the binary interfaces in Prodigy. I think some of the literature calls this type of scenario "bandit supervision"; it's also known as partial supervision.
It's clearest to think in terms of what gradient of the scores we want to use to update the model. In my opinion, there are two choices that make sense:
1. If the label is missing, the gradient is zero.
2. If the label is missing, label it using a previous model.
With approach 1, we are telling the loss function that we are indifferent to the output of this neuron -- we don't care, any value is equally good. This is of course not true: we do have an opinion, and we can say that out of all possible scores, scores which are closer to the previous model's output are probably better than most alternatives. On the other hand, we're hoping to do better than the previous model on average -- so this constraint may work against us.
Setting the gradients to zero is bad if you're going to update unevenly and have some classes consistently missing or underrepresented in your updates. If your missing values are sampled randomly, then setting the gradients to zero works fine.
What we want to avoid is a "catastrophic forgetting" problem where we stop supervising for some of the labels. If you do that, then the model will converge to a solution with low accuracy on those classes, because you've told it you don't care about them anymore, even though you actually do.
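Here's a minimal sketch of what the two strategies look like as a loss function, assuming a PyTorch-style setup (the function is illustrative, not what any particular library does internally): with strategy 1 the missing positions are masked out so they contribute zero gradient, with strategy 2 the targets at the missing positions are filled with the previous model's probabilities.

```python
# Conceptual sketch of the two gradient strategies for missing labels.
# targets: annotated values, with missing positions filled with the
#          previous model's probabilities (only used by strategy 2).
# mask:    1.0 where a label was actually annotated, 0.0 where it's missing.
import torch
import torch.nn.functional as F

def partial_bce(logits, targets, mask, use_previous_model=False):
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    if not use_previous_model:
        # Strategy 1: zero gradient for missing labels.
        loss = loss * mask
        return loss.sum() / mask.sum().clamp(min=1)
    # Strategy 2: missing labels are supervised towards the old model's scores.
    return loss.mean()
```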
It totally makes sense to think about what happens to the gradients under different scenarios. In our project we definitely have uneven updates, since we deliberately steer annotation towards the underrepresented labels to even out the distribution in our dataset. So I will try your option 2 (labelling the missing labels with the previous model).
A different question but related to the same project:
Now that I have labelled more data, I need to update my training and test sets. But when I change the test set, the model performance is not comparable anymore. Do you have any advice or reading tips for how to manage this situation?
The (kind of annoying) reviewer-2 advice in this scenario is to run some experiments with the new training set and both the old and new test sets. You could also run experiments with the old training set and both test sets.
This isn't always that helpful: if one pairing has a better match between the train and test sets, you might find the matching pairs do better regardless. If you're still in the exploratory phase of your project, you should probably just move on and use the new test set --- there's probably not much knowledge you're trying to preserve through data continuity.
This is what I ended up doing here. I took the old and new annotations, threw them all together and did a new random train-test split. After training, we evaluated the model on unseen data by manually correcting its predictions with the choice interface. This has worked quite well so far!
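The re-split itself is nothing fancy (a sketch; the filenames and split ratio are just examples from our setup):

```python
# Pool old and new annotations and draw a fresh random train/test split.
import json
from sklearn.model_selection import train_test_split

def read_jsonl(path):
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f]

examples = read_jsonl("old_annotations.jsonl") + read_jsonl("new_annotations.jsonl")
train, test = train_test_split(examples, test_size=0.2, random_state=42)
```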
I found a nice trick when I prepared the data for annotation:
I used the model scores to set the alpha of the background color for each option, so the buttons are colored more or less reddish depending on the model's predicted probabilities. That gives the annotator a visual cue as to what the model thinks about the example.
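Roughly like this (a sketch: the alpha computation is the relevant part, and attaching the color via a "style" key on each option is an assumption about the Prodigy version in use, so check the docs for how your version supports option styling):

```python
# Turn each label's predicted probability into a semi-transparent red
# background for the corresponding choice option.
def make_options(labels, probs):
    options = []
    for label, prob in zip(labels, probs):
        options.append({
            "id": label,
            "text": label,
            # Assumption: per-option inline styles are supported in this Prodigy version.
            "style": {"background": f"rgba(255, 0, 0, {prob:.2f})"},
        })
    return options
```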