First off, I am a huge fan of Prodigy. Great work!
My team is currently building a multi-label classifier. We first tagged examples using
textcat.manual with a multiple-choice interface. After collecting a bunch of annotations, we trained a model based on a pretrained transformer from simpletransformers. To do so, we turned the accepted answers into a vector of 0s and 1s. The model trained well and performed a great deal better than the model trained with Prodigy directly.
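In case it helps to make the setup concrete, here is a minimal sketch of how the accepted answers can be turned into 0/1 vectors. The label set and the example dict are placeholders, not our real data; Prodigy's choice interface stores the selected options under the `"accept"` key.

```python
# Hypothetical label set; replace with the real labels.
LABELS = ["label_a", "label_b", "label_c"]

def to_label_vector(example, labels=LABELS):
    """Map the accepted options of one Prodigy annotation to a float vector."""
    accepted = set(example.get("accept", []))
    return [1.0 if label in accepted else 0.0 for label in labels]

# Example annotation as exported from the choice interface (illustrative).
example = {"text": "some text", "accept": ["label_a", "label_c"]}
print(to_label_vector(example))  # [1.0, 0.0, 1.0]
```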
Now we want to annotate more examples using a binary approach. However, our existing training set has a great deal of imbalance between the labels. So we use the trained model to get predictions, then sort them so that the labels which are underrepresented in our dataset appear first. We then use the
mark recipe with the source data sorted outside of Prodigy.
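The sorting step could look something like the sketch below. The label counts, prediction dicts, and tie-breaking rule are assumptions for illustration; the idea is just to rank examples for the rarest labels to the front before feeding the file to the mark recipe.

```python
# Hypothetical counts of each label in the existing training set.
label_counts = {"label_a": 500, "label_b": 40, "label_c": 120}

# Hypothetical model predictions: top label and its confidence per example.
predictions = [
    {"text": "t1", "label": "label_a", "score": 0.9},
    {"text": "t2", "label": "label_b", "score": 0.8},
    {"text": "t3", "label": "label_c", "score": 0.7},
]

# Rarest label first; ties broken by model confidence (highest first).
predictions.sort(key=lambda p: (label_counts[p["label"]], -p["score"]))
print([p["label"] for p in predictions])  # ['label_b', 'label_c', 'label_a']
```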
I do have a conceptual question now:
To train with simpletransformers, I need to present the labels as a vector of floats. When I accept an answer for one label, I can set the value for that label in the vector to 1; when I reject the label, I set it to 0. But what should I do with the other labels in this example?
I am thinking about two options:
- Keep them with the predicted probability from the model
- Set them to zero
My preference would be the first option, because otherwise I might tell the model a label does not apply when in fact it could. I would love to hear your opinions on that.
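To make the first option concrete, here is a small sketch of what I have in mind. The function and probability values are illustrative: the model's predicted probabilities fill the unannotated slots, and only the label that was explicitly accepted or rejected is overwritten with a hard 0 or 1.

```python
def update_target(probs, label_index, answer):
    """Build a training target where only the annotated label is hard 0/1."""
    target = list(probs)  # keep model probabilities for unannotated labels
    target[label_index] = 1.0 if answer == "accept" else 0.0
    return target

probs = [0.2, 0.7, 0.4]  # hypothetical model predictions for one example
print(update_target(probs, 1, "reject"))  # [0.2, 0.0, 0.4]
print(update_target(probs, 0, "accept"))  # [1.0, 0.7, 0.4]
```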