This might be a really dumb question. I'm going to be working on a really large dataset with quite a few labels. It's easier to focus on one label at a time vs. all 20 and potentially missing something. Is there any downside to annotating one label at a time in a dataset vs. all at once? I can't think of anything off the top of my head that would cause an issue doing it this way, but I just wanted to cover my bases before I get too far in.
Hi! This is a totally reasonable question. When you train a model from binary annotations (or any annotations, really), Prodigy will merge all annotations on the same input, so you only end up with one training example per text. This means you can collect your annotations in a more fine-grained way (one dataset per label) and merge them automatically at the end.
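To illustrate the merging idea, here's a minimal sketch of what "one training example per text" means. This is not Prodigy's actual implementation, and the record shape (`"text"` plus a list of `"spans"`) is a simplified assumption:

```python
# Sketch: merge per-label annotation datasets into one example per text.
# NOT Prodigy's internal code -- record shapes are simplified assumptions.
from collections import defaultdict

def merge_by_text(*datasets):
    """Union the span annotations from several per-label datasets,
    producing a single merged example per unique input text."""
    merged = defaultdict(list)
    for dataset in datasets:
        for example in dataset:
            merged[example["text"]].extend(example["spans"])
    return [{"text": text, "spans": spans} for text, spans in merged.items()]

# One dataset per label, annotated separately over the same source text:
person_ds = [{"text": "Ines works at Explosion.",
              "spans": [{"start": 0, "end": 4, "label": "PERSON"}]}]
org_ds = [{"text": "Ines works at Explosion.",
           "spans": [{"start": 14, "end": 23, "label": "ORG"}]}]

examples = merge_by_text(person_ds, org_ds)
print(len(examples))               # 1 -- one training example per text
print(len(examples[0]["spans"]))   # 2 -- both labels merged onto it
```

So even if you annotate PERSON and ORG in completely separate sessions, training still sees a single example carrying both labels.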
We often recommend focusing on a smaller subset of labels, or even one label at a time. This makes it easier to focus, because you don't have to keep jumping between different labels and only have to think about one concept at a time. It also helps you counteract imbalanced distributions during example selection: if you annotate all labels together and let the model select what to annotate based on its scores, you might only get to see a specific rare label once or twice, which isn't ideal.
Btw, more generally, I'd also recommend collecting at least some gold-standard annotations using a workflow like ner.correct or ner.manual, especially if your goal is to train a model (more or less) from scratch.
Thanks for the reply, Ines! I'm essentially just using ner.teach to create a base model that I'll then use to assist in creating a new gold-standard dataset with ner.correct.
That's a really good point about imbalanced distributions, that didn't even cross my mind.
Thanks again!