Is there a way I can train a model on a Prodigy dataset but only on specific labels? I want to remove some entities from my model as they are no longer applicable.
Is it possible to do this in Prodigy as well as in spaCy?
My dataset is already split into train and test and I would like to maintain this split so I can compare the model against previous iterations.
Hi! If you want to change the entity labels a model is updated with, you do have to change the data and remove them. One option would be to use Prodigy's Database API to load your dataset, and then filter the "spans" of each example to only include the labels you want. You can then import the result to a new dataset.
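Here's a rough sketch of what that could look like. The dataset names and the helper names (`filter_spans`, `copy_dataset_with_labels`) are made up for illustration, and I'm assuming `db.get_dataset` is available to load the examples (newer Prodigy versions may call this `get_dataset_examples` instead, so check the Database API docs for your version). Re-hashing the examples is also an assumption on my side: since the spans change, it seemed safer to recompute the task hashes.

```python
from typing import Iterable, List, Set


def filter_spans(examples: Iterable[dict], keep_labels: Set[str]) -> List[dict]:
    """Return copies of the examples whose "spans" only contain labels in keep_labels."""
    filtered = []
    for eg in examples:
        eg = dict(eg)  # shallow copy so the loaded examples stay untouched
        eg["spans"] = [s for s in eg.get("spans", []) if s.get("label") in keep_labels]
        filtered.append(eg)
    return filtered


def copy_dataset_with_labels(source: str, target: str, keep_labels: Set[str]) -> int:
    """Load a Prodigy dataset, filter its spans and save the result as a new dataset."""
    # Imported here so filter_spans can be used without Prodigy installed.
    from prodigy import set_hashes
    from prodigy.components.db import connect

    db = connect()  # uses the database settings from your prodigy.json
    # Assumption: get_dataset returns the list of examples for this Prodigy version.
    examples = filter_spans(db.get_dataset(source), keep_labels)
    # Assumption: recompute the hashes, since the spans (and so the task) changed.
    examples = [set_hashes(eg, overwrite=True) for eg in examples]
    db.add_dataset(target)
    db.add_examples(examples, datasets=[target])
    return len(examples)
```

To keep your existing split, you'd run this once per dataset, e.g. `copy_dataset_with_labels("ner_train", "ner_train_filtered", {"PERSON", "ORG"})` and the same again for your test dataset (all names here are placeholders).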
Alternatively, you can also filter the spaCy training data directly (if you're training with spaCy v3, you can load the binary .spacy file with the DocBin and filter the doc.ents of each Doc object for both your training and evaluation set).
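For the spaCy route, a minimal sketch could look like the following. The function names and file paths are placeholders, and I'm assuming a blank English pipeline is enough to provide the vocab and that the default `DocBin` settings are fine for your data (if your original file stores extra attributes or user data, you'd want to pass those settings along).

```python
from pathlib import Path
from typing import Set


def keep_labels(doc, labels: Set[str]):
    """Drop every entity from doc.ents whose label is not in labels."""
    doc.ents = [ent for ent in doc.ents if ent.label_ in labels]
    return doc


def filter_docbin(input_path: Path, output_path: Path, labels: Set[str], lang: str = "en") -> int:
    """Load a .spacy file, filter the entities of each Doc and save a new .spacy file."""
    # Imported here so keep_labels stays usable on its own.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # only needed for its vocab
    doc_bin = DocBin().from_disk(input_path)
    docs = [keep_labels(doc, labels) for doc in doc_bin.get_docs(nlp.vocab)]
    DocBin(docs=docs).to_disk(output_path)
    return len(docs)
```

You'd call it once for each file, e.g. `filter_docbin(Path("train.spacy"), Path("train_filtered.spacy"), {"PERSON", "ORG"})` and then the same for your evaluation file, so the split stays exactly as it was.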
You can definitely keep the same split and compare against previous iterations, although your evaluation results won't necessarily be comparable. The presence or absence of a given label will also have an impact on all other labels, and the final score is calculated based on all labels. But you could still look at the more fine-grained scores for the individual labels. If what you care about is that the instances of one label are correctly recognized, and leaving out an old label causes the score of another label to go up, this is definitely a positive outcome.
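If you want to compare the per-label scores between two runs, something along these lines might help. It assumes you have the scores of both models as dictionaries with an `"ents_per_type"` entry mapping each label to its `"p"`, `"r"` and `"f"` values, which is the shape spaCy's evaluation reports for named entities. The function name is just for illustration.

```python
from typing import Dict, Iterable


def compare_label_scores(old: Dict, new: Dict, labels: Iterable[str]) -> Dict[str, float]:
    """Return the change in F-score (new minus old) for each label present in both runs."""
    old_types = old.get("ents_per_type", {})
    new_types = new.get("ents_per_type", {})
    return {
        label: new_types[label]["f"] - old_types[label]["f"]
        for label in labels
        if label in old_types and label in new_types  # skip labels you removed
    }
```

A positive value for a label means the new model does better on it than the previous iteration did, even if the overall scores aren't directly comparable.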