Hi! I think your experience and intuition make sense here – we've also found that annotating all labels together (especially for large label sets) can often make things more difficult, because you constantly have to think about all labels, and it makes it harder to adjust the label scheme during the development phase.
Ideally, you'd always want to be training from scratch using all annotations. The presence or absence of one label can have an impact on all other labels, so it makes sense to train your model on all labels combined. So whenever you have a new label, you add the annotations for it and then retrain on all datasets.
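As a quick sketch of that workflow, with hypothetical dataset names (`ner_orgs`, `ner_products`) and assuming the v1.9/v1.10 `train` syntax (check `prodigy train --help` for your version):

```shell
# Retrain an NER model from scratch on all annotation datasets combined.
# Dataset names here are made-up examples – use your own.
prodigy train ner ner_orgs,ner_products en_core_web_sm --output ./model
```

Whenever you add a new label, you'd just append the new dataset to that comma-separated list and run the same command again.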
This shouldn't be more work either – if you're using Prodigy v1.9+, the `train` and `data-to-spacy` recipes will take care of merging annotations from multiple datasets automatically. The merged data will only contain each example once, and all annotations referring to that example will be merged together (`data-to-spacy` even merges annotations of different types, like text classification and NER).
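To make the merging idea concrete, here's an illustrative sketch in plain Python (not Prodigy's actual implementation) of what "each example appears once, with all its annotations combined" means – examples are keyed by their text, and span and label annotations from all datasets are collected onto the single merged entry:

```python
def merge_datasets(*datasets):
    """Merge lists of annotation dicts so each text appears only once."""
    merged = {}
    for dataset in datasets:
        for eg in dataset:
            entry = merged.setdefault(
                eg["text"], {"text": eg["text"], "spans": [], "labels": []}
            )
            # NER-style span annotations are concatenated (deduplicated)
            for span in eg.get("spans", []):
                if span not in entry["spans"]:
                    entry["spans"].append(span)
            # Text classification labels are merged onto the same entry
            for label in eg.get("labels", []):
                if label not in entry["labels"]:
                    entry["labels"].append(label)
    return list(merged.values())

# One dataset with ORG spans, one with PRODUCT spans on the same text
orgs = [{"text": "Apple released a phone",
         "spans": [{"start": 0, "end": 5, "label": "ORG"}]}]
products = [{"text": "Apple released a phone",
             "spans": [{"start": 17, "end": 22, "label": "PRODUCT"}]}]

merged = merge_datasets(orgs, products)
print(len(merged))              # 1 – the example appears only once
print(len(merged[0]["spans"]))  # 2 – both spans merged onto it
```

The real recipes key examples by their input hashes rather than raw text, but the effect on your training data is the same.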
If you're annotating a new label and you're worried that there might be overlap with other labels, you could also re-annotate an existing dataset with another label. So if you have a dataset with `ORG` annotations and you want to add annotations for `PRODUCT`, export the data with `db-out` and re-annotate it with `--label ORG,PRODUCT`. You'll see the existing annotations and can adjust them if needed, and you'll be able to add new annotations for `PRODUCT`. Later on, you can then train with your new `ORG`/`PRODUCT` dataset instead of the `ORG` dataset.
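Sketched as commands, with hypothetical dataset names (`ner_orgs`, `ner_orgs_products`) and assuming a recent v1.x CLI – double-check the recipe arguments against your installed version:

```shell
# Export the existing ORG annotations to a JSONL file
prodigy db-out ner_orgs > ner_orgs.jsonl

# Re-annotate the same examples with both labels; the existing ORG spans
# are pre-highlighted, and you can add or adjust PRODUCT spans as you go
prodigy ner.manual ner_orgs_products en_core_web_sm ner_orgs.jsonl --label ORG,PRODUCT

# Train with the new combined dataset instead of the old ORG-only one
prodigy train ner ner_orgs_products en_core_web_sm --output ./model
```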