Merging single-label models into one multi-label model

Hi,

I am new to Prodigy and working on an NER model with multiple new labels. I am wondering how to optimise the creation of such a model.

This model should have between 10 and 15 labels, mostly derived from spaCy's native named entities (PERSON, ORG, NORP, ...). Some labels should be exactly as in the spaCy model (e.g. PERSON), while others (e.g. NORP) should be split into different labels (e.g. NAT for nationalities, POA for political affiliation).

I first tried to manually annotate sentences from a large dataset with all labels, using a JSON patterns file with patterns for my set of labels. But I soon realised that I was getting confused and making mistakes (missing labels and changing the rules over time). Also, my large dataset was not well suited to some labels.

So my second approach was the following: manually annotate each label using a dedicated dataset (so that I would find named entities in almost every example). In short, for each label, I used ner.manual, then trained the model, and then used ner.correct until I reached the F-score I wanted (>~80%).

That worked well and gave me good precision and recall for each model (one model per label).

What is the best way to combine my single-label models into one multi-label model?

Or, if I had trained a model with say 5 new named entities and later on created a new label, how could I "add it" to my existing model?

Or perhaps, assuming I was entirely satisfied with spaCy's NER model performance for the label PERSON but wanted to use my trained model for ORG (for instance), would there be a way to do so?

Thanks in advance for your help.

PaulineB

Hi! I think your experience and intuition make sense here: we've also found that annotating all labels together (especially for large label sets) can often make things more difficult, because you constantly have to think about all labels and it becomes harder to adjust the label scheme during the development phase.

Ideally, you'd always want to be training from scratch using all annotations. The presence or absence of one label can have an impact on all other labels, so it makes sense to train your model on all labels combined. So whenever you have a new label, you add the annotations for it and then retrain on all datasets.

This shouldn't be more work either: if you're using Prodigy v1.9+, the train and data-to-spacy recipes will take care of merging annotations from multiple datasets automatically. The merged data will only contain each example once, and all annotations referring to that example will be merged together (data-to-spacy even merges annotations of different types, like text classification and NER).
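To make the idea of "each example once, with all its annotations merged" more concrete, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: the function name `merge_span_annotations` is made up for illustration, examples are keyed on their raw text for simplicity, and it assumes each example is a dict with `text`, `spans` (with `start`, `end`, `label`) and an optional `answer`.

```python
def merge_span_annotations(*datasets):
    """Merge accepted NER examples from several datasets, one entry per text.

    Illustrative only: keys on the raw text and does not resolve
    conflicting or overlapping spans.
    """
    merged = {}
    for examples in datasets:
        for eg in examples:
            # Assumption: only accepted examples carry usable annotations.
            if eg.get("answer", "accept") != "accept":
                continue
            entry = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
            seen = {(s["start"], s["end"], s["label"]) for s in entry["spans"]}
            for span in eg.get("spans") or []:
                key = (span["start"], span["end"], span["label"])
                if key not in seen:
                    seen.add(key)
                    entry["spans"].append(
                        {"start": span["start"], "end": span["end"], "label": span["label"]}
                    )
    # Keep spans in document order so the result is easy to inspect.
    for entry in merged.values():
        entry["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())


# Two single-label datasets that happen to contain the same sentence.
org_data = [{"text": "Apple hired Tim Cook.", "answer": "accept",
             "spans": [{"start": 0, "end": 5, "label": "ORG"}]}]
person_data = [{"text": "Apple hired Tim Cook.", "answer": "accept",
                "spans": [{"start": 12, "end": 20, "label": "PERSON"}]}]

merged = merge_span_annotations(org_data, person_data)
```

Here `merged` contains a single example carrying both the ORG and the PERSON span. Note that a sentence which only appears in the ORG dataset would end up with no PERSON spans at all, even if it mentions a person, which is why the coverage of each dataset matters when you train on the combination.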

If you're annotating a new label and you're worried that there might be overlap with other labels, you could also re-annotate an existing dataset with another label. So if you have a dataset with ORG annotations and you want to add annotations for PRODUCT, export the data with db-out and re-annotate it with --label ORG,PRODUCT. You'll see the existing annotations and can adjust them if needed, and you'll be able to add new annotations for PRODUCT. Later on, you can then train with your new ORG/PRODUCT dataset instead of the ORG dataset.
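Before re-annotating, it can be handy to check which labels an exported dataset actually contains. The snippet below is a rough sketch, assuming the export is JSONL (one task per line, as written by db-out) and that each task stores its entities under `spans` with a `label` key. The helper name `count_span_labels` and the file name in the comment are hypothetical.

```python
import json
from collections import Counter


def count_span_labels(jsonl_lines):
    """Count how often each span label occurs in an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        task = json.loads(line)
        for span in task.get("spans") or []:
            counts[span["label"]] += 1
    return counts


# With a real export you could pass an open file handle, e.g.:
# with open("org_dataset.jsonl", encoding="utf8") as f:  # hypothetical path
#     print(count_span_labels(f))

# Small in-memory sample standing in for an exported ORG dataset.
sample = [
    '{"text": "Acme released Widget.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]}',
    '{"text": "No entities here.", "spans": []}',
]
label_counts = count_span_labels(sample)
```

If a label such as PRODUCT is missing from the counts, that dataset has not been annotated for it yet, and re-annotating with both labels would fill the gap.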

Thank you Ines,

Your answer makes complete sense. I realised that my problem was that some labels were scarcer than others, so I had created special datasets for those and got good scores for the single-label models. But when I combined the datasets, the overall multi-label model's performance was degraded, most likely because I hadn't annotated these datasets with the remaining labels. As per your recommendation, I will train my model for all labels on each dataset before combining them to train the multi-label model.

So far I have been using merge_spans to combine the datasets, but I can also use data-to-spacy. It should amount to the same for NER, right?

Thank you,

PaulineB

Yes, exactly! If you have a choice, I'd definitely recommend using data-to-spacy, as it also performs additional validation (e.g. if you accidentally use the wrong dataset with the wrong types of annotation), and it can handle different annotation types and create combined corpora.