I was wondering what the best way would be to merge my new custom model with new entities (from blank:en) with the generic en_core_web_lg model?
What if I created two NER models each with a custom entity (ENTITYA, ENTITYB) and wanted to merge them together, how would I do that?
And do the predictions depend on the available entities? For example, if I were to retrain the merged model on the original dataset, would the fact that ENTITYA is related to ENTITYB be reflected in the model and increase its quality?
Hi! By "merge", do you mean create an NER model that predicts the same entity types as en_core_web_lg, but with your custom entities? Or do you want to keep the other components (tagger, parser) and use a custom entity recognizer?
Ideally, you'd want to train one entity recognizer with data for both custom entity types. The presence and absence of one entity can indeed make a difference for the prediction of other entities – for example, a (potentially incorrect) prediction of ENTITYA may become a lot less likely if it's directly following ENTITYB, and if a token is already part of ENTITYB with a high confidence, it means that it can't be in ENTITYA, and so on. So this can definitely give you a significant boost in accuracy.
If you want to update a model that already predicts other entity types, make sure to include enough data with those entity types as well, not just your new types. Otherwise, your model may "forget" what it previously predicted correctly. An easy way to create data for this in Prodigy is to use a recipe like ner.correct and have the model pre-highlight all entities for all labels you're interested in, plus your new labels. This way, your data includes everything the model previously got right and will be "reminded" of it during training.
By default, Prodigy will skip examples that are already in the dataset – in ner.correct, it will exclude an example with the same text. That typically makes sense so you don't get asked about the same text twice (even if it's with different suggestions).
In your case, you could just use separate datasets, e.g. one for MONEY, one for ENTITYA and one for ENTITYB. When you train your final model with train, you can specify multiple datasets and the annotations get merged automatically. This is also a good approach in the development phase because it makes it easy to start over: for example, maybe you start labelling ENTITYB and after 50 examples, you realise that your label definition isn't ideal and you should be labelling ENTITYC and ENTITYD instead. You can then create a new dataset and start over.
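To make the merging idea concrete, here is a minimal sketch of what pooling annotations from separate datasets can look like. This is not Prodigy's actual implementation: it assumes each example is a dict with a "text" and a list of "spans" (character offsets plus a label), and it simply groups examples by their exact text. The function name merge_datasets and the sample data are made up for illustration.

```python
def merge_datasets(*datasets):
    """Pool the spans of all examples that share the same text.

    Illustrative only: examples are matched on their exact text, and
    duplicate spans (same start, end and label) are kept once.
    """
    spans_by_text = {}  # text -> set of (start, end, label)
    for dataset in datasets:
        for example in dataset:
            spans = spans_by_text.setdefault(example["text"], set())
            for span in example.get("spans", []):
                spans.add((span["start"], span["end"], span["label"]))
    return [
        {
            "text": text,
            # Sorting the tuples orders the spans by start offset.
            "spans": [
                {"start": start, "end": end, "label": label}
                for start, end, label in sorted(spans)
            ],
        }
        for text, spans in spans_by_text.items()
    ]


# Hypothetical data: the same text annotated in three separate datasets.
text = "Acme paid $5 to Bob"
money_data = [{"text": text, "spans": [{"start": 10, "end": 12, "label": "MONEY"}]}]
entitya_data = [{"text": text, "spans": [{"start": 0, "end": 4, "label": "ENTITYA"}]}]
entityb_data = [{"text": text, "spans": [{"start": 16, "end": 19, "label": "ENTITYB"}]}]

merged = merge_datasets(money_data, entitya_data, entityb_data)
for span in merged[0]["spans"]:
    print(span["label"], repr(text[span["start"]:span["end"]]))
```

Keeping one dataset per label means each set can be discarded or redone on its own, and the combined view is only built at training time.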
Alternatively, you can also run ner.correct with your base model, the label(s) you want to keep and the new labels. You can always add more labels when you annotate, even if the model doesn't know about them. So if you use --label MONEY,ENTITYA,ENTITYB, this will show you the model's predictions for MONEY, and will let you manually add annotations for your other entity types on top of it.
Finally, you can also use an existing dataset as the input source (instead of the JSON file) by writing dataset:your_dataset_name. So if you already have annotations, you can load them back in, edit them and save the results to a new set.
Ah yes, the problem in your case is that you're asking Prodigy to show you the model's predictions on a pre-annotated dataset. This will remove any existing spans and replace them with the model's predictions, because otherwise, the result would be very confusing and you wouldn't be able to tell where the suggestions come from. And if you've trained a model with only one label, it makes sense that the model is only able to produce that one label.
My initial suggestion was to just train one model on both datasets you've created for ENTITYA and ENTITYB. So when you train, just pass in entitya_dataset and entityb_dataset (for instance). This gives you a model that predicts both, and then you can correct its predictions.
Thank you for your support. I trained one model on both datasets. That works!
Here is the next catch:
My two datasets are basically the same dataset (except the one with ENTITYA has 30% more data). The ENTITIES relate to each other but don't overlap.
So I would like to achieve one perfect dataset with all entities.
I've spent hours perfectly annotating it with ENTITYA and wouldn't like to reannotate it / check annotations again.
Is there a way to simply add ENTITYB predictions from my temp ENTITYB model so I can speed up my annotation process?
And as you pointed out above, when the entities are together, the model's quality may increase dramatically!
EDIT: May I also ask, what would be the recommended way to add the ORG entity from the web_lg model into my perfectly annotated dataset? ORG doesn't overlap. In my novice head, I would just run a web_lg model to give me predictions. I would have to do some work to correct the predictions, but 80% of the workload would be done.
There are just two things to keep in mind here: first, make sure the spans are provided in order (so just sort them by start). And second, make sure you don't have conflicting or overlapping spans in there. If you do end up with two conflicting spans, you'd have to decide which one to prioritise – either the one by your model, or the one that's already in the data.
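As a rough illustration of those two points, here is a small sketch that adds predicted spans to an example's existing spans. It assumes spans are dicts with character offsets ("start", "end") and a "label", and it resolves conflicts by always keeping the span that's already in the data. The function name add_predicted_spans and the sample spans are hypothetical. With spaCy, the predicted dicts could be built from doc.ents using ent.start_char, ent.end_char and ent.label_.

```python
def add_predicted_spans(existing, predicted):
    """Add predicted spans that don't overlap anything already kept.

    Existing (manually annotated) spans always win over predictions.
    The result is sorted by start offset.
    """
    merged = list(existing)
    for span in sorted(predicted, key=lambda s: s["start"]):
        # Two spans overlap if each one starts before the other ends.
        overlaps = any(
            span["start"] < kept["end"] and kept["start"] < span["end"]
            for kept in merged
        )
        if not overlaps:
            merged.append(span)
    return sorted(merged, key=lambda s: s["start"])


# Hypothetical example: "Acme Corp hired Bob"
existing_spans = [{"start": 0, "end": 9, "label": "ENTITYA"}]  # "Acme Corp"
predicted_spans = [
    {"start": 16, "end": 19, "label": "ENTITYB"},  # "Bob", no conflict
    {"start": 0, "end": 4, "label": "ORG"},  # "Acme", conflicts with ENTITYA
]

combined = add_predicted_spans(existing_spans, predicted_spans)
print(combined)
```

If you'd rather trust the model in case of a conflict, you could swap the two arguments, so the predictions are treated as the spans to keep.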