Best practice for merging multiple NER datasets into one

First off, thanks for the great tool. It's very convenient and 'oddly satisfying' to use, and I've had good success creating my custom NER models with it.

As I changed the scope of the project mid-way (I wanted to add multiple custom NEs instead of just one), I'm having a little trouble validating the best possible workflow for this.

e.g. let's say I have 4 entity types that I'm tagging in my training set.
After reading many of the similar threads here, I understand that the ideal way to create/train a model with multiple entities would be to tag all the NEs in each of the entries in the training dataset. But tagging only one NE at a time seems more convenient, so I followed the approach mentioned in this thread here.
Step 1: I create 4 different workflows, tagging only 1 entity each and storing the annotations in 4 different datasets (all trained from scratch on blank:en).
Step 2: I provide all 4 datasets and run a train, something like:

prodigy train ./model-COMBINED --ner LABEL1_db,LABEL2_db,LABEL3_db,LABEL4_db --eval-split 0.2

Now I have a model which should ideally identify all 4 different NEs, right? But when I try

prodigy ner.teach ALL_LABELS_corrected_db ./model-COMBINED ../input_JSON_file.jsonl --label LABEL1,LABEL2,LABEL3,LABEL4

or

prodigy ner.correct ALL_LABELS_corrected_db ./model-COMBINED ../input_JSON_file.jsonl --label LABEL1,LABEL2,LABEL3,LABEL4

It identifies only one of the 4 LABELs for any of these entries. (Ideally I would want it to identify and auto-tag all 4 LABELs in this second pass after training, so that I can 'correct' a lot more of these entries with all 4 labels this time.) But every correct → merge-to-gold → train cycle only gives me a new model that recognizes just one of these 4 NEs in the subsequent ner.correct or ner.teach phase. How do I finally make a model that recognizes all 4 labels in one sentence?
I'm not sure if I'm adopting the right workflow. (That's query 1.)

Alternatively, as Step 2, should I do the following to merge the 4 datasets?

I db-out all 4 independently labelled datasets into 4 separate .jsonl files, merge the 4 files into one .jsonl (say, ALL_4_merged_JSONL_files.jsonl), and create a new dataset using db-in:

prodigy db-in COMBINED_db ./ALL_4_merged_JSONL_files.jsonl

and then train this COMBINED_db for all 4 labels?
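
For concreteness, the shell steps I have in mind would look something like this (the intermediate .jsonl file names are just placeholders):

prodigy db-out LABEL1_db > ./LABEL1.jsonl
prodigy db-out LABEL2_db > ./LABEL2.jsonl
prodigy db-out LABEL3_db > ./LABEL3.jsonl
prodigy db-out LABEL4_db > ./LABEL4.jsonl
cat ./LABEL1.jsonl ./LABEL2.jsonl ./LABEL3.jsonl ./LABEL4.jsonl > ./ALL_4_merged_JSONL_files.jsonl
prodigy db-in COMBINED_db ./ALL_4_merged_JSONL_files.jsonl
prodigy train ./model-COMBINED --ner COMBINED_db --eval-split 0.2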

What would be the best way/practice to combine these 4 datasets, which are tagged for 4 different labels? (That's query 2.)

Query 3: If, in the future, I have to add a 5th label, would adding it onto the combined model/dataset mean re-adjusting all the weights for the earlier 4 labels, and some sort of performance degradation before it adjusts back to accommodate the new label? (In this case, would it be better to make a new 5th dataset for the new label and then do the merge exercise as above?)

Thanks for trying to understand my queries.
TomT

P.S.: All labels are more or less equally distributed in the training set (not that imbalanced).

Hi! Your approach definitely sounds reasonable, and Prodigy's train and data-to-spacy will take care of merging annotations under the hood: they create one example per text, with all the annotations you've collected for that given text. If you want to add more labels in the future, you can create a new dataset with only that label and provide it when you train or export the data with data-to-spacy. The only thing to keep an eye on is overlaps/conflicts – if your different datasets contain conflicting annotations, this can have an impact on accuracy.
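
For example, if the new dataset were called LABEL5_db (a placeholder, following your naming scheme), you'd just append it to the list of training datasets:

prodigy train ./model-COMBINED --ner LABEL1_db,LABEL2_db,LABEL3_db,LABEL4_db,LABEL5_db --eval-split 0.2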

One thing you could try to debug the prediction of different labels is to set --label-stats when you train and check out the per-label accuracies. If your accuracies are low for only some labels, this could indicate that the model hasn't learned much about that given label from the data. You could then investigate the data to better understand why that might be happening.
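
For example, using your command from above:

prodigy train ./model-COMBINED --ner LABEL1_db,LABEL2_db,LABEL3_db,LABEL4_db --eval-split 0.2 --label-stats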

You could also run data-to-spacy to export a training corpus and run that through spaCy's debug data (https://spacy.io/api/cli#debug-data). This will show you a lot of stats about your data that could help with locating specific problems.
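
For example, something like this (the ./corpus output directory is just a placeholder, and the exact file names it contains can vary slightly between versions):

prodigy data-to-spacy ./corpus --ner LABEL1_db,LABEL2_db,LABEL3_db,LABEL4_db --eval-split 0.2
python -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy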