First off, thanks for the great tool. It's very convenient and 'oddly satisfying' to use, and I've had good success creating my custom NERs with it.
As I changed the scope of the project mid-way (I wanted to add multiple custom NEs instead of just one), I'm having a little trouble validating the best possible workflow for this.
E.g., let's say I have 4 entities that I'm tagging in my training set.
After reading many of the similar threads here, I understand that the ideal way to create/train a model with multiple entities would be to tag all the NEs in each of the entries in the training dataset. But tagging only one NE at a time seems more convenient, so I followed the approach mentioned in this thread here.
Step 1: I create 4 different workflows, tagging only 1 entity each and storing the annotations in 4 different datasets (all trained on blank:en from scratch).
Step 2: I provide all 4 datasets and run a train, something like:
prodigy train --ner LABEL1_db,LABEL2_db,LABEL3_db,LABEL4_db ./model-COMBINED --eval-split 0.2
Now I have a model which should ideally identify all 4 different NEs, right?
But when I try
prodigy ner.teach ALL_LABELS_corrected_db ./model-COMBINED ../input_JSON_file.jsonl --label LABEL1,LABEL2,LABEL3,LABEL4
or
prodigy ner.correct ALL_LABELS_corrected_db ./model-COMBINED ../input_JSON_file.jsonl --label LABEL1,LABEL2,LABEL3,LABEL4
it identifies only one of the 4 LABELs for any of these entries. (Ideally, I would want it to identify and auto-tag all 4 LABELs in this second pass after training, so that I can 'correct' many more of these entries with all 4 labels this time. But every correct / merge-to-gold / train cycle only gives me a new model that recognizes only one of these 4 NEs in the subsequent ner.correct or ner.teach phase.) How do I finally make a model that recognizes all 4 labels in one sentence?
I'm not sure if I'm adopting the right workflow. (That's query 1.)
Alternatively, as Step 2, should I do the following to merge the 4 datasets?
I db-out all 4 independently labelled datasets into 4 separate .jsonl files, merge those 4 .jsonl files into one (say, ALL_4_merged_JSONL_files.jsonl), and create a new dataset using db-in:
prodigy db-in COMBINED_db ./ALL_4_merged_JSONL_files.jsonl
and then train this COMBINED_db for all 4 labels?
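To make that merge step concrete, here's a minimal stdlib-only sketch of what I mean by combining the four db-out exports into one file (the file names are just the placeholders from above, not real paths):

```python
import json

def merge_jsonl(in_paths, out_path):
    """Concatenate several Prodigy JSONL exports into a single JSONL file."""
    with open(out_path, "w", encoding="utf8") as out:
        for path in in_paths:
            with open(path, encoding="utf8") as f:
                for line in f:
                    line = line.strip()
                    if not line:  # skip blank lines
                        continue
                    # round-trip through json to catch malformed lines early
                    out.write(json.dumps(json.loads(line)) + "\n")

# e.g. (placeholder file names):
# merge_jsonl(
#     ["LABEL1.jsonl", "LABEL2.jsonl", "LABEL3.jsonl", "LABEL4.jsonl"],
#     "ALL_4_merged_JSONL_files.jsonl",
# )
```

After writing ALL_4_merged_JSONL_files.jsonl this way, I'd run the db-in command above to create COMBINED_db.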
What would be the best way/practice to combine these 4 datasets, which are tagged for 4 different labels? (That's query 2.)
Query 3: If in the future I have to add a 5th label, would adding it to the combined model/dataset mean re-adjusting all the weights for the earlier 4 labels, with some sort of performance degradation before the model adjusts back to accommodate the new label? (In this case, would it be better to make a new 5th dataset for the new label and then do the merge exercise as above?)
Thanks for trying to understand my queries.
TomT
P.S.: All labels are more or less equally distributed in the training set (not that imbalanced).
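In case it's relevant, this is roughly how I eyeballed that distribution, assuming the usual Prodigy NER annotation format where each example carries a "spans" list with "label" keys (the path is a placeholder):

```python
import json
from collections import Counter

def label_counts(jsonl_path):
    """Count span labels in a Prodigy-style NER JSONL export."""
    counts = Counter()
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue
            eg = json.loads(line)
            for span in eg.get("spans", []):
                counts[span.get("label")] += 1
    return counts

# e.g. label_counts("ALL_4_merged_JSONL_files.jsonl")
```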