Hello,
I trained a custom NER model with spaCy 3.6.0 a while ago, specialized in recognizing two labels (HARDSKILL, SOFTSKILL) in 15K manually labeled job posting texts (annotated with Prodigy Local). It performs acceptably when the input text is a job posting, but quality drops when the input is something else (e.g., a curriculum or a syllabus). I know I need to do some further training to improve performance, but I have the following questions:
- Should I train one isolated model per type of input text (e.g., one more custom NER model for syllabi, and another one for curricula), or can I "resume" the training of my current model, gathering samples of the input texts where it is performing poorly? (i.e., re-train my current NER model with more samples of syllabus and curriculum texts)
- If I need to train one specialized model per type of input text, how do I "chain" the predictions? (i.e., how do I "intersect" the entities retrieved by model1 + model2 + model3 without getting overlapping spans?) I was thinking of something like this, but I do not know if it would be the right approach.
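  Roughly, the merging I had in mind looks like the sketch below (plain Python, no model loading, so it is easy to run; `merge_spans` is my own hypothetical helper, and I believe spaCy's `spacy.util.filter_spans` applies a similar longest-span-first rule to real `Span` objects):

  ```python
  # Hypothetical sketch: merge (start, end, label) spans predicted by
  # several models, dropping overlaps by keeping longer spans first
  # (similar in spirit to spaCy's spacy.util.filter_spans).
  def merge_spans(*span_lists):
      # Consider all candidate spans, longest first; ties go to the
      # earliest start position.
      candidates = sorted(
          (s for spans in span_lists for s in spans),
          key=lambda s: (s[1] - s[0], -s[0]),
          reverse=True,
      )
      kept, claimed = [], set()
      for start, end, label in candidates:
          positions = set(range(start, end))
          if positions & claimed:  # overlaps an already-kept span -> skip
              continue
          claimed |= positions
          kept.append((start, end, label))
      return sorted(kept)

  # e.g. model1 and model2 disagree on an overlapping region:
  m1 = [(0, 3, "HARDSKILL")]
  m2 = [(1, 2, "SOFTSKILL"), (5, 7, "HARDSKILL")]
  print(merge_spans(m1, m2))  # -> [(0, 3, 'HARDSKILL'), (5, 7, 'HARDSKILL')]
  ```

  With real pipelines I assume I would run each `nlp` on the text, collect `doc.ents` from every model, and then apply this kind of filter before writing the final entities back.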
- If, on the contrary, I can re-train my current custom NER model with more texts of the poorly performing types, is there any command I could use to do this re-training? Any additional recommendations or reading material? BTW, I will probably label the new texts using Prodigy.
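  In case it helps frame the question: my guess was that I would source my existing pipeline in the training config and point `spacy train` at the new annotations, something like the excerpt below (all paths are placeholders; I am not sure this is the right way to do it):

  ```ini
  # config.cfg excerpt (my guess, paths are placeholders)

  [paths]
  train = "./new_annotations/train.spacy"
  dev = "./new_annotations/dev.spacy"

  # Source the NER component from my already-trained pipeline
  # instead of initializing it from scratch.
  [components.ner]
  source = "./my_trained_model"
  ```

  and then run `python -m spacy train config.cfg --output ./retrained_model`, or alternatively let `prodigy train` handle it. Is that the recommended route?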
- To avoid confusion, please just focus on the suggestion given as the possible answer.
Thanks and BR,
Dave.