Improve custom NER model performance for different input texts

Hello,

I trained a custom NER model using spaCy 3.6.0 a while ago, specialized in recognizing two labels (HARDSKILL, SOFTSKILL) in 15K manually labeled job posting texts (using Prodigy Local). It performed acceptably when the input text was a job posting, but its quality dropped noticeably when the input was something else (e.g., a curriculum or a syllabus). To improve the performance, I know I must do some further training, but I have the following queries:

  • Should I train one isolated model per type of input text (i.e., one custom NER model for syllabi, and another one for curricula), or can I "resume" the training of my current model, gathering samples for the input texts where my current model is performing poorly? (i.e., re-train my current NER model with more samples of syllabi and curricula texts)
    • If I need to train "one specialized model per type of input text", how do I "chain" the predictions? (i.e., how do I "intersect" the entities retrieved by model1 + model2 + model3 without having overlapping spans?) I was thinking of something like this, but I do not know if it would be the right approach.
    • If, on the contrary, I can "re-train my current custom NER model with more texts of the poorly performing types", is there any command I could use for this re-training? Any additional recommendations or reading material? BTW, I will probably label the new texts using Prodigy.
    • To avoid confusion, just focus on the suggestion given as a possible answer.

Thanks and BR,

Dave.

Hi @dave-espinosa,

Whether training separate NER models per data type will be more effective than training one model depends a bit on how different the data types are and how much data you have available for each type. Honestly, I think it's hard to say upfront, and you'll get the best answer through experimentation.

For option 1), i.e. one NER model per data type, I think you'd need a custom spaCy pipeline for each (along the lines of the example from the spaCy board) and then another component that implements some logic for choosing the final prediction - probably picking the prediction from the model with the highest confidence? There's a rough sketch of that combining step below.
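Not something I've benchmarked, and the model paths here are placeholders, but roughly it could look like this: load each specialized pipeline, run the text through all of them, and resolve overlaps with `spacy.util.filter_spans` (which keeps the longest span when two overlap). Since spaCy's NER doesn't expose per-span confidences out of the box, this version simply merges everything rather than picking the most confident model:

```python
import spacy
from spacy.util import filter_spans

# Hypothetical paths to the separately trained pipelines.
MODEL_PATHS = ["./ner_jobpostings", "./ner_syllabi", "./ner_curricula"]
nlps = [spacy.load(path) for path in MODEL_PATHS]

def combined_ents(text):
    # Run the same text through every specialized pipeline.
    docs = [nlp(text) for nlp in nlps]
    # Re-create all predicted spans on a single "target" doc so they share one
    # tokenization (assumes all pipelines use the same tokenizer settings).
    target = docs[0]
    spans = []
    for doc in docs:
        for ent in doc.ents:
            span = target.char_span(ent.start_char, ent.end_char, label=ent.label_)
            if span is not None:
                spans.append(span)
    # filter_spans drops overlapping spans, preferring longer (then earlier) ones.
    target.ents = filter_spans(spans)
    return target.ents

print(combined_ents("Experience with Python and strong communication skills required."))
```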

For option 2), I think the best strategy would be to add new data to the dataset and annotate it with Prodigy's ner.teach recipe, which will serve the examples that the model is most unsure of first.
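As a rough sketch of the workflow (the dataset names, model path, and source file are placeholders, and this assumes a recent Prodigy 1.11+ setup): collect annotations on the new text types with ner.teach, then re-train starting from your existing pipeline instead of from scratch via `--base-model`:

```bash
# Annotate syllabi/curricula texts, prioritizing examples the current model is unsure about
prodigy ner.teach skills_new ./my_ner_model ./syllabi_and_curricula.jsonl --label HARDSKILL,SOFTSKILL

# Re-train, starting from the existing pipeline rather than a blank model
prodigy train ./output --ner skills_original,skills_new --base-model ./my_ner_model
```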

When you add the samples of the new data types (syllabi and curricula), it's probably best to add a data type identifier to the meta of each example so that you can easily run your experiments later.
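For instance (the filenames here are placeholders), you could tag each example with its source while preparing the JSONL input for Prodigy:

```python
import json

# Hypothetical raw text files, one text per line.
sources = {"syllabus": "syllabi.txt", "curriculum": "curricula.txt"}

with open("syllabi_and_curricula.jsonl", "w", encoding="utf8") as out:
    for data_type, path in sources.items():
        with open(path, encoding="utf8") as f:
            for line in f:
                text = line.strip()
                if text:
                    # The "meta" dict is stored with each annotation in the dataset,
                    # so you can later filter and evaluate per data type.
                    out.write(json.dumps({"text": text, "meta": {"data_type": data_type}}) + "\n")
```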