Recommended text length for training NER models in spaCy

dave-espinosa · June 7, 2022, 10:08pm

[ Insert time bumper here ]

Hello @koaning , hope this message finds you well.

After labeling some more texts, we ran some tests, and a few conclusions can be drawn from them:

The best score obtained via train-curve so far is slightly above 0.55.
Past ~2000 labeled texts, the model "stops learning": adding more texts does not (seem to) improve performance.
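
To make the "stops learning" observation a bit more concrete, here is a minimal sketch of how I would read a plateau off the train-curve scores. The function name `plateau_point`, the `min_gain` threshold, and the numbers below are all hypothetical and only illustrate the idea; they are not the actual train-curve output.

```python
def plateau_point(sizes, scores, min_gain=0.01):
    """Return the first training-set size after which every further step
    improves the score by less than min_gain, or None if the last step
    still improves by at least min_gain."""
    for i in range(len(sizes) - 1):
        # Score gains for every step from position i onwards
        later_gains = [scores[j + 1] - scores[j] for j in range(i, len(scores) - 1)]
        if all(gain < min_gain for gain in later_gains):
            return sizes[i]
    return None


# Hypothetical values for illustration only (NOT the real results)
sizes = [500, 1000, 1500, 2000, 2500, 3000]
scores = [0.40, 0.48, 0.53, 0.55, 0.552, 0.551]

print(plateau_point(sizes, scores))
```

With a threshold of 0.01, anything beyond the returned size is treated as "no real improvement"; a different threshold could of course move that point.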

Questions:

I see that train-curve has a --base-model parameter, which defaults to None. If the command run was prodigy train-curve --ner station1_job1, would the pipeline used by train-curve contain only ['tok2vec', 'ner'], or what pipeline is used? (This matters because I want to make sure that the whole "job posting" text is processed, with no sentencizing or similar.)
Provided the "job posting" text length is kept, and given that adding more labeled texts does not look like a promising way to improve model performance (at least according to the train-curve results), what else could I try to improve the model? (BTW, I am basically following the hints in Ines Montani's video; however, other than adding new labeled samples, I cannot see any "hyperparameter tuning" or similar in that video that I could play around with.)
Speaking of "hyperparameter tuning", would there be any benefit in training the model using spaCy? (At least from what is shown here, "using Prodigy" is apparently considered a different model training method from "using spaCy" but, other than the syntax, I don't understand why they would be considered different.)

I think we can start working on these 3 questions, to see what else could be done.

Thank you!
