Recommended text length for training NER models in spaCy


Hello @koaning, hope this message finds you well.

After labeling some more texts, we ran some tests, and a few conclusions can be drawn from them:

  • The benchmark obtained via `train-curve` so far is slightly above 0.55.
  • Past ~2,000 labeled texts, the model appears to "stop learning": adding more texts does not seem to improve performance.

Questions:

  1. I see that `train-curve` has a `--base-model` parameter, which defaults to `None`. If the command run was `prodigy train-curve --ner station1_job1`, would the pipeline used by `train-curve` contain only `['tok2vec', 'ner']`, or what pipeline is actually used? (This matters because I want to make sure the whole "job posting" text is processed as one unit, with no sentencizing or similar; see the first sketch after this list.)
  2. Provided the "job posting" text length is kept, and knowing that increasing the number of labeled texts does not look like a promising way to improve performance (at least judging by the `train-curve` results), what other hints could help improve the model? (BTW, I am basically following the hints given in Ines Montani's video; however, in that video, other than adding newly labeled samples, I cannot see any "hyperparameter tuning" or anything similar that I could play around with.)
  3. Speaking of "hyperparameter tuning", would there be any benefit in training the model with spaCy directly? (At least from what is shown here, "using Prodigy" is apparently considered a "different model training method" than "using spaCy", but, other than the syntax, I don't understand why they would be considered "different"; see the second sketch after this list.)
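
To make question 1 concrete, this is roughly how I was planning to check it myself, assuming the model produced by `prodigy train` is saved under `./output/model-last` (that path is just an assumption on my side):

```python
import spacy

# Load the pipeline produced by `prodigy train` (the output path is an
# assumption; adjust it to wherever the trained model actually lives).
nlp = spacy.load("./output/model-last")

# With no --base-model I would expect only these two components.
print(nlp.pipe_names)  # hoping for: ['tok2vec', 'ner']

# Check that a full job posting stays one Doc with no sentence boundaries
# set by any component (False would mean nothing sentencized the text).
doc = nlp("...full job posting text here...")
print(doc.has_annotation("SENT_START"))
```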
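
And to make question 3 concrete: my current understanding is that Prodigy trains through spaCy's config system, so the hyperparameters would live in the `config.cfg` that `prodigy data-to-spacy` exports. A rough sketch of what I would poke at, assuming I have already run something like `prodigy data-to-spacy ./corpus --ner station1_job1` (the `./corpus` output directory is just an assumed path):

```python
from spacy import util

# Load the config exported by `prodigy data-to-spacy` (path is an assumption);
# this should be the same config.cfg that `spacy train` would read.
config = util.load_config("./corpus/config.cfg")

# A few of the knobs I imagine could be tuned before re-training:
print(config["training"]["dropout"])         # dropout rate
print(config["training"]["optimizer"])       # optimizer block (Adam settings, etc.)
print(config["components"]["ner"]["model"])  # NER model architecture settings
```

Is that the right mental model, or does `prodigy train` do something beyond what `spacy train` would do with the same config?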

I think we can start by working on these three questions, to see what else could be done.

Thank you!