Hello @koaning, hope this message finds you well.
After labeling some more texts, we have run some tests, and a few conclusions can be drawn from them:
- The benchmark obtained via `train-curve` so far is slightly above 0.55.
- Past ~2000 labeled texts, the model "stops learning": adding more texts does not (seem to) improve performance.
Questions:
- I see that `train-curve` has a `--base-model` parameter, which defaults to `None`. If the command run was `prodigy train-curve --ner station1_job1`, would the pipeline used by `train-curve` contain only `['tok2vec', 'ner']`, or what pipeline is used? (This matters for making sure the whole "job posting" text is processed, with no sentencizing or similar. One way I could check this is sketched after this list.)
- Provided the "job posting" text length is kept, and knowing that adding more labeled texts does not look like a promising way to get better performance (at least by the results obtained via `train-curve`), what could be other hints to improve the model? (BTW, I am basically following the hints provided in Ines Montani's video; however, in that video, other than adding new labeled samples, I cannot see any "hyperparameter tuning" or something similar that I could play around with.)
- Speaking of "hyperparameter tuning", would there be any benefit in training the model using spaCy? (At least by what is shown here, apparently "using Prodigy" is considered a "different model training method" than "using spaCy", but other than the syntax, I don't understand why they would be considered "different". A possible workflow is sketched below as well.)
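For the first question, one way I thought of to see which components the default setup actually produces is to run a regular training once with an output directory and inspect the saved pipeline (as far as I understand, `train-curve` itself does not keep its models). A minimal sketch, assuming the same `station1_job1` dataset and a placeholder `./model` output path:

```bash
# Train once without --base-model, saving the resulting pipeline to disk
# (./model is a placeholder output directory).
prodigy train ./model --ner station1_job1

# Print the saved pipeline's metadata; the pipeline entry should list the
# components (e.g. tok2vec and ner) and would show any added sentencizer.
python -m spacy info ./model/model-best
```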
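For the third question, my understanding is that `prodigy train` is essentially a wrapper around `spacy train`, so the main practical gain from moving to spaCy would be getting a full `config.cfg` to edit for hyperparameter experiments. A hedged sketch of that workflow, assuming a placeholder `./corpus` export directory; the override values at the end are only examples, not tuned settings:

```bash
# Export the labeled data plus an auto-generated training config.
prodigy data-to-spacy ./corpus --ner station1_job1

# Train directly with spaCy; hyperparameters live in config.cfg and can also be
# overridden on the command line with dotted arguments.
python -m spacy train ./corpus/config.cfg \
  --output ./model \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy \
  --training.dropout 0.2 \
  --training.optimizer.learn_rate 0.0005
```

If both routes give the same scores, that would at least confirm the training method itself is not the difference, and the config would be the place to experiment.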
I think we can start working on these 3 questions, to see what else could be done.
Thank you!