Hello @SofieVL
I was reading this thread because it describes exactly my current situation. What I am doing is training a custom NER model that recognizes 4 different labels (there are a few more details here). Following Ines Montani's introductory video, I started off with 500 texts (80-20 split), just to see how things would go, and I got this result. As you can see, train-curve suggested to "add more labeled samples to training set", which I did: after adding 500 more samples (for a total of 1000 samples; 80-20 split), I got the following plot:
Comparing these results with the first test, two quick observations can be made:
- The increase in the metric was not very significant.
- Again, adding more samples could improve the model's performance.
I then decided to increase the number of added samples a bit more: 4000 labeled texts were used this time (80-20 split), with the following results:
Sadly, this time the performance worsened.
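In case it helps, this is roughly how I'm invoking train-curve for each of these experiments (a minimal sketch: the dataset name is a placeholder, and the exact arguments may vary with the Prodigy version):

```
# Sketch of the train-curve call used for each experiment
# (my_ner_dataset is a placeholder; arguments may differ by Prodigy version)
prodigy train-curve --ner my_ner_dataset --eval-split 0.2
```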
With these results, I have the following questions:
- What is the metric reported by train-curve? (accuracy, precision, F1...)
- Intuitively, it seems that this model is overfitting; however, if fewer samples are used, I see that the "metric" never surpasses 0.6 (which is a bit worrying, as Ines, in the video mentioned above, obtained a metric above 0.7 on her first try).
- Here you mentioned "it's worth reconsidering the annotation guidelines for your approach"; where are those annotation guidelines located?
Sorry if these are too many questions; please let me know if I should open a separate case for each.
Thank you!