I'm creating a new dataset, and so far I've made about 500 annotations. I ran the train-curve command, and the score decreased from 0.3 to 0.29 in the last sample. What should I do at this point to make sure I don't end up annotating a dataset that won't work? Are there strategies, like going back to make the annotations more consistent, or troubleshooting to find a root cause? Or should I just keep annotating and hope that it improves?
It's always difficult to provide generic advice for this kind of ML question, as a lot depends on the data (size) and what the accuracy table actually looks like in detail.
In general, a 1-percentage-point drop isn't necessarily something to worry about, but it depends on the larger trend. It might be that you've reached a sort of "plateau", where performance won't necessarily improve anymore even if you continue annotating.
If you're suspecting some kind of inconsistency in your later annotations, you might also consider using the review recipe to double-check that and enforce consistency.
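As a sketch of what that could look like on the command line (the exact arguments may differ between Prodigy versions, and the dataset names here are placeholders):

```shell
# Stream the examples from an existing NER dataset back into the review
# interface, and save the double-checked, merged answers to a new dataset
prodigy review ner_data_reviewed ner_data --view-id ner_manual
```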
But you could also just annotate another 50-100 examples first and see what happens, to determine whether the downward trend continues or whether the 1pp drop was just a "fluke".
If you do notice that you're stuck around 30% on your task, perhaps it's worth reconsidering the annotation guidelines for your approach? I don't know any of the details, though, so it's difficult to give more specific advice.
I was checking this thread because it's exactly my own case at the moment. What I am doing is training a custom NER model, which recognizes 4 different labels (there are a bit more details here). Following Ines Montani's introductory video, I started off with 500 texts (80-20 split), just to see how things were going, and I got this result. As you can see, train-curve suggested to "add more labeled samples to training set", which I did: by adding 500 more samples (for a total of 1000; 80-20 split), I got the following plot:
With these results, I have the following questions:
What is the metric reported by train-curve? (accuracy, precision, F1...)
Intuitively, it seems that this model is overfitting; however, with fewer samples, I see that the metric never surpasses 0.6 (which is a bit worrying, as Ines, in the video mentioned before, obtained a metric above 0.7 on her first try). Here you mentioned "it's worth reconsidering the annotation guidelines for your approach"; where are those annotation guidelines located?
Sorry if these are too many questions; please let me know if I should open separate threads for each.
This comes from spaCy's Scorer. You can see the code/formulas here:
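For NER, the score is the entity-level F-score, derived from precision and recall over predicted entity spans. A minimal sketch of those formulas in plain Python (not the actual spaCy code, just the same arithmetic):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F-score from true positive, false positive
    and false negative counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fscore = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, fscore

# Example: 60 correctly predicted entities, 20 spurious, 20 missed
p, r, f = prf(tp=60, fp=20, fn=20)
print(p, r, f)  # → 0.75 0.75 0.75
```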
FYI, if you're interested, there are ways to add custom evaluation metrics (this example uses textcat, but it should be similar for NER):
Excellent question! I searched the Prodigy/spaCy documentation and found that there isn't any documentation on creating annotation schemes. I suspect @SofieVL meant annotation schemes in general, in terms of carefully defining what each entity means. This made me realize there's a lot of potential for guidelines to help users here.
The closest documentation I know of is Matt's 2018 PyData talk. Around 8 minutes in, he goes through the "Applied NLP Pyramid of Greatness", where he discusses the role of defining an annotation scheme. He then walks through an example of an inadequate annotation scheme for NER. Hopefully this gives you a bit of an idea of why annotation schemes are important.
Let me know if you have any thoughts or questions! We greatly appreciate your feedback/questions, so please keep them coming!