hi @joebuckle!
Yes! Have you tried the `train-curve` recipe?
This recipe is designed to test how accuracy improves with more annotated data. The key is to look at the shape of the training curve. If the curve is still increasing near the end (e.g., over the last 25%), that indicates you may get incremental value (information) from labeling more. However, if the curve is starting to "level off", that can indicate "diminishing marginal returns": there isn't much value in labeling more. In that case, to improve your model you may need to rethink your annotation scheme (e.g., change your class definitions).
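To make that rule of thumb concrete, here's a tiny sketch (with made-up accuracy numbers, not output from your data) of the check you'd do on the last segment of the curve:

```python
# Hypothetical accuracy scores at 25%, 50%, 75% and 100% of the training data,
# similar in spirit to what train-curve reports per segment.
scores = {0.25: 0.71, 0.50: 0.76, 0.75: 0.79, 1.00: 0.82}

fractions = sorted(scores)
last_gain = scores[fractions[-1]] - scores[fractions[-2]]

# The 1% threshold is arbitrary -- the point is to look at the final segment.
if last_gain > 0.01:
    print(f"Accuracy still rose by {last_gain:.1%} in the last segment: more labels likely help.")
else:
    print("The curve is leveling off: consider rethinking the annotation scheme instead.")
```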
You can find several support issues that help with interpreting and using `train-curve`:
There are also tips on customizing `train-curve`, like adding label stats and saving results:
Or modifying the evaluation metrics:
Are you creating your evaluation dataset yourself, or are you letting Prodigy create it for you automatically?
As the previous post mentions, you may want to create a dedicated hold-out (evaluation) dataset if you haven't already. In the `train` docs, there's this tip:
> For each component, you can provide optional datasets for evaluation using the `eval:` prefix, e.g. `--ner dataset,eval:eval_dataset`. If no evaluation sets are specified, the `--eval-split` is used to determine the percentage held back for evaluation.
Let me know if you have any questions on how to write a script for this.
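To sketch the idea (file names, dataset names, and the 80/20 split below are just placeholders), you could export your annotations and split off a fixed evaluation set like this:

```python
import json
import random

# Assumes you've exported your annotations first, e.g.:
#   prodigy db-out your_dataset > annotations.jsonl
random.seed(0)

with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

random.shuffle(examples)
n_eval = int(len(examples) * 0.2)  # hold out ~20% for evaluation
eval_examples, train_examples = examples[:n_eval], examples[n_eval:]

for path, subset in (("train.jsonl", train_examples), ("eval.jsonl", eval_examples)):
    with open(path, "w", encoding="utf8") as f:
        for eg in subset:
            f.write(json.dumps(eg) + "\n")

print(f"{len(train_examples)} training examples, {len(eval_examples)} evaluation examples")
```

You could then import each file into its own dataset with `db-in` and pass them to `train` as, for example, `--textcat train_set,eval:eval_set`.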
As you may have tried, to improve specific labels you can use active learning (`textcat.teach`), model-in-the-loop predictions (e.g., `textcat.correct`), or patterns (rules). Here are the docs for doing this with text classification.
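If you go the patterns route, it's just a JSONL file where each line maps a label to a keyword or token pattern. A minimal sketch (the label and terms are made up):

```python
import json

# Illustrative patterns for a hypothetical text classification label.
patterns = [
    {"label": "REFUND_REQUEST", "pattern": [{"lower": "refund"}]},
    {"label": "REFUND_REQUEST", "pattern": [{"lower": "money"}, {"lower": "back"}]},
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

That file can then be passed to `textcat.teach` via `--patterns patterns.jsonl`.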
Lastly, if you're working with multiple annotators, another approach to answering "How much data do I need to label?" is to consider bootstrapping for inter-rater reliability. My colleague Peter Baumgartner recently wrote an interesting blog post:
It's important to note that bootstrapping is a general concept that can be applied to any statistic, but it's typically computationally intensive, which is the limiting factor. For example, you could "bootstrap" (sample with replacement) `train-curve`, which would give you uncertainty estimates on accuracy. The catch is that this may take a very long time to run.
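Just to illustrate the general idea (this is independent of Prodigy, with made-up labels), bootstrapping a single accuracy score looks like this:

```python
import random

# Resample (gold, predicted) pairs with replacement many times and look at
# the spread of the accuracy statistic. Labels below are made up.
random.seed(0)
gold = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
pairs = list(zip(gold, pred))

def accuracy(sample):
    return sum(g == p for g, p in sample) / len(sample)

boot = sorted(
    accuracy([random.choice(pairs) for _ in pairs]) for _ in range(1000)
)
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"accuracy = {accuracy(pairs):.2f}, ~95% bootstrap interval = [{lo:.2f}, {hi:.2f}]")
```

Doing the same thing with `train-curve` would mean re-running the whole recipe on many resampled datasets, which is exactly why it gets expensive.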