For my current task, I'm spending around 10 seconds labeling each batch (I'm getting 10 examples per batch) and then around 60 seconds waiting for the next batch. Is it possible to increase the batch size so I have a (much) larger batch before I start labeling? Or is this the wrong way to think about things? For example, maybe:
- I am reaching the limits of what active learning can help with
- I should be using keywords to filter down my dataset (a lot of my examples don't contain any entities)
On this second point, is there a way to leverage the patterns file for this, i.e. only annotate examples which have a pattern match?
Which model are you using and how long are your texts? If you're using ner.teach, Prodigy will not just ask the model for the best analysis, but for multiple possible analyses, so the longer your texts, the longer this may take. If you're not already doing this, try using shorter examples, like single sentences.
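To illustrate the "shorter examples" idea, here is a minimal sketch that splits each text into sentence-sized tasks before annotation. The helper names (split_into_sentences, sentence_tasks) and the meta field are hypothetical, and the regex splitter is a deliberately naive stand-in, so a proper sentence segmenter (for example, the one in your spaCy pipeline) will likely do better on real data.

```python
import re


def split_into_sentences(text):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def sentence_tasks(examples):
    """Yield one task dict per sentence, remembering which input it came from."""
    for index, example in enumerate(examples):
        for sentence in split_into_sentences(example["text"]):
            # "source_index" is an illustrative meta field, not a Prodigy requirement.
            yield {"text": sentence, "meta": {"source_index": index}}
```

The resulting dicts could be written out as JSONL and used as the input source, so each task the model has to score is only one sentence long.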
In addition to that, you can also experiment with changing the batch_size setting. When the queue is running low, Prodigy will fetch the next batch of examples in the background, so you may be able to find a batch size where annotating one batch takes you about as long as the model needs to prepare the next one.
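As a rough sketch of how the setting could be changed, the snippet below writes batch_size into a prodigy.json file in a given directory, keeping any settings that are already there. It assumes you want a local config file in the working directory you start Prodigy from; the helper name write_local_config is made up for this example, and the value you pass is something to tune by trial rather than a recommendation.

```python
import json
from pathlib import Path


def write_local_config(batch_size, directory="."):
    """Set "batch_size" in <directory>/prodigy.json, preserving other keys."""
    path = Path(directory) / "prodigy.json"
    # Start from the existing config if there is one, otherwise from scratch.
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["batch_size"] = batch_size
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return path
```

For example, write_local_config(30) would ask for 30 examples per batch. Since you currently annotate at about one example per second, it is worth timing a few different values to see whether the wait actually shrinks, because a larger batch may also take the model longer to score.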
If you get the feeling that you're not seeing enough examples for the given score threshold, then filtering with keywords or patterns is definitely an option. It will give your model more positive examples to learn from. If you only want to annotate examples with matches, check out the match recipe: https://prodi.gy/docs/recipes#match If you're annotating entities, you want to set --label-span to add the matched label to the span.
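If you'd rather pre-filter the data yourself before loading it, the idea can be sketched in plain Python as below. This is only an approximation of what a pattern matcher does: it assumes a patterns file with one JSON object per line, handles exact string patterns and token patterns made up purely of "lower" attributes, and uses case-insensitive substring checks (so "apple" would also match "pineapple"). The function names are hypothetical.

```python
import json


def load_phrases(lines):
    """Turn patterns (one JSON object per line) into lowercase phrases."""
    phrases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pattern = json.loads(line)["pattern"]
        if isinstance(pattern, str):
            phrases.append(pattern.lower())
        else:
            # Only token patterns consisting entirely of "lower" keys are supported here.
            words = [token.get("lower") for token in pattern]
            if words and all(words):
                phrases.append(" ".join(words))
    return phrases


def has_match(text, phrases):
    """Return True if any phrase occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def filter_stream(examples, phrases):
    """Keep only the examples whose text contains at least one phrase."""
    return (example for example in examples if has_match(example["text"], phrases))
```

Usage would look something like phrases = load_phrases(open("patterns.jsonl", encoding="utf8")), followed by writing the filtered examples to a new JSONL file. Patterns that use other token attributes are silently skipped in this sketch, so the recipe's own matching is the safer choice if your patterns are more complex.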
Alternatively, you could also start with a fully manual or semi-manual round of annotation using ner.manual (with patterns) or ner.correct (if you have a model that already predicts something).