Best way to re-label / re-annotate existing data based on a condition

Dear prodigy-Team,

I am currently using Prodigy to create a NER "benchmark" dataset in order to compare the performance of several models trained on synthetic data.
So far everything works great: I am using ner.correct with the --update flag to keep a pre-trained model in the loop that "helps" me with the labeling.
But afterwards, during benchmarking, I checked the labels the models didn't predict (false negatives) and realized I had made a few annotation mistakes (incorrect span boundaries).
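
For reference, this is roughly the command I'm running (the dataset, model, and label names here are placeholders, not our real setup):

```
prodigy ner.correct ner_benchmark en_core_web_lg ./texts.jsonl --label PERSON,ORG --update
```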

So I would like to re-label or re-annotate all the examples for which the model didn't predict the correct NER spans (false negatives).

Question: What would be the best way to do this?

I thought about the following workflow (see the command sketch after this list):

  • export the dataset (as *.jsonl)
  • split the data into two categories: the correct examples and the ones to re-annotate
  • import both sets into Prodigy again
  • run ner.correct on the "re-annotate" dataset
  • use data-to-spacy to combine both datasets and export them as one
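
In commands, that plan might look roughly like this (dataset names, the label set, and the split_data.py helper are placeholders for illustration):

```
# 1) export the annotated dataset
prodigy db-out ner_benchmark > ner_benchmark.jsonl

# 2) split into correct examples and ones to re-annotate
#    (split_data.py is a hypothetical helper that applies the false-negative condition)
python split_data.py ner_benchmark.jsonl correct.jsonl recheck.jsonl

# 3) import the already-correct examples into a fresh dataset
prodigy db-in ner_correct correct.jsonl

# 4) re-annotate the problematic examples with the model in the loop
prodigy ner.correct ner_recheck en_core_web_lg recheck.jsonl --label PERSON,ORG --update

# 5) combine both datasets and export them as one spaCy corpus
prodigy data-to-spacy ./corpus --ner ner_correct,ner_recheck
```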

But maybe there is a better way? I feel that a custom recipe might be a good idea, as we just started working with Prodigy and I think we will need this kind of "conditioned re-labeling" more than once.
It would be great if you had a suggestion on how to implement this as a custom recipe.


Your method sounds reasonable to me, although you might also just use ner.manual with pre-labelled data if you find that more intuitive. I usually have a notebook with a script that pulls data from Prodigy and builds a candidate list of items to double-check; a rough sketch of that idea follows below. In general, I recommend looking at examples where the annotation and the model disagree on a label. Usually, there's something insightful when that happens.
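
Something along these lines; this is a minimal sketch under assumptions (the dataset name, the model, and the exact disagreement check are placeholders, not your setup):

```python
# Minimal sketch: pull annotations from Prodigy and flag examples where a
# model's predictions disagree with the saved spans. "ner_benchmark" and
# the model name are placeholders.
import spacy
import srsly
from prodigy.components.db import connect

nlp = spacy.load("en_core_web_lg")
db = connect()
examples = db.get_dataset("ner_benchmark")

candidates = []
for eg in examples:
    gold = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
    doc = nlp(eg["text"])
    pred = {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}
    if gold != pred:  # model and annotation disagree -> worth a second look
        candidates.append(eg)

# Write the candidates to a file that ner.manual / ner.correct can load;
# existing spans will show up pre-highlighted.
srsly.write_jsonl("recheck.jsonl", candidates)
print(f"{len(candidates)} of {len(examples)} examples flagged for review")
```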

You might appreciate this Prodigy video I made a while ago on the topic of "finding bad labels". It showcases a few general techniques for text classification that you might find inspiring.

If you're annotating with a team, I might try to formalize this a bit further, mainly because you'll want to document annotation mistakes and prevent them from happening again in the future.