Undesirable "ignore" examples build up with low quality input streams

ines · December 28, 2017, 3:45pm

Thanks so much – it makes me really happy to see people be productive with Prodigy, and I'm also very excited about the tool you're building with it. I / we will get to answering the questions on the other thread later, since it's a little more complex!

I was actually thinking about something similar when I was doing NER annotations. My idea was to add some kind of "bookmark" option to the annotation cards (for example, a star icon in the top right corner) that'd let the user save tasks for later, for instance to re-annotate them using a different interface. This would also be independent of the action you choose: you could bookmark examples with wrong entity boundaries that you reject during ner.teach and re-annotate the boundaries with ner.mark later on. Or in your case, you could bookmark tasks "for later" and then ignore them so they don't have an impact during training.

Granted, this would add one more click or key press to the process. But it also means we could keep the simplicity of the four action buttons, and add the bookmarking as an optional feature the user can turn on (for example, as "bookmarks": true). The bookmarks could then be saved to an additional dataset, like "bookmarks_[name of dataset]".

I'm not sure if it'll be worth it, but you could also experiment with pre-annotating the stream to filter out garbage first and then run another session actually annotating the tweets. It sounds like more work, but garbage vs. non-garbage is a very quick decision, so once you're in a good flow, you might be looking at ~1 second per annotation here.
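Here's a minimal sketch of that two-pass idea in plain Python. It assumes the first-pass annotations are dicts with "text" and "answer" keys, and that "accept" was used to mean "not garbage". The function name keep_non_garbage is made up for illustration and isn't part of Prodigy.

```python
def keep_non_garbage(first_pass):
    """Yield fresh tasks for the second session from first-pass annotations.

    Assumption: "accept" in the first pass means "this tweet is not garbage".
    Rejected and ignored examples are dropped.
    """
    for eg in first_pass:
        if eg.get("answer") == "accept":
            # Only carry over the text, so the second session starts clean
            yield {"text": eg["text"]}


# Toy first-pass annotations (made-up data)
first_pass = [
    {"text": "Great talk on NER at the meetup today", "answer": "accept"},
    {"text": "aaaa http://spam http://spam", "answer": "reject"},
    {"text": "???", "answer": "ignore"},
]

second_pass = list(keep_non_garbage(first_pass))
```

The resulting second_pass list could then be used as the input stream for the actual annotation session.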

You could even see if you're able to train a Twitter garbage model using the data you create with this process. That model could then take care of filtering the stream first, to improve the quality of examples. Chaining models together like this can be pretty powerful. In your case, the overall volume of the annotations seems much more important than the individual annotation. So even if your garbage detector filters out examples by mistake, there are still so many other examples to annotate, and Twitter gives you an almost endless stream of new data.
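A rough sketch of what chaining could look like, assuming the stream is an iterable of dicts with a "text" key. The names filter_stream, garbage_score and the 0.5 threshold are hypothetical, and toy_garbage_score is only a stand-in for a trained garbage model.

```python
def filter_stream(stream, garbage_score, threshold=0.5):
    """Yield only the tasks whose garbage score is below the threshold.

    garbage_score is any callable mapping a text to a float in [0, 1].
    In practice this would wrap the trained garbage model.
    """
    for eg in stream:
        if garbage_score(eg["text"]) < threshold:
            yield eg


def toy_garbage_score(text):
    """Stand-in for a real model: the fraction of tokens that are links."""
    tokens = text.split()
    if not tokens:
        return 1.0  # treat empty tweets as garbage
    links = sum(token.startswith("http") for token in tokens)
    return links / len(tokens)


# Made-up example stream
stream = [
    {"text": "Loving the new spaCy release"},
    {"text": "http://a http://b win"},
    {"text": ""},
]

filtered = list(filter_stream(stream, toy_garbage_score))
```

Because the filter is a generator, it can wrap an endless stream. A stricter or looser threshold trades off how much garbage slips through against how many good tweets get dropped by mistake.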
