This is a digression from the “Best Practices for text classifier annotations” thread, where we were talking about how I have more ignore entries in my DB than accept and reject combined, and ways to control or prune the excessive number of ignore examples in my DB.
That’s a cool idea, and I actually wrote a small helper to do that for my own testing.
That’s great to know, thanks! The more I think about ignore entries, the less convinced I am that pruning them is the best approach to solving my problem. I’m trying to work around two things:
There are a lot of garbage entries in my streams, because of the volume of messages on Twitter, broad search terms, duplicated and/or mangled tweets, etc.
There are some entries that I want to “skip for later” and others that I want to “discard and forget”
I do some filtering in my recipe, but the unwanted ignore entries are mixed in with those that I want to save for later. This creates a related problem: if I want to go and change the answer on the “skip for now” questions, as in your example, I have to sort through all the items I wanted to “discard and forget” in order to find the ones that were potentially interesting but unclear in the moment.
So that I’m not just complaining about something without offering potential solutions, here’s one idea for how Prodigy might be able to make this problem go away entirely for me. What if there were a fourth option to allow discarding examples during the annotation process? Then your actions would be something like: ["Accept", "Reject", "Skip", "Discard"], where “skip” could easily be “ignore” and “discard” would get rid of the example.
If you still want to recognize discarded examples later, the input hash could be stored at discard time for lookup. And if ignored items should be stored for auditing/review, as in your example, a boolean option in prodigy.json could enable/disable the discard action.
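As a rough sketch of what that hash lookup could look like: the `input_hash` function below is just a stand-in (hashing the raw text), not Prodigy’s actual internal input hash, and the names are hypothetical.

```python
import hashlib

def input_hash(example):
    # Stand-in for Prodigy's internal input hash: here we simply
    # hash the raw text (an assumption, not the real implementation).
    return hashlib.sha1(example["text"].encode("utf-8")).hexdigest()

def filter_discarded(stream, discarded_hashes):
    """Skip any incoming example whose input hash was discarded earlier."""
    for example in stream:
        if input_hash(example) not in discarded_hashes:
            yield example

# Toy demo: one example was previously "discarded and forgotten"
stream = [{"text": "keep me"}, {"text": "drop me"}, {"text": "also keep"}]
discarded = {input_hash({"text": "drop me"})}
print([eg["text"] for eg in filter_discarded(stream, discarded)])
# prints ['keep me', 'also keep']
```

A generator like this could run at the top of a custom recipe, so discarded tweets never reach the annotation UI again.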
Thanks so much – it makes me really happy to see people be productive with Prodigy, and I’m also very excited about the tool you’re building with it. I / we will get to answering the questions on the other thread later, since it’s a little more complex!
I was actually thinking about something similar when I was doing NER annotations. My idea was to add some kind of “bookmark” option to the annotation cards (for example, a star icon in the top right corner) that’d let the user save tasks for later. For example, to reannotate them using a different interface. This would also be independent of the action you choose – for example, you could bookmark examples with wrong entity boundaries that you reject during ner.teach and re-annotate the boundaries with ner.mark later on. Or in your case, you could bookmark tasks “for later” and then ignore them so they don’t have an impact during training.
Granted, this would add one more click or key press to the process. But it also means we could keep the simplicity of the four action buttons, and add the bookmarking as an optional feature the user can turn on (for example, as "bookmarks": true). The bookmarks could then be saved to an additional dataset, like "bookmarks_[name of dataset]".
I’m not sure if it’ll be worth it, but you could also experiment with pre-annotating the stream to filter out garbage first and then run another session actually annotating the tweets. It sounds like more work, but garbage vs. non-garbage is a very quick decision, so once you’re in a good flow, you might be looking at ~1 second per annotation here.
You could even see if you’re able to train a Twitter garbage model using the data you create with this process. That model could then take care of filtering the stream first, to improve the quality of examples. Chaining models together like this can be pretty powerful. In your case, the overall volume of the annotations seems much more important than any individual annotation. So even if your garbage detector filters out examples by mistake, there are still so many other examples to annotate, and Twitter gives you an almost endless stream of new data.
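The chaining idea above could be sketched roughly like this. The scorer is passed in as a callable, so for the demo a toy keyword rule stands in for a real trained model; with an actual spaCy text classifier you’d plug in something like `lambda text: nlp(text).cats["GARBAGE"]` (the "GARBAGE" label and model are assumptions, not an existing artifact).

```python
def filter_garbage(stream, score_garbage, threshold=0.5):
    """Drop examples the garbage scorer is confident about.

    score_garbage: callable returning P(garbage) for a text, e.g.
    a trained spaCy text classifier's "GARBAGE" category score
    (hypothetical model/label).
    """
    for example in stream:
        if score_garbage(example["text"]) < threshold:
            yield example

# Toy scorer standing in for a real model:
toy_score = lambda text: 0.9 if "buy now" in text.lower() else 0.1
stream = [{"text": "Interesting research thread"},
          {"text": "BUY NOW cheap followers"}]
print([eg["text"] for eg in filter_garbage(stream, toy_score)])
# prints ['Interesting research thread']
```

Because the scorer is injected, the same filter works for the manual pre-annotation pass (a human decision) and the later model-driven pass.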
Did this ever get implemented? I would love to use this option. Bookmarking examples, especially ones that sit on the edge cases of the annotation scheme, would help us come back to them and review them. The data point could be valid or invalid, so I really like the idea of keeping it independent of the annotation process.