I annotated a bunch of data and have been incrementally training a spancat model to detect certain fragments in messages.
I noticed one of my categories scores much lower than others, and I recall sort of changing the 'categorization criteria' halfway through and messing it up myself.
Is there an efficient technique for cleaning up just one label, or is that nonsensical somehow and I'm just going to have to go through them all again?
It definitely makes sense to focus on one label. I do that sort of thing a fair bit myself.
The process for this does involve rolling up your sleeves and stitching multiple commands together though. That's because we found that the range of data manipulation steps someone might want to do is really broad, so we focussed on providing general utilities.
As with other data munging tasks, there are lots of ways to go about it, and since it's a one-off process they're all about as good.
- You could use `prodigy db-out` and a `grep` or other filter to find the label you want, make a new `jsonl` file, and queue that up for annotation.
- You could do it in Python, accessing the `db` object to load the dataset, filter the examples, and save out a new dataset.
- You could have a custom recipe that adjusts the stream, filtering out the examples with a different label, or just applying a different priority sorting.
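To make the filtering idea a bit more concrete, here's a rough sketch in plain Python that works on an exported JSONL file. It assumes each line is one annotation task with an optional "spans" list whose entries carry a "label" key; the file names and the label `MY_LABEL` are made up for illustration, so adjust them to your data.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


def has_label(example, label):
    """True if any span in the example carries the given label.

    Assumes spans look like {"start": ..., "end": ..., "label": ...}.
    """
    spans = example.get("spans") or []
    return any(span.get("label") == label for span in spans)


def filter_by_label(examples, label):
    """Keep only the examples that contain at least one span with `label`."""
    return [eg for eg in examples if has_label(eg, label)]


# Hypothetical usage, after exporting the dataset to all.jsonl
# (for example with `prodigy db-out`):
#
#   examples = read_jsonl("all.jsonl")
#   write_jsonl("to_redo.jsonl", filter_by_label(examples, "MY_LABEL"))
```

One thing to keep in mind: this only finds examples where the label was actually annotated. If your criteria changed in a way that means you missed spans, you may also want to look at some examples without that label.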
The basic concept is that you'll make a new queue with just these examples to annotate, get them reannotated, and then recombine the data. Prodigy generally takes an append-only approach to datasets to avoid losing data, so you'll want to create a new dataset with the cleaned-up records, rather than trying to go back and modify the previous one.
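The recombining step could look something like the sketch below. It assumes both sets of examples have been loaded as lists of dicts, and that each one has the `_input_hash` value Prodigy assigns to a task's input; the function name `recombine` is just for illustration. It also assumes each reannotated example holds the full corrected annotation for that text, including spans for your other labels, so it can replace the old record outright.

```python
def recombine(original, reannotated, key="_input_hash"):
    """Return original examples with any redone ones swapped for the new version.

    Nothing is modified in place: the result is a new list that can be
    saved as a new dataset, leaving the old one untouched.
    """
    redone = {eg[key] for eg in reannotated}
    # Drop the old versions of anything that was reannotated...
    kept = [eg for eg in original if eg.get(key) not in redone]
    # ...and append the corrected versions.
    return kept + list(reannotated)
```

You'd then write the result out as JSONL and load it into a fresh dataset (for example with `prodigy db-in`), and train from that instead of the old one.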
Hope that all makes sense! Happy to explain more, especially about one approach or another (it's always tough to tell what workflows someone will favour, or what they'll find intuitive for their use-case).