Review Process - Focus on specific label/tag

jrouss · February 9, 2024, 2:57am

I annotated a bunch of data and have been incrementally training a spancat model to detect certain fragments in messages.

I noticed one of my categories scores much lower than others, and I recall sort of changing the 'categorization criteria' halfway through and messing it up myself.

Is there an efficient technique for cleaning up a single label, or is that nonsensical somehow and I'm just going to have to go through all of them again?

honnibal · February 9, 2024, 11:37am

It definitely makes sense to focus on one label. I do that sort of thing a fair bit myself.

The process for this does involve rolling up your sleeves and stitching multiple commands together though. The reason it's like this is we found that the number of data manipulation steps someone might want to do is really broad, and so we focussed on having general utilities.

As with other data munging tasks, there are lots of ways to go about it, and as it's a one-off process they're all about equally good.

You could use prodigy db-out and a grep or other filter to find the examples with the label you want, write them to a new jsonl file, and queue that up for annotation.
You could do it in Python, accessing the db object to load the dataset, filter the examples, and save out a new dataset.
You could have a custom recipe that adjusts the stream, filtering out the examples that don't contain the label you care about, or just applying a different priority sorting.
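
To make the first two options a bit more concrete, here's a minimal sketch of the filtering step in plain Python. It assumes you've already exported the dataset with something like prodigy db-out your_dataset > annotations.jsonl, and that each record stores its spans under "spans" with a "label" key, as Prodigy's span tasks normally do. The function names, file names and the REQUEST / GREETING labels are made up for illustration.

```python
import json


def has_label(example, label):
    """Return True if any span in the example carries the given label."""
    return any(span.get("label") == label for span in example.get("spans") or [])


def filter_by_label(examples, label):
    """Keep only the examples that contain at least one span with the label."""
    return [eg for eg in examples if has_label(eg, label)]


def read_jsonl(path):
    """Load one JSON record per non-empty line."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, examples):
    """Write one JSON record per line."""
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


# Tiny in-memory demo with hypothetical labels; with real data you'd use
# read_jsonl("annotations.jsonl") and write_jsonl("to_review.jsonl", ...).
examples = [
    {"text": "hi there", "spans": [{"start": 0, "end": 2, "label": "GREETING"}]},
    {"text": "please call", "spans": [{"start": 0, "end": 11, "label": "REQUEST"}]},
    {"text": "no spans here"},
]
to_review = filter_by_label(examples, "REQUEST")
```

One caveat: this only finds examples where the label was actually applied. If the change of criteria means some spans were missed entirely, those examples won't show up in the filtered file, so you may want a broader filter (or a model-based sort) for that case.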

The basic concept is that you'll make a new queue with just these examples to annotate, get them reannotated, and then recombine the data. Prodigy generally takes an append-only approach to datasets to avoid losing data, so you'll want to create a new dataset containing the cleaned-up records, rather than trying to go back and modify the previous one.
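
The recombining step could look roughly like the sketch below. It assumes both the original and the reannotated records have been exported to lists of dicts, that each record carries the _input_hash Prodigy assigns to identify the input text, and that a reannotated record should replace the whole original record for the same input. The recombine function and the example labels are hypothetical, so adjust the matching rule to however you reannotated.

```python
def recombine(original, corrected):
    """Return the original examples with reannotated ones swapped in.

    Records are matched on "_input_hash". If several original records share
    an input hash, the corrected record is emitted once, at the position of
    the first match. Originals with no correction pass through unchanged.
    """
    fixed = {eg["_input_hash"]: eg for eg in corrected}
    merged = []
    emitted = set()
    for eg in original:
        key = eg["_input_hash"]
        if key in fixed:
            if key not in emitted:
                merged.append(fixed[key])
                emitted.add(key)
        else:
            merged.append(eg)
    return merged


# Hypothetical records: the second one was reannotated with a new label.
original = [
    {"_input_hash": 1, "text": "hi there", "spans": [{"start": 0, "end": 2, "label": "GREETING"}]},
    {"_input_hash": 2, "text": "please call", "spans": [{"start": 0, "end": 11, "label": "GREETING"}]},
]
corrected = [
    {"_input_hash": 2, "text": "please call", "spans": [{"start": 0, "end": 11, "label": "REQUEST"}]},
]
merged = recombine(original, corrected)
```

You'd then write the merged records out as jsonl and load them into a fresh dataset with prodigy db-in, leaving the original dataset untouched.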

Hope that all makes sense! Happy to explain more, especially about one approach or another (it's always tough to tell what workflows someone will favour, or what they'll find intuitive for their use-case).
