I’m working on text classification and have a list of documents that have already been assigned a label outside of Prodigy. However, some of these documents have been wrongly labeled, or I simply don’t want to include them when training a text classification model.
I thought I could use Prodigy to confirm or reject each document and its respective label. I want to use the textcat.teach interface to accept/reject document-label pairs, but I want Prodigy to suggest exactly the label that was previously assigned.
Yes, that should be pretty straightforward! Let’s assume your input data looks something like this:
{"text": "This is a text", "label": "LABEL_ONE"}
{"text": "This is another text", "label": "LABEL_TWO"}
This is how Prodigy usually represents texts with a label, and the format should hopefully be very easy to generate from whichever format you already have.
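If your existing labels live in something like a list of (text, label) pairs, a minimal sketch of the conversion could look like this. It only uses the standard library, and the function names and the pair-based input are assumptions for illustration, not part of Prodigy:

```python
import json


def to_prodigy_tasks(pairs):
    """Turn (text, label) pairs into dicts with "text" and "label" keys."""
    return [{"text": text, "label": label} for text, label in pairs]


def write_jsonl(tasks, path):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

Calling write_jsonl(to_prodigy_tasks(your_pairs), "your_data.jsonl") would then give you a file in the format shown above, where your_pairs is whatever list of pairs you build from your own data.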
You can then load the data into Prodigy and annotate it. The mark recipe takes the exact data it receives, renders it in a given interface and presents it to you so you can accept/reject it. So basically, exactly what you want to do. Here’s an example:
prodigy mark your_dataset your_data.jsonl --view-id classification
You can then run db-out to export your dataset, and each example will have an "answer" key with either "accept", "reject" or "ignore". You can then use that information to filter the examples so you only have the accepted ones, double-check the rejected answers or do whatever else you need.
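As a rough sketch of that filtering step, assuming you’ve saved the exported examples to a JSONL file (for example by redirecting the db-out output to a file; the helper names here are made up for illustration):

```python
import json


def load_jsonl(path):
    """Read newline-delimited JSON, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def group_by_answer(examples):
    """Group annotated examples by the value of their "answer" key."""
    groups = {"accept": [], "reject": [], "ignore": []}
    for eg in examples:
        # Examples without an "answer" key end up under None.
        groups.setdefault(eg.get("answer"), []).append(eg)
    return groups
```

group_by_answer(load_jsonl("annotations.jsonl"))["accept"] would then be the document-label pairs you confirmed, and the "reject" group is the list to double-check or drop before training.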
Here’s the simplified and annotated source of the mark recipe btw so you can see what’s happening under the hood when you run it: