I have a dataset of article titles and predicted topics, and I would like to use Prodigy to annotate whether each title (the text value) is highly relevant, moderately relevant, somewhat relevant, or irrelevant to its predicted topic (the label value). I'm currently using textcat.manual, but I'm not sure that's the best recipe for my use case, and it seems Prodigy has a different understanding of duplicate data than I was expecting.
When I run Prodigy, it looks like it only keeps the first appearance of each unique text value in my dataset, even though I want to review each unique (text, label) combination. Here's an example of my dataset. I think Prodigy would drop the third line here as a duplicate, even though the label value is different.
{"text": "This rock art may be the earliest depiction of dogs", "label": "Art"}
{"text": "Dream Job: Graffiti artist", "label": "Art"}
{"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"}
And here's the command I'm using:
prodigy textcat.manual ss_reviews review_dataset.jsonl --label IRRELEVANT,SOMEWHAT_RELEVANT,RELEVANT,HIGHLY_RELEVANT --exclusive
When I run that, I get the following warning, even though I don't expect any duplicates in my dataset:
⚠ Warning: filtered 48% of entries because they were duplicates. Only
4800 items were shown out of 9211. You may want to deduplicate your dataset
ahead of time to get a better understanding of your dataset size.
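One way to check whether the filtering is keyed on the text alone is to count unique text values against unique (text, label) pairs in the input file. The sketch below is plain Python and does not use Prodigy; it assumes each JSONL line has "text" and "label" keys as in the example above, and the helper name count_duplicates is hypothetical.

```python
import json
from collections import Counter


def count_duplicates(lines):
    """Count rows, unique texts, and unique (text, label) pairs in JSONL lines."""
    texts = Counter()
    pairs = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        total += 1
        texts[row["text"]] += 1
        pairs[(row["text"], row["label"])] += 1
    return {
        "rows": total,
        "unique_texts": len(texts),
        "unique_pairs": len(pairs),
    }


# The three example rows from this post.
sample = [
    '{"text": "This rock art may be the earliest depiction of dogs", "label": "Art"}',
    '{"text": "Dream Job: Graffiti artist", "label": "Art"}',
    '{"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"}',
]
stats = count_duplicates(sample)
print(stats)
```

To run it on the real data, pass an open file handle for review_dataset.jsonl instead of the sample list. If unique_pairs equals the row count while unique_texts is close to the number of items shown in the warning, that would suggest the duplicates are being detected from the text value only.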
Is there a better way to handle this type of task? Thanks!