Validating relevance of text to multiple topic labels using textcat.manual

I have a dataset of article titles and predicted topics, and would like to use Prodigy to annotate whether the titles (the text value) are highly relevant, moderately relevant, somewhat relevant, or irrelevant to their predicted topics (the label value). I'm currently using textcat.manual, but I'm not sure that's the best recipe for my use case, and it seems Prodigy has a different understanding of duplicate data than I expected.

When I run Prodigy, though, it looks like it only keeps the first appearance of each unique text value in my dataset, even though I want to review each unique (text, label) combination. Here's an example of my dataset; I think Prodigy would drop the third line as a duplicate, even though the label value is different.

{"text": "This rock art may be the earliest depiction of dogs", "label": "Art"}
{"text": "Dream Job: Graffiti artist", "label": "Art"}
{"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"}

And here's the command I'm using:

prodigy textcat.manual ss_reviews review_dataset.jsonl --label IRRELEVANT,SOMEWHAT_RELEVANT,RELEVANT,HIGHLY_RELEVANT --exclusive

When I run that, I get the following warning, even though I don't expect any duplicates in my dataset:

⚠ Warning: filtered 48% of entries because they were duplicates. Only
4800 items were shown out of 9211. You may want to deduplicate your dataset
ahead of time to get a better understanding of your dataset size.

Is there a better way to handle this type of task? Thanks!

hi @andrewjoelpeters!

Thanks for your question and welcome to the Prodigy community :wave:

Have you seen the docs on Prodigy's hashing and duplication?

Yes, you're right that by default Prodigy hashes by task (i.e. "input" + "task"). However, you can modify this.

Since it sounds like you want to deduplicate on the ("text", "label") combination, you may need to modify your recipe or create a custom recipe that uses set_hashes, specifying that the hashes should be set from those two fields.

from prodigy import set_hashes

# Rehash each incoming example so the input hash covers both "text" and "label"
stream = (set_hashes(eg, input_keys=("text", "label")) for eg in stream)
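As a quick sanity check (just a sketch, using the two "rock art" entries from your dataset above), you can confirm that including "label" in the input keys gives those two examples different input hashes even though the text is identical:

from prodigy import set_hashes

eg1 = {"text": "This rock art may be the earliest depiction of dogs", "label": "Art"}
eg2 = {"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"}

# With "label" included in the input keys, the two tasks no longer collide
h1 = set_hashes(eg1, input_keys=("text", "label"))
h2 = set_hashes(eg2, input_keys=("text", "label"))
print(h1["_input_hash"] != h2["_input_hash"])  # True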

If you want to modify the built-in textcat.manual recipe, you can find it locally by running prodigy stats, noting the Location: path, opening that folder, and looking for the file recipes/textcat.py. You could then modify it. Keep in mind, though, that if you upgrade Prodigy or install it into a new venv, you'd need to redo this change.

An alternative is to create your own custom recipe that implements the logic you're looking for. Check out our docs for more help, and/or find the textcat.py script and use it as a template for your own recipe.
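As a rough sketch of what that could look like (not a drop-in replacement for textcat.manual; the recipe name textcat.relevance and the file name relevance_recipe.py are just placeholders, and this assumes Prodigy v1.x with the JSONL loader):

import prodigy
from prodigy import set_hashes
from prodigy.components.loaders import JSONL

LABELS = ["IRRELEVANT", "SOMEWHAT_RELEVANT", "RELEVANT", "HIGHLY_RELEVANT"]

@prodigy.recipe(
    "textcat.relevance",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the JSONL file", "positional", None, str),
)
def textcat_relevance(dataset, source):
    def get_stream():
        for eg in JSONL(source):
            # Hash on both "text" and "label" so the same title can appear
            # once per predicted topic instead of being dropped as a duplicate
            eg = set_hashes(eg, input_keys=("text", "label"))
            # Show the relevance categories as mutually exclusive options
            eg["options"] = [{"id": label, "text": label} for label in LABELS]
            yield eg

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "choice",
        "config": {"choice_style": "single"},
    }

You'd then run it with something like prodigy textcat.relevance ss_reviews review_dataset.jsonl -F relevance_recipe.py.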

Hope this helps!

I missed the hashing and duplication section of the loaders docs; that's exactly what I needed. Thanks Ryan!