Validating relevance of text to multiple topic labels using textcat.manual

andrewjoelpeters · April 26, 2023, 2:15pm

I have a dataset of article titles and predicted topics, and would like to use Prodigy to annotate whether the titles (text value) are highly relevant, moderately relevant, somewhat relevant, or irrelevant to the predicted topics (label value). I'm currently using textcat.manual, but I'm not sure that's the best recipe for my use case, and it seems Prodigy has a different understanding of duplicate data than I expect.

When I run Prodigy, it looks like it only keeps the first appearance of each unique text value in my dataset, even though I want to review each unique (text, label) combination. Here's an example of my dataset. I think Prodigy would drop the third line here as a duplicate, even though the label value is different.

{"text": "This rock art may be the earliest depiction of dogs", "label": "Art"}
{"text": "Dream Job: Graffiti artist", "label": "Art"}
{"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"}

And here's the command I'm using:

prodigy textcat.manual ss_reviews review_dataset.jsonl --label IRRELEVANT,SOMEWHAT_RELEVANT,RELEVANT,HIGHLY_RELEVANT --exclusive

When I run that, I get the following warning, even though I don't expect any duplicates in my dataset:

⚠ Warning: filtered 48% of entries because they were duplicates. Only
4800 items were shown out of 9211. You may want to deduplicate your dataset
ahead of time to get a better understanding of your dataset size.
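
One quick way to check whether a figure like that comes from repeated titles rather than true duplicates is to count unique values both ways before loading the file. This is a minimal sketch in plain Python, using the three example rows above as inline stand-in data; the helper name `count_unique` is made up for illustration and is not part of Prodigy.

```python
import json

# Inline stand-in for review_dataset.jsonl: the three example rows from the post.
JSONL_LINES = """\
{"text": "This rock art may be the earliest depiction of dogs", "label": "Art"}
{"text": "Dream Job: Graffiti artist", "label": "Art"}
{"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"}
"""


def count_unique(rows, keys):
    """Count distinct combinations of the given keys across all rows."""
    return len({tuple(row[k] for k in keys) for row in rows})


rows = [json.loads(line) for line in JSONL_LINES.splitlines() if line.strip()]

unique_texts = count_unique(rows, ("text",))
unique_pairs = count_unique(rows, ("text", "label"))

print(f"{len(rows)} rows, {unique_texts} unique texts, "
      f"{unique_pairs} unique text/label pairs")
```

If the number of unique texts is much lower than the number of unique text/label pairs, the filtering is most likely happening on the text alone, which matches the behavior described here.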

Is there a better way to handle this type of task? Thanks!

ryanwesslen · April 26, 2023, 4:54pm

hi @andrewjoelpeters!

Thanks for your question and welcome to the Prodigy community!

Have you seen the docs on Prodigy's hashing and duplication?

Yes, you're right: by default Prodigy filters duplicates by the task hash (which means "task" + "input"). However, you can modify this.

Since it seems like you want to deduplicate on the ("text", "label") combination, you may need to modify your recipe or create a custom recipe that uses set_hashes and specifies that the hashes should be built from these two fields.

from prodigy import set_hashes

stream = (set_hashes(eg, input_keys=("text", "label")) for eg in stream)
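
To make the effect of the key choice concrete, here is a rough stand-in written in plain Python. It is not Prodigy's implementation: `toy_hash` and `filter_seen` are hypothetical helpers that only mimic the idea of building a hash from selected fields and skipping examples whose hash has already been seen.

```python
import hashlib
import json


def toy_hash(eg, keys):
    """Hypothetical stand-in: hash only the selected fields of an example."""
    payload = json.dumps([eg.get(k) for k in keys])
    return hashlib.sha256(payload.encode("utf8")).hexdigest()


def filter_seen(stream, keys):
    """Yield only examples whose hash over `keys` has not been seen yet."""
    seen = set()
    for eg in stream:
        h = toy_hash(eg, keys)
        if h not in seen:
            seen.add(h)
            yield eg


examples = [
    {"text": "This rock art may be the earliest depiction of dogs", "label": "Art"},
    {"text": "Dream Job: Graffiti artist", "label": "Art"},
    {"text": "This rock art may be the earliest depiction of dogs", "label": "Animals"},
]

# Keyed on the text alone, the third row collides with the first and is dropped.
by_text = list(filter_seen(examples, ("text",)))
# Keyed on text and label together, all three rows are kept.
by_pair = list(filter_seen(examples, ("text", "label")))

print(len(by_text), len(by_pair))
```

The same idea applies to the real stream: once the hash covers both fields, two rows with the same title but different topics no longer count as duplicates.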

If you want to modify the built-in textcat.manual recipe, you can find it locally: run prodigy stats, find the Location: entry, open that file path, then look for the file recipes/textcat.py. You could then modify it. However, if you forget about this change or install a new Prodigy (say, in a new venv), you'd need to redo it.

An alternative is to create your own custom recipe that implements the logic you're looking for. Check out our docs for more help and/or find the textcat.py script and use that as a template to build a custom script.

Hope this helps!

andrewjoelpeters · April 26, 2023, 5:06pm

I missed the hashing and duplication section of the loaders doc. That's exactly what I needed. Thanks Ryan!
