Hi there,
I have a dataset.jsonl looking something like this:
{'text': 'xyz', 'annotation_task': 'a'}
{'text': 'xyz', 'annotation_task': 'b'}
{'text': 'xyz', 'annotation_task': 'c'}
{'text': 'stu', 'annotation_task': 'a'}
{'text': 'stu', 'annotation_task': 'b'}
{'text': 'stu', 'annotation_task': 'c'}
...
Depending on the 'annotation_task', a different set of labels is added to the options. This setup is necessary, as the annotation tasks need to be broken down, but the same sample should be annotated in a row. I've created a custom recipe and use `choice_style: multiple`. I'm working with prodigy==1.15.2.
I'm using the following to read in the data:
from prodigy.components.stream import get_stream
from prodigy import set_hashes
stream = get_stream(dataset, dedup=False, rehash=True)
stream = (set_hashes(eg, input_keys=("text"), task_keys=("annotation")) for eg in stream)
I specify `dedup=False` and assign different `task_keys`; however, the input data is always deduplicated and a sample only shows up with annotation_task 'a'. I've tried all sorts of rehashing, but it seems that as long as the 'text' key is the same in the input stream, deduplication happens. What can I do to prevent deduplication?
Hi @vera-bernhard,
Could you try adding the `overwrite` argument set to `True` in your call to `set_hashes`? That is:
stream = (set_hashes(eg, task_keys=("annotation",), overwrite=True) for eg in stream)
Also, please note that I added a comma to the `task_keys` value, as it needs to be a tuple (I just realized this detail is missing from the docs).
You'll need `overwrite=True` because you're already setting hashes in the call to `get_stream`, so to make your custom hashing effective, you need to overwrite those default ones.
Let's see if fixing the call to `set_hashes` solves your problem.
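As a side note on the tuple detail: in Python, parentheses alone don't create a tuple; the trailing comma does. This small standalone snippet (plain Python, no Prodigy required) shows why `("annotation")` behaves differently from `("annotation",)`:

```python
# Parentheses alone don't make a tuple -- the trailing comma does.
keys_wrong = ("annotation")    # just the string "annotation"
keys_right = ("annotation",)   # a one-element tuple

print(type(keys_wrong).__name__)  # -> str
print(type(keys_right).__name__)  # -> tuple

# Iterating over the string yields individual characters rather than
# the intended key name, so the real key is never looked up.
print(list(keys_wrong)[:3])  # -> ['a', 'n', 'n']
print(list(keys_right))      # -> ['annotation']
```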
In any case your examples:
{'text': 'xyz', 'annotation_task': 'a'}
{'text': 'xyz', 'annotation_task': 'b'}
{'text': 'xyz', 'annotation_task': 'c'}
{'text': 'stu', 'annotation_task': 'a'}
{'text': 'stu', 'annotation_task': 'b'}
{'text': 'stu', 'annotation_task': 'c'}
would each get a different `_task_hash` by default, because if none of the default keys are present, the entire dictionary is used to generate the `_task_hash`, and in this toy example each line is a different combination of keys and values.
Also, one thing to keep in mind is that the `_input_hash` is always prefixed to the string from which the `_task_hash` is computed.
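To illustrate the principle (illustration only: Prodigy uses its own MurmurHash-based hashing internally; `hashlib.md5` and the helper names below are stand-ins, not Prodigy's actual API):

```python
import hashlib
import json

def toy_input_hash(example: dict) -> str:
    # Stand-in for the input hash: computed from the input text only.
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()

def toy_task_hash(example: dict, input_hash: str) -> str:
    # The input hash is prefixed to the serialized task content, so
    # identical task content on different inputs still hashes differently.
    payload = str(input_hash) + json.dumps(example, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

eg_a = {"text": "xyz", "annotation_task": "a"}
eg_b = {"text": "xyz", "annotation_task": "b"}

# Same text -> same input hash...
print(toy_input_hash(eg_a) == toy_input_hash(eg_b))  # -> True
# ...but the full dicts differ, so the task hashes differ too.
print(toy_task_hash(eg_a, toy_input_hash(eg_a))
      == toy_task_hash(eg_b, toy_input_hash(eg_b)))  # -> False
```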
Thanks for your quick response. The issue remained after having set `task_keys` and `overwrite`. However, I realised that a PatternMatcher further down the line was overwriting the correctly set `task_keys`. Now everything is working correctly.
Awesome! Thanks for reporting back. Also, not sure if you saw it: Prodigy has basic and verbose logging, which is usually helpful for debugging. You can turn it on via `PRODIGY_LOGGING=basic` or `PRODIGY_LOGGING=verbose` prepended to your Prodigy command.
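For example (a command-line fragment; the recipe, dataset, and file names here are placeholders, so substitute your own):

```shell
# Set the variable just for this one command:
PRODIGY_LOGGING=verbose prodigy my-choice-recipe my_dataset ./dataset.jsonl -F recipe.py

# Or export it for the whole shell session:
export PRODIGY_LOGGING=basic
```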