Hi there,
I have a dataset.jsonl looking something like this:
{'text': 'xyz', 'annotation_task': 'a'}
{'text': 'xyz', 'annotation_task': 'b'}
{'text': 'xyz', 'annotation_task': 'c'}
{'text': 'stu', 'annotation_task': 'a'}
{'text': 'stu', 'annotation_task': 'b'}
{'text': 'stu', 'annotation_task': 'c'}
...
Depending on the 'annotation_task', a different set of labels is added to the options. This setup is necessary, as the annotation tasks need to be broken down, but the same sample should be annotated in a row. I've created a custom recipe and use `choice_style: multiple`. I'm working with prodigy==1.15.2.
I'm using the following to read in the data:
from prodigy.components.stream import get_stream
from prodigy import set_hashes
stream = get_stream(dataset, dedup=False, rehash=True)
stream = (set_hashes(eg, input_keys=("text"), task_keys=("annotation")) for eg in stream)
I specify `dedup=False` and assign different `task_keys`; however, the input data is always deduplicated and a sample only shows up with annotation_task 'a'. I've tried all sorts of rehashing, but it seems that as long as the 'text' key is the same in the input stream, deduplication happens. What can I do to prevent deduplication?
Hi @vera-bernhard,
Could you try adding the `overwrite` argument set to `True` in your call to `set_hashes`? That is:
stream = (set_hashes(eg, task_keys=("annotation",), overwrite=True) for eg in stream)
Also, please note that I added a comma to the `task_keys` value, as it needs to be a tuple (I just realized this detail is missing from the docs).
You'll need `overwrite=True` because you're already setting hashes in the call to `get_stream`, so to make your custom hashing effective, you need to overwrite those default ones.
Let's see if fixing the call to `set_hashes` solves your problem.
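As a side note on the tuple detail: in Python, parentheses alone don't create a tuple; the trailing comma does. This small standalone snippet (plain Python, no Prodigy required) shows why `("annotation")` behaves differently from `("annotation",)`:

```python
# Parentheses alone don't make a tuple -- the trailing comma does.
keys_wrong = ("annotation")    # just the string "annotation"
keys_right = ("annotation",)   # a one-element tuple

print(type(keys_wrong).__name__)  # -> str
print(type(keys_right).__name__)  # -> tuple

# Iterating over the string yields individual characters rather than
# the intended key name, so the real key is never looked up.
print(list(keys_wrong)[:3])  # -> ['a', 'n', 'n']
print(list(keys_right))      # -> ['annotation']
```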
In any case your examples:
{'text': 'xyz', 'annotation_task': 'a'}
{'text': 'xyz', 'annotation_task': 'b'}
{'text': 'xyz', 'annotation_task': 'c'}
{'text': 'stu', 'annotation_task': 'a'}
{'text': 'stu', 'annotation_task': 'b'}
{'text': 'stu', 'annotation_task': 'c'}
would each get a different `_task_hash` by default, because if none of the default keys are present, the entire dictionary is used to generate the `_task_hash`, and in this toy example each line is a different combination of keys and values.
Also, one thing to keep in mind is that the `_input_hash` is always prefixed to the string from which the `_task_hash` is computed.
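To illustrate the principle (illustration only: Prodigy uses its own MurmurHash-based hashing internally; `hashlib.md5` and the helper names below are stand-ins, not Prodigy's actual API):

```python
import hashlib
import json

def toy_input_hash(example: dict) -> str:
    # Stand-in for the input hash: computed from the input text only.
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()

def toy_task_hash(example: dict, input_hash: str) -> str:
    # The input hash is prefixed to the serialized task content, so
    # identical task content on different inputs still hashes differently.
    payload = str(input_hash) + json.dumps(example, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

eg_a = {"text": "xyz", "annotation_task": "a"}
eg_b = {"text": "xyz", "annotation_task": "b"}

# Same text -> same input hash...
print(toy_input_hash(eg_a) == toy_input_hash(eg_b))  # -> True
# ...but the full dicts differ, so the task hashes differ too.
print(toy_task_hash(eg_a, toy_input_hash(eg_a))
      == toy_task_hash(eg_b, toy_input_hash(eg_b)))  # -> False
```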
Thanks for your quick response. The issue remained after having set `task_keys` and `overwrite`. However, I realised that a PatternMatcher further down the line was overwriting the correctly set `task_keys`. Now everything is working correctly.
Awesome! Thanks for reporting back. Also, not sure if you saw it: Prodigy has basic and verbose logging, which is usually helpful for debugging. You can turn it on via `PRODIGY_LOGGING=basic` or `PRODIGY_LOGGING=verbose` prepended to your Prodigy command.
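For example (a command-line fragment; the recipe, dataset, and file names here are placeholders, so substitute your own):

```shell
# Set the variable just for this one command:
PRODIGY_LOGGING=verbose prodigy my-choice-recipe my_dataset ./dataset.jsonl -F recipe.py

# Or export it for the whole shell session:
export PRODIGY_LOGGING=basic
```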