Parameter "dedup" in get_stream function

I created a simple Prodigy recipe that streams tasks from a JSONL file to the Prodigy interface and stores them, along with the answers, in a dataset.
My JSONL file contains tasks in the Prodigy NER format, i.e. each line is a dictionary with the keys 'text' and 'spans'. It contains repeated sentences, but with different spans specified in each task.
When I read the file through the get_stream function as shown below:

return {
    'dataset': dataset,
    'stream': get_stream(source, rehash=True, dedup=True),
    'view_id': 'ner',
    'update': None,
}

it appears to be filtering based on the input hash rather than the task hash. Is that the intended behavior?
I could use the filter_duplicates function instead, but I'd rather avoid it if this can be done with the dedup parameter.
I am using Prodigy version 1.4.2.

Yes, the dedup parameter is explicitly for filtering out duplicate incoming examples – i.e. if your data contains the exact same text twice. get_stream mostly cares about the input text and the input hashes, because in most recipes it runs before the model or some other process that then sets the spans, labels etc.
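
To make the distinction concrete, here's a rough sketch of how the two hashes behave for repeated text with different spans – this calls set_hashes directly and assumes the 1.x hashing defaults:

from prodigy import set_hashes

# two tasks with the same text but different spans
eg1 = set_hashes({"text": "Apple buys a startup", "spans": [{"start": 0, "end": 5, "label": "ORG"}]})
eg2 = set_hashes({"text": "Apple buys a startup", "spans": [{"start": 13, "end": 20, "label": "PRODUCT"}]})

print(eg1["_input_hash"] == eg2["_input_hash"])  # True – same text, same input hash
print(eg1["_task_hash"] == eg2["_task_hash"])    # False – different spans, different task hash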

So in your case, using filter_duplicates seems like a good solution! :+1:
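
For reference, this is roughly what that could look like in the recipe – the import paths and keyword arguments below are from memory for the 1.x API, so double-check them against your version:

from prodigy.components.loaders import get_stream
from prodigy.components.filters import filter_duplicates

stream = get_stream(source, rehash=True, dedup=False)
# drop exact duplicate tasks (same text *and* same spans), but keep
# repeated texts that ask about different spans
stream = filter_duplicates(stream, by_input=False, by_task=True)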


Hi Ines,

What would be the correct way to use the dedup argument if you don't want duplicates to be filtered out? I tried dedup=False, but it still filtered all my examples from the stream and showed "No tasks available".

This is my code:

stream = get_stream(source, api=api, loader=loader, rehash=True,
                    dedup=False, input_key='text')

return {
    'dataset': dataset,   # save annotations in this dataset
    'view_id': 'choice',  # use the choice interface
    'stream': stream,
    'choice_auto_accept': True,
    'config': {'choice_style': 'multiple', 'show_stats': True},
}

Thank you!

Hi! Setting dedup=True will filter out identical examples in the stream – e.g. {"text": "hello"} and {"text": "hello"}. Do you have anything in your dataset already? By default, Prodigy will also filter out questions that have already been answered – so that's the effect you might be seeing here. If you don't want this behaviour, you can set "auto_exclude_current": False in the "config". This means that the current dataset won't be automatically excluded.
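
Adapting the return value from the snippet above, that would look roughly like this:

return {
    'dataset': dataset,
    'view_id': 'choice',
    'stream': stream,
    'choice_auto_accept': True,
    'config': {
        'choice_style': 'multiple',
        'show_stats': True,
        # don't automatically exclude answers already in the current dataset
        'auto_exclude_current': False,
    },
}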