Parameter "dedup" in get_stream function

I created a simple Prodigy recipe that streams tasks from a JSONL file to the Prodigy interface and stores them, along with the answers, in a dataset.
My JSONL file contains tasks in the Prodigy NER format, i.e., each line is a dictionary with the keys 'text' and 'spans'. It contains repeated sentences, but with different spans specified in each task.
When I read the file through the get_stream function like below:

return {
    'dataset': dataset,
    'stream': get_stream(source, rehash=True, dedup=True),
    'view_id': 'ner',
    'update': None,
}

it appears to be filtering based on the input hash rather than the task hash. Is that the intended behavior?
I could use the filter_duplicates function instead, but I would rather avoid it if the same thing can be done with the dedup parameter.
I am using prodigy version 1.4.2

Yes, the dedup parameter is explicitly for filtering out duplicate incoming examples – i.e. if your data contains the exact same text twice. get_stream mostly cares about the input text and its input hash, because in most recipes it runs before the model or some other process that then sets the spans, labels etc.

So in your case, using filter_duplicates seems like a good solution! :+1:
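To illustrate the distinction, here's a minimal standalone sketch of why deduplicating by input hash drops the repeated sentences while deduplicating by task hash keeps them. The hashing scheme below is hypothetical (it is not Prodigy's actual implementation); the generator mirrors the intent of filter_duplicates, which keys on the task hash:

```python
import hashlib
import json

def input_hash(task):
    # Keys only on the input text - what dedup=True effectively uses
    return hashlib.md5(task["text"].encode("utf8")).hexdigest()

def task_hash(task):
    # Keys on the text plus the spans, so the same sentence with
    # different spans yields a different hash
    payload = json.dumps(
        {"text": task["text"], "spans": task.get("spans", [])},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf8")).hexdigest()

def filter_by_task_hash(stream):
    # Drop only tasks whose task hash has already been seen
    seen = set()
    for task in stream:
        h = task_hash(task)
        if h not in seen:
            seen.add(h)
            yield task

tasks = [
    {"text": "Apple buys U.K. startup", "spans": [{"start": 0, "end": 5}]},
    {"text": "Apple buys U.K. startup", "spans": [{"start": 11, "end": 15}]},
    {"text": "Apple buys U.K. startup", "spans": [{"start": 0, "end": 5}]},
]
unique = list(filter_by_task_hash(tasks))
# The first two tasks share an input hash but have distinct task
# hashes, so both survive; only the exact duplicate is dropped.
```

Filtering by input hash on the same data would keep just one of the three tasks.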


Hi Ines,

What would be the correct way of using the dedup argument if you don't want duplicates to be filtered out? I tried dedup=False but it still filtered all my examples from the stream and showed "No tasks available".

This is my code:

stream = get_stream(source, api=api, loader=loader, rehash=True,
                    dedup=False, input_key='text')

return {
    'dataset': dataset,   # save annotations in this dataset
    'view_id': 'choice',  # use the choice interface
    'stream': stream,
    'choice_auto_accept': True,
    'config': {'choice_style': 'multiple', 'show_stats': True},
}

Thank you!

Hi! Setting dedup=True will filter out identical examples in the stream – e.g. {"text": "hello"} and {"text": "hello"}. Do you have anything in your dataset already? By default, Prodigy will also filter out questions that have already been answered – so that's the effect you might be seeing here. If you don't want this behaviour, you can set "auto_exclude_current": False in the "config". This means that the current dataset won't be automatically excluded.