Adding data to a Prodigy dataset using db-in - is there a way to filter out/remove duplicate annotations?

I have a process whereby data from an external database is converted to a Prodigy-supported JSONL file, which is then fed into Prodigy via db-in. This process runs daily, and each time it runs, the same annotations get added to the dataset all over again, along with any records added to the external database since the last run.

Does Prodigy have the ability to skip importing what it identifies as duplicate annotations? I'm building the JSONL file programmatically, so I can structure it however I need to for Prodigy to recognize duplicate annotations. For example, would I need to generate hashes for the input and task when building the JSONL file, and does Prodigy know to skip re-adding the same annotations during db-in by looking at those values? The Prodigy documentation doesn't seem to cover deduplication per se; rather, it focuses on manually excluding annotations from recipes by specifying hashes.

I recognize this is an atypical use-case!

hi @jspinella!

Thanks for your question.

Yes! Have you seen the get_stream function? It's a generalized loader and can deduplicate if you set its dedup argument to True (it's False by default). It will then dedupe by whatever you specify as exclude_by in your configuration: by default that's the "task" hash, but it can be changed to the "input" hash.
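For example, loading your generated JSONL file could look roughly like this (a minimal sketch: the file name is a placeholder, and I'm assuming get_stream is imported from prodigy.components.loaders and accepts the loader, rehash and dedup arguments):

from prodigy.components.loaders import get_stream

# Load the JSONL file, recompute hashes, and drop duplicate tasks
stream = get_stream(
    "daily_export.jsonl",  # hypothetical path to your generated file
    loader="jsonl",
    rehash=True,           # re-set _input_hash / _task_hash on each example
    dedup=True,            # drop duplicates (by task hash, per exclude_by)
)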

Typically, most recipes load the file as a stream using get_stream and then pass that stream through the recipe. In your case, though, since you want to use db-in with deduplication, you could modify the existing db-in recipe by replacing its current loader (i.e., get_loader) with get_stream.

To do this, you can find the Python script for db-in by first finding the location of your Prodigy installation: run python -m prodigy stats and look at the Location: path. Open that folder, then look for the /recipes/commands.py script, where you'll see the recipe for db-in. I'd then recommend using the get_stream function to load your file.
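If it helps, here's a rough sketch of what a stripped-down, deduplicating db-in-style recipe could look like. This is not the actual commands.py code; the recipe name "db-in-dedup", the argument list and the default "accept" answer are just illustrative:

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import get_stream

@prodigy.recipe(
    "db-in-dedup",
    dataset=("Dataset to import into", "positional", None, str),
    source=("Path to the JSONL file", "positional", None, str),
)
def db_in_dedup(dataset, source):
    # Use get_stream instead of get_loader so duplicates are dropped
    stream = get_stream(source, loader="jsonl", rehash=True, dedup=True)
    db = connect()
    if dataset not in db:
        db.add_dataset(dataset)
    # Make sure each example has an answer before importing
    examples = [dict(eg, answer=eg.get("answer", "accept")) for eg in stream]
    db.add_examples(examples, datasets=[dataset])
    print(f"Imported {len(examples)} examples into '{dataset}'")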

If you open up db-in and view the recipe, you'll see where the script calls set_hashes to generate the hashes. One nice feature of get_stream is that you can rehash the examples by setting its rehash argument to True.

Have you seen the filter_duplicates function too? As that part of the docs mentions, it's how the deduplication is done in get_stream.

from prodigy.components.filters import filter_duplicates
from prodigy import set_hashes

stream = [{"text": "foo", "label": "bar"}, {"text": "foo", "label": "bar"}, {"text": "foo"}]
stream = [set_hashes(eg) for eg in stream]
stream = filter_duplicates(stream, by_input=False, by_task=True)
# [{'text': 'foo', 'label': 'bar', '_input_hash': ..., '_task_hash': ...}, {'text': 'foo', '_input_hash': ..., '_task_hash': ...}]
stream = filter_duplicates(stream, by_input=True, by_task=True)
# [{'text': 'foo', 'label': 'bar', '_input_hash': ..., '_task_hash': ...}]

Essentially, an alternative approach is simply to run filter_duplicates in your db-in after the hashes are set, as in the sketch below. I generally prefer get_stream because it's more general, but this works as well.
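A bare-bones import script using that approach might look like this (again just a sketch; the file name and dataset name are assumptions you'd adjust to your setup):

from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.filters import filter_duplicates
from prodigy.components.loaders import JSONL

stream = JSONL("daily_export.jsonl")          # hypothetical file name
stream = (set_hashes(eg) for eg in stream)    # make sure both hashes are set
stream = filter_duplicates(stream, by_input=False, by_task=True)

db = connect()
db.add_examples(list(stream), datasets=["my_dataset"])  # hypothetical dataset name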

Hopefully, this information should give you enough to dedup as you deem fit. Let me know if this works and if you have any further questions.


This is perfect, thank you Ryan!