hi @jspinella!
Thanks for your question.
Yes! Have you seen the get_stream function? It's a general-purpose loader and can deduplicate if you set its dedup argument to True (it's False by default). It'll then dedup by whatever you specify as exclude_by in your configuration: by default that's the "task" hash, but it can be changed to the "input" hash.
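For example, something like this (a minimal sketch; the import path and file name here are assumptions, so adjust for your Prodigy version and setup):

from prodigy.components.loaders import get_stream

# Load a JSONL file as a stream, (re)hash each example and drop
# duplicates according to your exclude_by setting ("task" by default)
stream = get_stream("./examples.jsonl", rehash=True, dedup=True)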
Typically, most recipes load the file as a stream using get_stream and then pass that stream through the recipe. In your case, where you want to use db-in with deduplication, you could modify the existing db-in recipe by replacing how it currently loads the file (i.e., via get_loader) with get_stream.
To do this, you can find the Python script for db-in by first finding the location of your Prodigy installation: run python -m prodigy stats and check the Location: path. Open that folder and look for the /recipes/commands.py script, where you'll see the recipe for db-in. I'd then recommend using the get_stream function to load your file.
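As a rough sketch of that change (simplified and hypothetical, not the actual recipe source; the variable names are illustrative):

from prodigy.components.loaders import get_stream

# Instead of the recipe's original get_loader-based loading,
# let get_stream load, hash and dedup in one go (illustrative names)
annotations = get_stream(in_file, rehash=True, dedup=True)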
If you open up db-in and view the recipe, you'll be able to see where the script calls set_hashes to generate the hashes. One nice feature of get_stream is that you can also rehash by setting its rehash argument to True.
Have you seen the filter_duplicates function too? As that part of the docs mentions, it's how the deduplication is done in get_stream.
from prodigy import set_hashes
from prodigy.components.filters import filter_duplicates

# Two exact duplicates, plus a third task sharing the same input text
stream = [{"text": "foo", "label": "bar"}, {"text": "foo", "label": "bar"}, {"text": "foo"}]
stream = [set_hashes(eg) for eg in stream]

# Dedup on the task hash only: exact duplicates are dropped
stream = filter_duplicates(stream, by_input=False, by_task=True)
# [{'text': 'foo', 'label': 'bar', '_input_hash': ..., '_task_hash': ...}, {'text': 'foo', '_input_hash': ..., '_task_hash': ...}]

# Dedup on the input hash too: tasks with the same input text are dropped
stream = filter_duplicates(stream, by_input=True, by_task=True)
# [{'text': 'foo', 'label': 'bar', '_input_hash': ..., '_task_hash': ...}]
Essentially, an alternative approach is to simply run filter_duplicates in your db-in after the hashes are set. I generally prefer using get_stream since it's more general, but either works.
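For instance, a rough sketch of that alternative inside your copy of the recipe (the annotations variable name is illustrative, not the recipe's actual one):

from prodigy import set_hashes
from prodigy.components.filters import filter_duplicates

# After db-in has loaded the examples, hash and dedup them
annotations = (set_hashes(eg) for eg in annotations)
annotations = filter_duplicates(annotations, by_input=False, by_task=True)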
Hopefully this gives you enough to dedup as you see fit. Let me know if this works and if you have any further questions.