hi @jspinella!
Thanks for your question.
Yes! Have you seen the `get_stream` function? It's a generalized loader and can deduplicate by setting its `dedup` argument to `True` (it's `False` by default). It will then dedup by whatever you specify as `exclude_by` in your configuration. By default, that's the `"task"` hash, but it can be changed to the `"input"` hash.
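Here's a minimal sketch of that (assuming a JSONL source file; note that the exact import path for `get_stream` can vary between Prodigy versions):

```python
from prodigy.components.loaders import get_stream

# Load the file as a stream, rehash the examples, and drop duplicates.
# With dedup=True, duplicates are filtered by the hash named in your
# "exclude_by" setting ("task" by default, or "input").
stream = get_stream("./examples.jsonl", rehash=True, dedup=True)
```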
Typically, most recipes load the file as a stream using `get_stream` and then pass that stream through the recipe. In your case, since you want to use `db-in` with deduplication, you could modify the existing `db-in` recipe by replacing how it currently loads the file (i.e., via `get_loader`) with `get_stream`.
To do this, you can find the Python script for `db-in` by first finding the location of your Prodigy installation: run `python -m prodigy stats` and look for the `Location:` path. Open that folder, then look for the `/recipes/commands.py` script, where you'll see the recipe for `db-in`. I'd then recommend using the `get_stream` function to load your file.
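As a rough sketch of the swap (illustrative only; the actual recipe code in `commands.py` differs by version, and the variable names here are hypothetical):

```python
from prodigy.components.loaders import get_stream

# Before (roughly): the recipe resolves a loader and calls it on the source
# loader_fn = get_loader(loader)
# stream = loader_fn(source)

# After: load the source with get_stream to get rehashing and dedup built in
stream = get_stream(source, rehash=True, dedup=True)
```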
If you open up `db-in` and view the recipe, you'll be able to see where the script calls `set_hashes` to generate the hashes. One nice feature of `get_stream` is that it can rehash the examples for you if you set its `rehash` argument to `True`.
Have you seen the `filter_duplicates` function too? As that part of the docs mentions, it's how the deduplication is done in `get_stream`:
```python
from prodigy.components.filters import filter_duplicates
from prodigy import set_hashes

stream = [{"text": "foo", "label": "bar"}, {"text": "foo", "label": "bar"}, {"text": "foo"}]
stream = [set_hashes(eg) for eg in stream]

stream = filter_duplicates(stream, by_input=False, by_task=True)
# [{'text': 'foo', 'label': 'bar', '_input_hash': ..., '_task_hash': ...}, {'text': 'foo', '_input_hash': ..., '_task_hash': ...}]

stream = filter_duplicates(stream, by_input=True, by_task=True)
# [{'text': 'foo', 'label': 'bar', '_input_hash': ..., '_task_hash': ...}]
```
Essentially, an alternative approach is simply to run `filter_duplicates` in your `db-in` recipe after the hashes are set. I generally prefer `get_stream` as it's more general, but this works as well.
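For instance, a minimal sketch of that alternative inside a copied `db-in` recipe (the `stream` value here is a placeholder standing in for however the recipe loads your examples):

```python
from prodigy import set_hashes
from prodigy.components.filters import filter_duplicates

# Placeholder: stands in for the stream the recipe has already loaded
stream = [{"text": "foo", "label": "bar"}, {"text": "foo", "label": "bar"}]

# Set the hashes first, then drop duplicate tasks before importing
stream = (set_hashes(eg) for eg in stream)
stream = filter_duplicates(stream, by_input=False, by_task=True)
```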
Hopefully this gives you enough to dedup as you see fit. Let me know if this works and if you have any further questions.