Excluding examples in a new database that are present in another one


What is the best approach for excluding text examples from a brand new database when those examples are already present (i.e. have been annotated before) in another database? For instance, I have a JSON lines file with 100 examples. I create database_1 and annotate 20 examples. I then create a new database_2 and want to use the same input file but exclude the 20 examples already in database_1.


Hi @ale,

You can filter out inputs that are already in another dataset by adding a dedicated stream preprocessing step to your recipe. We have a helper function for this purpose: filter_inputs. This function filters on the _input_hash attribute. As long as the input for database_1 was read with the same method as the input for database_2 (e.g. get_stream), so that the _input_hash is assigned in the same way, it should work fine.
In your recipe, after reading the input stream, you could add the following:

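# extra imports
from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

stream = get_stream("source_jsonl.jsonl")  # this will assign input hashes
db = connect()
# get the input hashes of the examples already annotated in database_1
# (get_input_hashes takes dataset names as separate arguments, not a list)
input_hashes_to_exclude = db.get_input_hashes("database_1")
# drop every incoming example whose _input_hash is in that collection
stream = filter_inputs(stream, input_hashes_to_exclude)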
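
If it helps to see what filtering on the input hash amounts to, here is a minimal, Prodigy-free sketch of the same idea. The names toy_input_hash and exclude_seen are hypothetical and made up for this illustration, and the MD5-of-text hash is only a stand-in: it is not how Prodigy actually computes _input_hash. The data mirrors the scenario from the question (100 examples, 20 already annotated).

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set


def toy_input_hash(example: Dict) -> str:
    """Hypothetical stand-in for an input hash: depends only on the text."""
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()


def exclude_seen(stream: Iterable[Dict], seen: Set[str]) -> Iterator[Dict]:
    """Lazily yield only the examples whose input hash is not in `seen`."""
    for example in stream:
        if toy_input_hash(example) not in seen:
            yield example


# 100 source examples; pretend the first 20 were annotated in database_1
source = [{"text": f"example {i}"} for i in range(100)]
annotated = source[:20]

# hashes of everything that has already been annotated
seen_hashes = {toy_input_hash(eg) for eg in annotated}

# what would be sent out for annotation in database_2
remaining = list(exclude_seen(source, seen_hashes))
print(len(remaining))  # 80
```

Because the hash in this sketch is computed from the text alone, an example is skipped whenever its text has been seen before, which is the behaviour you want when reusing the same input file for a second dataset. It also shows why both inputs need to be hashed the same way: if the two sides were hashed differently, nothing would match and nothing would be filtered out.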