Excluding examples in a new database that are present in another one

ale · May 30, 2024, 7:00am

Hi,

What is the best approach for excluding text examples from a brand new database when those examples are also present (have been annotated before) in another database? For instance, I have a JSON lines file with 100 examples. I create database_1 and annotate 20 examples. I then create a new database_2 and want to use the same input file but exclude the 20 examples already in database_1.

Thanks

magdaaniol · May 30, 2024, 4:23pm

Hi @ale,

You can filter out inputs that are already in another dataset by adding a dedicated stream preprocessing function to your recipe. We have a helper function for this purpose: filter_inputs. This function filters on the _input_hash attribute. As long as the input for database_1 was read with the same method as the input for database_2, e.g. get_stream (concretely, the same method of assigning the _input_hash), it should work fine.
In your recipe, after reading the input stream, you could add the following:

# extra imports
from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs
(...)
stream = get_stream("source_jsonl.jsonl") # this will assign input hashes
db = connect()
# get input hashes to be excluded
input_hashes_to_exclude = db.get_input_hashes("database_1")  # dataset names are passed as positional arguments
stream = filter_inputs(stream, input_hashes_to_exclude)
(...)
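
To make the mechanism concrete, here is a minimal standalone sketch of hash-based filtering that runs without Prodigy. Everything in it is an assumption for illustration: toy_input_hash, add_hashes and drop_seen are hypothetical names, and the MD5-based hash is only a stand-in, not the way Prodigy actually computes _input_hash. The point it shows is that filtering only works when both datasets get their hashes from the same function applied to the same input text.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set


def toy_input_hash(example: Dict) -> int:
    """Hypothetical stand-in for an input hash: depends only on the text."""
    digest = hashlib.md5(example["text"].encode("utf-8")).hexdigest()
    return int(digest, 16)


def add_hashes(examples: Iterable[Dict]) -> Iterator[Dict]:
    """Attach an _input_hash to every example, as a loader would."""
    for eg in examples:
        yield {**eg, "_input_hash": toy_input_hash(eg)}


def drop_seen(stream: Iterable[Dict], seen: Set[int]) -> Iterator[Dict]:
    """Yield only the examples whose input hash is not in `seen`."""
    for eg in stream:
        if eg["_input_hash"] not in seen:
            yield eg


# 100 source examples, the first 20 of which were annotated in "database_1"
source = [{"text": f"example {i}"} for i in range(100)]
annotated = [{**eg, "answer": "accept"} for eg in add_hashes(source[:20])]

# Collect the hashes to exclude, then filter the fresh stream for "database_2"
seen_hashes = {eg["_input_hash"] for eg in annotated}
remaining = list(drop_seen(add_hashes(source), seen_hashes))

print(len(remaining))  # 80
```

Note that the extra "answer" key on the annotated examples does not affect the filter, because the hash is computed from the input text alone. If the two streams were hashed differently (for instance from different keys), none of the 20 examples would be recognised as already seen.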
