Proper way to add new samples to an existing database

ryanwesslen · June 30, 2022, 8:25pm

Yes, your assumption is correct: you can simply append the new text to the same database. You're also right that this is where the hashes come in, since they are used to detect duplicates (at either the input or the task level). The one exception is if you created the labels somewhere else and then imported them with db-in, in which case they may not have the correct hashes.

It's also worth noting the exclude_by setting in the configuration file. It determines which hash ("task" or "input") is used to decide whether two examples are the same, and therefore whether an incoming example should be excluded because it's already in the dataset. The default value is "task".
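To make the difference between the two hashes concrete, here is a minimal, self-contained sketch of hash-based exclusion. It does not use Prodigy itself: toy_hash, add_hashes and filter_new are hypothetical helpers, and the MD5-over-JSON hashing is only a stand-in for whatever Prodigy does internally (in a real recipe you would normally let Prodigy assign hashes, for example via its set_hashes helper). Only the _input_hash and _task_hash key names and the "task"/"input" values are taken from Prodigy.

```python
import hashlib
import json


def toy_hash(example, keys):
    """Stable hash over selected keys (illustrative, NOT Prodigy's algorithm)."""
    payload = json.dumps({k: example.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(payload.encode("utf8")).hexdigest()


def add_hashes(example):
    # Input hash: the raw content only.
    example["_input_hash"] = toy_hash(example, ["text"])
    # Task hash: the content plus the question asked about it (here, the label).
    example["_task_hash"] = toy_hash(example, ["text", "label"])
    return example


def filter_new(incoming, existing, exclude_by="task"):
    """Drop incoming examples whose chosen hash is already in the dataset."""
    key = f"_{exclude_by}_hash"
    seen = {eg[key] for eg in existing}
    return [eg for eg in incoming if eg[key] not in seen]


existing = [add_hashes({"text": "Hello world", "label": "GREETING"})]
incoming = [
    add_hashes({"text": "Hello world", "label": "GREETING"}),  # same input, same task
    add_hashes({"text": "Hello world", "label": "SPAM"}),      # same input, new task
    add_hashes({"text": "Brand new text", "label": "SPAM"}),   # new input
]

kept_by_task = filter_new(incoming, existing, exclude_by="task")
kept_by_input = filter_new(incoming, existing, exclude_by="input")
```

With this toy data, excluding by "task" keeps the last two incoming examples, while excluding by "input" keeps only the one with new text. So if you want previously seen texts never to come up again, regardless of the label being asked about, "input" is probably the setting you want.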

What recipe/interface would that reviewer use? I ask because the review recipe requires multiple datasets (e.g., in-sets). It is really meant for reviewing and merging annotations across different datasets, by reviewer, so it may not be the one you had in mind, since it sounded like you would only have one dataset.

You could also create a custom recipe, which would give you full control. Recipe components like validate_answer, before_db, get_session_id, or exclude let you implement your own logic.
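As a rough sketch of what such a recipe might return, the function below builds the components dictionary using some of the hooks mentioned above. The function name add_samples_recipe, the "classification" view, the validation rule and the source_batch field are all assumptions for illustration. In a real recipe the function would be decorated with @prodigy.recipe and would load its own stream; that part is left out here so the snippet runs without Prodigy installed.

```python
def validate_answer(eg):
    # Called on each submitted answer; a failed assertion rejects the answer.
    # Hypothetical rule: accepted examples must carry a label.
    if eg.get("answer") == "accept":
        assert eg.get("label"), "Accepted examples need a label."


def before_db(examples):
    # Called on a batch of answers right before they are saved to the database.
    for eg in examples:
        eg["source_batch"] = "2022-06"  # hypothetical bookkeeping field
    return examples


def add_samples_recipe(dataset, stream):
    # In a real recipe: decorate with @prodigy.recipe("add-samples", ...).
    return {
        "dataset": dataset,             # dataset the answers are saved to
        "stream": stream,               # iterable of incoming examples
        "view_id": "classification",    # annotation interface (assumed here)
        "validate_answer": validate_answer,
        "before_db": before_db,
        "exclude": [dataset],           # skip examples already in this dataset
        "config": {"exclude_by": "input"},
    }


components = add_samples_recipe("my_dataset", iter([]))
```

Listing the target dataset under "exclude" is what makes re-running the recipe on a larger input file safe: examples already annotated there are filtered out according to exclude_by.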

Since it seems like you may have multiple annotators, have you seen how to set up multi-user sessions by assigning unique session_ids? By default, Prodigy creates timestamped session_ids. Setting custom session_ids helps you name and keep track of different users, especially if they're all saving into the same dataset. If not, here's a good post with some helpful tips:

Thanks for your question!
