Proper way to add new samples to an existing database

Hi, I am currently using ner.manual to annotate samples, using a text file as input. I recently trained a model on these annotations and found that it performs poorly; I believe it needs more examples (I'm using < 500 samples with the en_core_web_trf model).
I have two questions:

  1. What is the proper procedure to add new samples to the db without losing the current annotations?
  2. If a reviewer has already reviewed annotations from the aforementioned db, would updating the first database affect the review task (i.e., would the reviewer get the new samples after the annotators finish annotating them)?

My assumption is that we would just append the new text to the input text file, and Prodigy would figure out (using hashes) which texts have already been annotated and which are new, adding only the new ones to the annotators' task. If the procedure is different, could you describe it?

hi @klopez!

Yes, your assumption is correct: you can just append the new text to the same input file and keep annotating into the same dataset. You're right, this is where the hashes come in to detect duplicates (at either the input or the task level). The one exception may be if you created the annotations somewhere else and then uploaded them (db-in); in that case they may not have the correct hashes.
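If you do run into that case, a small script can add the hashes before importing. Here's a minimal sketch, assuming the external annotations live in a JSONL file; the file and dataset names are placeholders:

```python
import srsly
from prodigy import set_hashes

# Load externally created annotations (hypothetical file name)
examples = list(srsly.read_jsonl("external_annotations.jsonl"))

# set_hashes adds "_input_hash" (derived from the text) and "_task_hash"
# (derived from the text plus annotations such as spans and label)
examples = [set_hashes(eg, overwrite=True) for eg in examples]

srsly.write_jsonl("external_annotations_hashed.jsonl", examples)
# then import as usual, e.g.: prodigy db-in ner_dataset external_annotations_hashed.jsonl
```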

It may also be important to note the exclude_by setting in the configuration file. It controls which hash ("task" or "input") is used to decide whether two examples are the same and whether an incoming example should be excluded because it's already in the dataset. By default the value is set to "task".
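To illustrate the difference with a made-up example: two tasks with the same text but different spans share an input hash while getting different task hashes, so exclude_by: "input" would treat the second as a duplicate, while the default "task" would keep both:

```python
from prodigy import set_hashes

# Same text, different entity spans
a = set_hashes({"text": "Apple hired Jane Doe.",
                "spans": [{"start": 0, "end": 5, "label": "ORG"}]})
b = set_hashes({"text": "Apple hired Jane Doe.",
                "spans": [{"start": 12, "end": 20, "label": "PERSON"}]})

print(a["_input_hash"] == b["_input_hash"])  # True  -> deduplicated with exclude_by: "input"
print(a["_task_hash"] == b["_task_hash"])    # False -> both kept with the default "task"
```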

What recipe/interface would that reviewer use? I ask because the review recipe expects one or more input datasets (in-sets); it is really meant to review and merge annotations across different datasets, so it may not be the one you were thinking of, since it sounded like you would have only one dataset.

You could also create a custom recipe to get full control. Callbacks like validate_answer, before_db, get_session_id, or exclude let you implement your own logic.
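For example, here's a rough sketch (not a drop-in recipe) of a custom NER recipe with a before_db callback; the recipe name, file names, and the callback's behavior are placeholders showing where your own logic would go:

```python
import spacy
from prodigy import recipe
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@recipe("ner.manual.custom")
def ner_manual_custom(dataset, source, label):
    nlp = spacy.blank("en")
    stream = JSONL(source)            # tasks like {"text": "..."}
    stream = add_tokens(nlp, stream)  # tokenize for the ner_manual interface

    def before_db(examples):
        # example of custom logic: strip a bulky "meta" field before saving
        for eg in examples:
            eg.pop("meta", None)
        return examples

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "before_db": before_db,
        "config": {"labels": label.split(",")},
    }
```

You'd run it with something like `prodigy ner.manual.custom my_dataset examples.jsonl PERSON,ORG -F recipe.py`.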

Since it seems like you may have multiple annotators, have you seen how to set up multi-user sessions by assigning unique session_ids? By default, Prodigy creates timestamped session_ids; setting custom session_ids is helpful for naming and keeping track of different users (especially if you're saving them into the same dataset). If so, there's also a good post with some helpful tips on this.
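For completeness, here's a small sketch (names and port are made up) of pre-defining the allowed named sessions via the PRODIGY_ALLOWED_SESSIONS setting and starting the server from Python; annotators would then open e.g. http://localhost:8080/?session=alice:

```python
import os
from prodigy import serve

# Restrict named sessions to a fixed set of annotators (hypothetical names)
os.environ["PRODIGY_ALLOWED_SESSIONS"] = "alice,bob"

# Answers are stored with "_session_id" values like "my_dataset-alice"
serve("ner.manual my_dataset blank:en data.jsonl --label PERSON,ORG", port=8080)
```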

Thanks for your question!

Thank you for your response, @ryanwesslen, this is excellent. Just to clarify: I assume that if I set exclude_by to "input", examples will be excluded by the hash of the text. So I would just append the new text and it would add new "tasks" for the annotator, and in theory, if I added the same text again, it would not show up in the annotator's task list.

Reviewer background:
For the reviewer, the review recipe can take a single in-set. It shows the differences between the two annotators and lets the reviewer adjudicate between them. I have set up the two annotators with their own sessions for ease of identification. We use this review task to produce a gold-standard dataset to train a model on later. Currently, our annotators have each annotated every sample (two annotators per sample), and our reviewer has reviewed and adjudicated all the samples and generated a gold dataset.

Question: The crux of this is: if the annotators now have new samples to work on and they complete the task, will the reviewer (after the annotators finish, of course) get only the new samples to review, or will we have to restart the entire review task?

Okay, I have now tested this on a dummy db. I did the following (a rough sketch of the equivalent commands follows the list):

  1. I set the exclude_by setting to "input" and set feed_overlap to true.
  2. I created a new text file with some sentences and started a ner.manual task with labels.
  3. I then annotated it with 2 annotators.
  4. I then started a review task to adjudicate between the annotators.
  5. I then appended text to the original text file.
  6. The annotators only saw the new text in their tasks!
  7. I then ran the review on the newly updated db to see if it would make me review everything again. It only loaded the new samples.
  8. I then took the input file, copied its contents into a new file, added some new sentences, and used this as the input to the task (same command, just a different input file name containing the same data).
  9. This showed the annotators only the new samples.
  10. Same with the review task: only new samples.
  11. I then tried adding duplicate samples to the text file to see if they would be added to the annotators' task. They were not!
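For reference, here's a rough sketch of the commands behind those steps, written with prodigy.serve (each call blocks until you stop the server, so you'd run them one at a time). All dataset/file names, labels, and ports are placeholders, and exclude_by / feed_overlap are passed as config overrides here rather than via prodigy.json:

```python
from prodigy import serve

# Steps 1-3: annotate sentences.txt; annotators use ?session=anno1 / ?session=anno2
serve(
    "ner.manual ner_data blank:en sentences.txt --label PERSON,ORG",
    exclude_by="input",
    feed_overlap=True,
    port=8080,
)

# Step 4 (after annotation): adjudicate both sessions into a gold dataset
# serve("review gold_ner ner_data --view-id ner_manual", port=8081)

# Steps 5-11: append new sentences to sentences.txt (or a copy of it) and
# re-run the same two commands; the input hashes mean only new texts are queued.
```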

Thank you @ryanwesslen for all the help!


Excellent! I really like your approach. If I find time, I may try to create a mermaid diagram of the workflow and repost. This would be an excellent workflow to document -- especially if you have inter-annotator agreement too :smiley: !

Thank you!
