Annotated Dataset and NER task with Prodigy


I'm new to using Prodigy and have a few questions. Our team is working on a task to extract detoxification events from clinical notes written by healthcare providers. We define a custom label for each detox event, and we are approaching this as a named entity recognition task that automatically labels certain events based on SME guidance.

Currently, we are working with 15095 snippets. We've had a SME use Prodigy with the following command:

prodigy ner.manual detox_event_extraction blank:en data_detox.jsonl --label <labels>

The actual labels we use are not important. The SME has annotated 1000 of the 15095 snippets and saved them to a database file called detox_event_extraction.db. I've extracted those 1000 annotated snippets into a separate JSONL file, annotated_snippets.jsonl. I have the following questions:

  1. What information does the database actually hold? Specifically, does it hold only the annotated snippets or does it hold the entire dataset?
  2. Due to some issues, I had to delete the database file. But I have the 1000 annotated snippets (annotated_snippets.jsonl) and the original data file (data_detox.jsonl). How do I recreate the database file using these files so that if needed the SME can continue annotating from snippet 1001?

The workflow that I'd like to follow is similar to the ingredients NER video by Ines Montani. Specifically, I want to use 750 of the 1000 annotated snippets to train a spaCy model (with the other 250 for evaluation) and then use the ner.correct recipe with the SME to annotate a further 1000 notes. Eventually, I want to use ner.teach for active learning with the SME to train the model on another few thousand notes. I'm saving 5000+ snippets as a final test set. I have the following questions:

  1. Is what I have described a typical workflow for the task that I want done?
  2. I'm having a hard time understanding how the database file incorporates all the annotations provided by ner.manual, ner.correct, and ner.teach. Do I save different database files for each session?
  3. Does the model continuously update with ner.correct? If not, do I just train a new model with updated annotations?
  4. Does the model continuously update with ner.teach? Is there an example of how I could go about doing this?

I apologize for the long post and would appreciate any help regarding these questions.


Hi @sudarshan !

The database only holds the annotated examples. Once you've finished annotating and saved your annotations, each example is stored in the db together with its answer (whether you accepted, rejected, or ignored it).

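You can use the db-in command to import annotated_snippets.jsonl back into a dataset with the same name. When you then restart the recipe with the original file, Prodigy will skip examples that are already in the dataset, so the SME can just keep annotating :slight_smile: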
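If you'd like to sanity-check how many snippets are left before re-importing (the import itself would be something like `prodigy db-in detox_event_extraction annotated_snippets.jsonl`), here is a minimal Python sketch. It assumes every record in both files has a "text" key and compares on that text; Prodigy does its own filtering with hashes, so this is only a rough check and not what Prodigy runs internally. The helper names `read_jsonl` and `remaining_snippets` are made up for illustration.

```python
import json


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def remaining_snippets(all_tasks, annotated_tasks):
    """Return the source tasks whose text does not appear in the annotated tasks.

    Assumption: each task is a dict with a "text" key.
    """
    seen = {task["text"] for task in annotated_tasks}
    return [task for task in all_tasks if task["text"] not in seen]
```

With your files, `remaining_snippets(read_jsonl("data_detox.jsonl"), read_jsonl("annotated_snippets.jsonl"))` should return roughly 14095 records, provided the snippet texts are unique.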

Oops my reply got cut, here are the answers for the next set of questions :slight_smile: Thanks for listing them in an accessible manner, anyway! :smiley:

The typical workflow is to annotate everything, then do a final training run with all of your annotated data. Ideally, we'd only use ner.teach to improve the quality-of-life of our annotation process; we wouldn't use it to train the model that goes to production.

Your workflow is OK if you're just going to annotate, but in the end, you'd want to use all those annotations from ner.correct and ner.teach to train a final model (which you can conveniently do via prodigy train).
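For the 750/250 split you mentioned, a small sketch of one way to do it by hand is below. It shuffles with a fixed seed so the held-out set stays the same between runs. The function name `split_train_eval` and its defaults are my own choices for illustration, and prodigy train can also hold out part of the data for evaluation itself, so treat this as optional.

```python
import random


def split_train_eval(examples, n_eval=250, seed=0):
    """Shuffle a copy of the examples and hold out n_eval of them for evaluation.

    Returns a (train, eval) pair; the input list is left untouched.
    """
    if not 0 < n_eval < len(examples):
        raise ValueError("n_eval must be between 1 and len(examples) - 1")
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]
```

Calling it on the 1000 annotated records gives 750 for training and 250 for evaluation. Keeping the seed fixed matters here, since the evaluation set should not change as more annotations are added to the training side.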

There's only one database and one database file, but within that you have datasets (collection of annotations). Typically, you'd use different datasets for different annotation experiments and types (manual, binary, etc.). You can then train from multiple datasets later on.

You can check the --update parameter for ner.correct; it gives you the option to update your model in the loop. But in the end, yes, you'd still want to train a new model (from scratch) using the updated annotations.

Yes, ner.teach does update the model in the loop, but it's better to train again from scratch (with the updated annotations) for your production/final model. With that, you get all the conveniences of setting training hyperparameters, refining your config, etc.

Hi Lj Miranda,

Thank you for your very thorough answer. I will try out your suggestions and report back with any questions I come up with.

While I understand the workflow that you describe, I have a non-typical situation where the model wouldn't really go into production. This is a proof-of-concept presentation of Prodigy and its capabilities in a very specific clinical concept setting. In the end, my "product" will be detailed instructions tailored to the specifications of a task so that SMEs with minimal expertise in ML and NLP may be able to use Prodigy for their task.

I'm still having a hard time understanding the difference between a database and a dataset. In the video I linked, Ines recommends saving different databases for different parts of the workflow. In that case, will each database have different datasets? Forgive me for my ignorance on this issue.

Hi @sudarshan

  • You can think of a dataset as a "collection of annotations." If you're using Prodigy with its default parameters / configuration, all your datasets will just reside in a single database.
  • A database, on the other hand, can be likened to a central storage for your datasets. This can be a SQLite, MySQL, etc. database, depending on your config. By default, it holds all your datasets. It's a one-to-many relationship :slight_smile:
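To make the one-to-many relationship concrete, here is a toy sketch using Python's built-in sqlite3 module. The table layout and the dataset names (detox_manual, detox_correct) are invented for illustration and are not Prodigy's actual schema; the only point is that a single database can hold several named datasets, each with its own annotations.

```python
import sqlite3

# Toy schema for illustration only; this is NOT Prodigy's real table layout.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        dataset_id INTEGER REFERENCES dataset(id),
        text TEXT,
        answer TEXT
    );
    """
)


def add_annotation(conn, dataset_name, text, answer):
    """Store one annotation under a named dataset, creating the dataset if needed."""
    conn.execute("INSERT OR IGNORE INTO dataset (name) VALUES (?)", (dataset_name,))
    (dataset_id,) = conn.execute(
        "SELECT id FROM dataset WHERE name = ?", (dataset_name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO annotation (dataset_id, text, answer) VALUES (?, ?, ?)",
        (dataset_id, text, answer),
    )


def count_by_dataset(conn):
    """Return a {dataset name: number of annotations} mapping, sorted by name."""
    rows = conn.execute(
        "SELECT d.name, COUNT(a.id) FROM dataset d "
        "LEFT JOIN annotation a ON a.dataset_id = d.id "
        "GROUP BY d.name ORDER BY d.name"
    )
    return dict(rows.fetchall())


# One database, two hypothetical datasets from different annotation sessions.
add_annotation(conn, "detox_manual", "snippet one", "accept")
add_annotation(conn, "detox_manual", "snippet two", "reject")
add_annotation(conn, "detox_correct", "snippet three", "accept")
print(count_by_dataset(conn))
```

So for your workflow you would keep one database and give each stage (ner.manual, ner.correct, ner.teach) its own dataset name, then train from several datasets together.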