Annotated Dataset and NER task with Prodigy

Hello,

I'm new to using Prodigy and have a few questions. Our team is working on a task to extract detoxification events from clinical notes written by healthcare providers. We define a custom label for each detox event, and we're approaching this as a named entity recognition task for automatically labeling certain events based on SME guidance.

Currently, we are working with 15095 snippets. We've had an SME use Prodigy with the following command:

prodigy ner.manual detox_event_extraction blank:en data_detox.jsonl --label <labels>

The actual labels we use are not important. The SME has annotated 1000 snippets out of the 15095 snippets, which were saved to a database file called detox_event_extraction.db. I've extracted those 1000 annotated snippets into a separate JSONL file, annotated_snippets.jsonl. I have the following questions:

  1. What information does the database actually hold? Specifically, does it hold only the annotated snippets, or does it hold the entire dataset?
  2. Due to some issues, I had to delete the database file. But I have the 1000 annotated snippets (annotated_snippets.jsonl) and the original data file (data_detox.jsonl). How do I recreate the database file using these files so that, if needed, the SME can continue annotating from snippet 1001?

The workflow that I'd like to follow is similar to the ingredients NER video by Ines Montani. Specifically, I want to use 750 of the 1000 annotated snippets to train a spaCy model (250 for evaluation) and then use the ner.correct recipe with the SME to annotate a further 1000 notes. Eventually, I want to use ner.teach for active learning with the SME to train the model on another few thousand notes. I'm holding out a final 5000+ snippets as a test set. I have the following questions:

  1. Is what I have described a typical workflow for the task that I want done?
  2. I'm having a hard time understanding how the database file incorporates all the annotations provided by ner.manual, ner.correct, and ner.teach. Do I save different database files for each session?
  3. Does the model continuously update with ner.correct? If not, do I just train a new model with updated annotations?
  4. Does the model continuously update with ner.teach? Is there an example how I could go about doing this?

I apologize for the long post and would appreciate any help regarding these questions.

Thanks!

Hi @sudarshan !

It only holds the annotated dataset. Whenever you're done annotating and have saved your annotations, their values (whether you accepted, rejected, or ignored them) will be saved into the db.

You can use the db-in command. Prodigy will skip annotations that were already in the dataset and you can just keep annotating with the original file :slight_smile:
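For example, re-importing your exported annotations could look roughly like this (using the dataset and file names from your post):

prodigy db-in detox_event_extraction annotated_snippets.jsonl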

Oops my reply got cut, here are the answers for the next set of questions :slight_smile: Thanks for listing them in an accessible manner, anyway! :smiley:

The typical workflow is to annotate everything, then do a final training with all of your annotated data. Ideally, we'd only use ner.teach to improve the quality of life of our annotation process; we wouldn't use it to train a model that goes to prod.

Your workflow is OK if you're just going to annotate, but in the end, you'd want to use all those annotations from ner.correct and ner.teach to train a final model (which you can conveniently do via prodigy train).

There's only one database and one database file, but within that you have datasets (collection of annotations). Typically, you'd use different datasets for different annotation experiments and types (manual, binary, etc.). You can then train from multiple datasets later on.
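For example, training from two datasets at once could look roughly like this (the dataset and output names here are just placeholders, using the same train syntax as elsewhere in this thread):

prodigy train ner my_manual_dataset,my_correct_dataset blank:en --output my_final_model --eval-split 0.2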

You can check the --update parameter for ner.correct; it gives you the option to update your model in the loop. But in the end, yes, you'd still want to train a new model (from scratch) using the updated annotations.
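For example, following the command pattern used elsewhere in this thread (dataset and model names are placeholders):

prodigy ner.correct my_correct_dataset my_trained_model data_detox.jsonl --label <labels> --update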

Yes, it does update, but it's better if you train again from scratch (with the updated annotations) for your production/final model. With that, you get all the conveniences of setting training hyperparameters, refining your config, etc.
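For completeness, a ner.teach session could look roughly like this (dataset and model names are placeholders; --exclude keeps already-annotated examples out of the stream):

prodigy ner.teach my_teach_dataset my_trained_model data_detox.jsonl --label <labels> --exclude my_manual_dataset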

Hi Lj Miranda,

Thank you for your very thorough answer. I will try out your suggestions and report back with any questions I come up with.

While I understand the workflow that you describe, I have a non-typical situation where the model wouldn't really go into production. This is a proof-of-concept presentation of Prodigy and its capabilities in a very specific clinical concept setting. In the end, my "product" will be detailed instructions tailored to the specifications of a task so that SMEs with minimal expertise in ML and NLP may be able to use Prodigy for their task.

I'm still having a hard time understanding the difference between a database and a dataset. In the video I linked, Ines recommends saving different databases for different parts of the workflow. In that case, will each database have different datasets? Forgive me for my ignorance on this issue.

Hi @sudarshan

  • You can think of a dataset as a "collection of annotations." If you're using Prodigy with its default parameters/configuration, all your datasets will just reside in a single database.
  • A database, on the other hand, can be likened to a central storage of your datasets. This can be an SQLite, MySQL, etc. database, depending on your config (see the sketch below). They hold all your datasets by default. It's a one-to-many relationship :slight_smile:
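For reference, the database backend is set in your prodigy.json. An SQLite setup looks roughly like this (the file name and path here are just examples):

{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "~/.prodigy"
    }
  }
}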

I'm posting an update here for this task along with a few questions based on some hiccups I've been having.
As mentioned in the first post this is a NER task to extract clinical concepts (specifically concepts relating to detox) from medical notes. I can't share the note details here but I'll share relevant task details. I used the video example I linked earlier to guide my process.

detox_data.jsonl -- the original file containing the clinical notes; it contains 15,895 lines, corresponding to that many snippets
detox_event_extraction -- the dataset in the database that was created when the SME manually annotated the snippets

  1. I launched Prodigy with ner.manual, using detox_data.jsonl and a blank:en model, to have the SME annotate 1000 documents (over multiple sessions) and save them to the dataset detox_event_extraction, using the following command:
prodigy ner.manual detox_event_extraction blank:en detox_data.jsonl --label <labels>
  2. I extracted the annotated dataset to a new JSONL file using the command:
prodigy db-out detox_event_extraction > annotated_snippets.jsonl
  3. For reasons that are not important to this topic, I had to drop the detox_event_extraction dataset, and I also ended up deleting the database file detox_event_extraction.db. After some messing around, and with the help of the post here, I was able to get another database file and re-added the annotated snippets to a new dataset called dee_manual_1000.
  4. I then trained an NER model using the scispacy model as a starting point:
prodigy train ner dee_manual_1000 en_core_sci_lg --output dee_manual_1000_model --eval-split 0.2

The training took a while and the results were not very good, with an F1 score of only 20.
  5. I ran the train-curve command (see the sketch below), which showed good improvements as more data was added. So my next step is to have the SME annotate more documents and iteratively train and annotate to get acceptable performance.
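The train-curve run was along these lines (same dataset and base model as the training command above; my exact flags may have differed slightly):

prodigy train-curve ner dee_manual_1000 en_core_sci_lg --n-samples 4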

According to the tutorial video, my next step was to launch Prodigy with ner.correct:

prodigy ner.correct dee_correct_2000 dee_manual_1000_model detox_data.jsonl --label <labels> --exclude dee_manual_1000

I would like to point out three things:

  1. I want to save the new annotations in a separate dataset, dee_correct_2000, as suggested by the tutorial
  2. I'm using the already trained model dee_manual_1000_model
  3. I would like to exclude the already annotated snippets, hence I've added the --exclude option with the appropriate dataset name

This is the point where I'm running into a couple of issues, and I have the following questions:

  1. ner.correct seems to be doing sentence segmentation and showing only one sentence at a time. I would like it to show the whole document as it would in ner.manual. How do I achieve this?
  2. Despite adding the --exclude option to the original command, I see samples from the original annotations, which gives a feeling of "starting from scratch", and unfortunately SME time is valuable. Why is this happening?
  3. What is the sequence of documents that is displayed when running ner.manual vs ner.correct? Do they follow the same sequence as presented in the JSONL file? I know that ner.manual does, but I'm not sure about ner.correct.

Thank you for reading this long post, and thanks for any help that is provided!

hi @sudarshan85!

Sorry for the delay. We're trying to close out old tickets.

By default, ner.correct does sentence segmentation (unlike ner.manual). You can turn it off by adding --unsegmented.
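Using your command from above, that would look like:

prodigy ner.correct dee_correct_2000 dee_manual_1000_model detox_data.jsonl --label <labels> --exclude dee_manual_1000 --unsegmented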

That's tough to confirm. Let me go through a reproducible example of what should happen.

Start with this source file:
nyt_text_dedup.jsonl (18.5 KB)

Step 1: Label 10 records into dataset ner_correct1

python -m prodigy ner.correct ner_correct1 en_core_web_sm nyt_text_dedup.jsonl --label LOC

I then labeled the first 10 records. You can see them by running:

$ python -m prodigy print-dataset ner_correct1

Step 2: Rerun but use --exclude to exclude records in ner_correct1

python3 -m prodigy ner.correct ner_correct2 en_core_web_sm data/nyt_text_dedup.jsonl --exclude ner_correct1 --label LOC

Notice it starts on record 10 (see metadata in bottom right). Therefore, it skipped the first 10 records.

Yes. ner.manual and ner.correct will present examples based on the order of the documents in the source file. This is different from ner.teach, which uses active learning and will alter the order of the documents based on uncertainty scoring.

Let us know if you have any other questions!