Annotated Dataset and NER task with Prodigy

sudarshan85 · February 4, 2022, 5:43pm

I'm posting an update here for this task along with a few questions based on some hiccups I've been having.
As mentioned in the first post this is a NER task to extract clinical concepts (specifically concepts relating to detox) from medical notes. I can't share the note details here but I'll share relevant task details. I used the video example I linked earlier to guide my process.

detox_data.jsonl -- The original file containing the clinical notes. It has 15,895 lines, one snippet per line
detox_event_extraction -- Dataset in the database that was created when the SME manually annotated the snippets

I launched Prodigy with ner.manual with detox_data.jsonl and a blank:en model to have the SME annotate 1000 documents (over multiple sessions) and save it to the dataset detox_event_extraction using the following command:

prodigy ner.manual detox_event_extraction blank:en detox_data.jsonl --label <labels>

I extracted the annotated dataset to a new jsonl file using the command:

prodigy db-out detox_event_extraction > annotated_snippets.jsonl
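Since this exported file ended up being the only copy of the annotations, a quick sanity check on it may be worthwhile. The following is a minimal sketch, assuming each exported line is a JSON object carrying Prodigy's usual "answer" and "spans" keys; the function name and the sample records are hypothetical and only for illustration.

```python
import json
from collections import Counter


def summarize_annotations(lines):
    """Count records per "answer" value and the total number of labeled spans."""
    answers = Counter()
    span_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without an "answer" key are counted under "missing".
        answers[record.get("answer", "missing")] += 1
        # "spans" may be absent when nothing was highlighted.
        span_count += len(record.get("spans") or [])
    return answers, span_count


# Made-up sample records standing in for lines of annotated_snippets.jsonl.
sample_lines = [
    '{"text": "example one", "answer": "accept", '
    '"spans": [{"start": 0, "end": 7, "label": "EXAMPLE"}]}',
    '{"text": "example two", "answer": "reject"}',
]
answers, span_count = summarize_annotations(sample_lines)
print(dict(answers), span_count)
```

To run it on the real export, pass an open file handle for annotated_snippets.jsonl instead of the sample list.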

For reasons that are not important to this topic, I had to drop the detox_event_extraction dataset, and I also ended up deleting the database file detox_event_extraction.db. After some messing around and with the help of the post here, I was able to get another database file and re-added the annotated snippets to a new dataset called dee_manual_1000.
I then trained a NER model using the scispacy model as a starting point:

prodigy train ner dee_manual_1000 en_core_sci_lg --output dee_manual_1000_model  --eval-split 0.2

The training took a while and the results were not very good, with an F1 score of only 20.
I ran the train-curve command, which showed good improvements as more data was added. So my next step is to have the SME annotate more documents, then iteratively train and annotate until the performance is acceptable.

According to the tutorial video, my next step was to launch Prodigy with ner.correct:

prodigy ner.correct dee_correct_2000 dee_manual_1000_model detox_data.jsonl --label <labels> --exclude dee_manual_1000

I would like to point out three things:

I want to save the new annotations in a separate dataset, dee_correct_2000, as suggested by the tutorial
I'm using the already trained model dee_manual_1000_model
I would like to exclude the already annotated snippets, hence I've added the --exclude option with the appropriate dataset name

This is the point where I'm running into a couple of issues, and I have the following questions:

ner.correct seems to be doing sentence segmentation and showing only one sentence at a time. I would like it to show the whole document as it would in ner.manual. How do I achieve this?
Despite adding the --exclude option in the original command, I see samples from the original annotations, which gives a feeling of "starting from scratch", and unfortunately SME time is valuable. Why is this happening?
What is the sequence of documents that is displayed when running ner.manual vs ner.correct? Do they follow the same sequence as presented in the jsonl file? I know that ner.manual does, but I'm not sure about ner.correct.
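One possible stopgap for the second question is to filter the source file before launching the recipe, so already-annotated snippets never reach the SME. The sketch below assumes every record has a "text" key and that annotated texts are byte-for-byte identical to the source texts; the function name is hypothetical, and this sidesteps the --exclude behavior rather than explaining it.

```python
import json


def remaining_snippets(source_lines, annotated_lines):
    """Yield source lines whose "text" was not annotated, in source order."""
    # Collect the exact texts that already have annotations.
    seen = {
        json.loads(line)["text"]
        for line in annotated_lines
        if line.strip()
    }
    for line in source_lines:
        if not line.strip():
            continue  # skip blank lines
        if json.loads(line)["text"] not in seen:
            yield line.rstrip("\n")


# Made-up records standing in for detox_data.jsonl and annotated_snippets.jsonl.
source = ['{"text": "first"}', '{"text": "second"}', '{"text": "third"}']
annotated = ['{"text": "second", "answer": "accept"}']
for kept in remaining_snippets(source, annotated):
    print(kept)
```

The kept lines could be written to a new file and passed to ner.correct in place of detox_data.jsonl. Note that exact text matching would miss snippets that were annotated one sentence at a time.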

Thank you for reading this long post and for any help that is provided!
