I'm posting an update here for this task, along with a few questions based on some hiccups I've been having.
As mentioned in the first post, this is an NER task to extract clinical concepts (specifically concepts relating to detox) from medical notes. I can't share the note details here, but I'll share the relevant task details. I used the video example I linked earlier to guide my process.
detox_data.jsonl
-- The original file containing the clinical notes; it contains 15,895 lines, one per snippet
detox_event_extraction
-- The dataset in the database that was created when the SME manually annotated the snippets
- I launched Prodigy with ner.manual, detox_data.jsonl, and a blank:en model to have the SME annotate 1000 documents (over multiple sessions) and save them to the dataset detox_event_extraction, using the following command:
prodigy ner.manual detox_event_extraction blank:en detox_data.jsonl --label <labels>
- I extracted the annotated dataset to a new jsonl file using the command:
prodigy db-out detox_event_extraction > annotated_snippets.jsonl
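Before relying on the exported file, it can help to sanity-check it. The following is a minimal sketch, assuming each exported line is a JSON object with "text", "spans" and "answer" keys; the DETOX label and the two sample lines are made up for illustration, since the real labels and notes are not shown here.

```python
import json
from collections import Counter


def summarize_annotations(lines):
    """Count answers and accepted entity labels in Prodigy-style JSONL lines."""
    answers = Counter()
    labels = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        answers[example.get("answer", "missing")] += 1
        # Only count spans from examples the annotator accepted
        if example.get("answer") == "accept":
            for span in example.get("spans", []):
                labels[span["label"]] += 1
    return answers, labels


# Made-up stand-in for two lines of annotated_snippets.jsonl
# (the DETOX label is hypothetical)
sample = [
    '{"text": "Patient admitted for detox.", '
    '"spans": [{"start": 21, "end": 26, "label": "DETOX"}], "answer": "accept"}',
    '{"text": "No relevant history.", "spans": [], "answer": "reject"}',
]

answers, labels = summarize_annotations(sample)
print(dict(answers))
print(dict(labels))

# With the real export, the file object can be passed in directly:
# with open("annotated_snippets.jsonl", encoding="utf8") as f:
#     answers, labels = summarize_annotations(f)
```

If the number of accepted examples is far below 1000, or some labels barely appear, that would be worth knowing before training.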
- For reasons that are not important to this topic, I had to drop the detox_event_extraction dataset, and I also ended up deleting the database file detox_event_extraction.db. After some messing around, and with the help of the post here, I was able to get another database file and re-added the annotated snippets to a new dataset called dee_manual_1000
- I then trained a NER model using the scispacy model as a starting point:
prodigy train ner dee_manual_1000 en_core_sci_lg --output dee_manual_1000_model --eval-split 0.2
The training took a while, and the results were not very good, with an F1 score of only 20.
- I ran the train-curve command, which showed good improvements as more data was added. So my next step is to have the SME annotate more documents, and to iteratively train and annotate until the performance is acceptable.
According to the tutorial video, my next step was to launch Prodigy with ner.correct:
prodigy ner.correct dee_correct_2000 dee_manual_1000_model detox_data.jsonl --label <labels> --exclude dee_manual_1000
I would like to point out three things:
- I want to save the new annotations in a separate dataset, dee_correct_2000, as suggested by the tutorial
- I'm using the already trained model dee_manual_1000_model
- I would like to exclude the already annotated snippets, hence I've added the --exclude option with the appropriate dataset name
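To make the intent of the third point concrete, here is a rough sketch of the same idea done outside Prodigy: filtering the source file by exact text match against the exported annotations. It assumes every line in both files is a JSON object with a "text" key and that the annotated snippets keep the full original text. It does not reproduce whatever matching --exclude performs internally, and the sample lines and the output filename detox_data_remaining.jsonl are made up.

```python
import json


def drop_annotated(source_lines, annotated_lines):
    """Yield source lines whose "text" does not appear in the annotated lines."""
    # Collect the texts that have already been annotated
    seen = {
        json.loads(line)["text"]
        for line in annotated_lines
        if line.strip()
    }
    for line in source_lines:
        if not line.strip():
            continue  # skip blank lines
        if json.loads(line)["text"] not in seen:
            yield line.rstrip("\n")


# Made-up stand-ins for detox_data.jsonl and annotated_snippets.jsonl
source = [
    '{"text": "Snippet one."}',
    '{"text": "Snippet two."}',
]
annotated = [
    '{"text": "Snippet one.", "spans": [], "answer": "accept"}',
]

remaining = list(drop_annotated(source, annotated))
print(remaining)

# With the real files (output filename is hypothetical):
# with open("detox_data.jsonl", encoding="utf8") as src, \
#         open("annotated_snippets.jsonl", encoding="utf8") as ann, \
#         open("detox_data_remaining.jsonl", "w", encoding="utf8") as out:
#     for line in drop_annotated(src, ann.readlines()):
#         out.write(line + "\n")
```

A pre-filtered file like this would only contain snippets the SME has not seen, regardless of how the recipe presents them.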
This is the point where I'm running into a couple of issues, and I have the following questions:
- ner.correct seems to be doing sentence segmentation and showing only one sentence at a time. I would like it to show the whole document, as it would in ner.manual. How do I achieve this?
- Despite adding the --exclude option in the original command, I see samples from the original annotations, which gives a feeling of "starting from scratch", and unfortunately SME time is valuable. Why is this happening?
- What is the sequence of documents that is displayed when running ner.manual vs ner.correct? Do they follow the same sequence as presented in the jsonl file? I know that ner.manual does, but I'm not sure about ner.correct.
Thank you for reading this long post, and for any help that is provided!