Clinical Note Extraction

Hi,

I'm new to Prodigy and have recently trained an NER model. It is definitely a great tool!

But I do have several questions:

I'm working on clinical note mining to extract information (medication, dosage, medical condition, medical procedure, medical device, hemodynamic measurements). It works well so far and gives a decent F-score. To further improve and refine the model, I would like to extract additional information: the vessels that have a condition, receive a procedure, or are treated/diagnosed with a device. Here are some examples:

'thrombotic occlusion at the LIMA/LAD anastomosis' -> thrombotic occlusion is a medical condition, and LIMA and LAD are the vessels having the condition
'A stent was inserted into LAD' -> stent is a device, and LAD is the vessel being treated with stent
'The LAD and diagonal were revascularized with balloon angioplasty and stent' -> balloon angioplasty is procedure and stent is a device, LAD and diagonal are the vessels receiving the procedure and the device

Question 1: In order to add vessel information, should I add another entity 'vessel' and use dependency parser to extract the relation with entities like procedure, condition and device? Or should I highlight the entity spans including vessel like 'thrombotic occlusion at the LIMA/LAD ' and extract common vessel names after that?

Question 2: What might be the best way to identify negations in the notes like 'Heart transplant was delayed' or 'No bleeding or hematoma'?

Question 3: There are terms that can fall into both entities, and I'm wondering what I should do to avoid the confusion. For example:
balloon, ballooned, angioplasty, balloon angioplasty -> balloon is a device, angioplasty is a procedure using balloon, then which entity should I assign ballooned and balloon angioplasty?

Thanks in advance!

Welcome to the forum @wjin :wave:

It's great to hear you like using Prodigy:)

Thank you for providing ample context to your questions - that definitely helps to understand the use case.

Question 1: In order to add vessel information, should I add another entity 'vessel' and use dependency parser to extract the relation with entities like procedure, condition and device? Or should I highlight the entity spans including vessel like 'thrombotic occlusion at the LIMA/LAD ' and extract common vessel names after that?

The first approach you describe, i.e. adding a new entity vessel and resolving its relation to other entities (in this case, location) via dependency rules, would be a much better approach than postprocessing longer spans.
There are several reasons. NER and the span categorizer are very sensitive to lexical features, and I imagine these long spans would have a lot of internal variation that would make it harder for the model to distinguish them from other spans. Also, the procedure and the location might not appear in a consecutive span, which would make them impossible to annotate as one. Finally, extracting the vessel names post hoc, i.e. based on rules, would require a very complete ontology of vessel names that would need constant maintenance.
In general, NER works best with entities that have clearly defined boundaries, and the semantic relation of location should be relatively easy to capture via dependency rules.
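To make the idea of a dependency rule concrete, here is a minimal sketch in plain Python. It assumes a hand-written parse of one of the example sentences (the heads and the labels DEVICE and VESSEL are illustrative assumptions; in practice they would come from a dependency parser and the NER model). The rule itself, "walk up from the vessel until an ancestor is an entity or has an entity among its direct children", is just one possible heuristic and not a Prodigy or spaCy API.

```python
# Hand-written parse of "A stent was inserted into LAD" (an assumption for
# illustration; real heads would come from a dependency parser).
# Each token: (text, index of its syntactic head, entity label or None).
TOKENS = [
    ("A", 1, None),
    ("stent", 3, "DEVICE"),
    ("was", 3, None),
    ("inserted", 3, None),  # root: its head is itself
    ("into", 3, None),
    ("LAD", 4, "VESSEL"),
]


def link_vessels(tokens):
    """Pair each VESSEL token with the nearest non-vessel entity up the tree."""
    pairs = []
    for i, (text, _head, label) in enumerate(tokens):
        if label != "VESSEL":
            continue
        current = i
        while True:
            parent = tokens[current][1]
            # Candidates: the parent itself, then the parent's direct children.
            children = [j for j, t in enumerate(tokens) if t[1] == parent and j != parent]
            candidates = [parent] + children
            match = next(
                (j for j in candidates if tokens[j][2] not in (None, "VESSEL")),
                None,
            )
            if match is not None:
                pairs.append((text, tokens[match][0], tokens[match][2]))
                break
            if parent == current:  # reached the root without finding an entity
                break
            current = parent
    return pairs


print(link_vessels(TOKENS))  # [('LAD', 'stent', 'DEVICE')]
```

With a real parser the same logic would operate on the parser's head attributes, and the heuristic would likely need refining for coordination such as "LAD and diagonal".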

Question 2: What might be the best way to identify negations in the notes like 'Heart transplant was delayed' or 'No bleeding or hematoma'?

The recommended way would be to annotate the token that expresses the negation as a new entity, e.g. neg, and use relations to link it to the negated entity. Once you have such annotations, you can build dependency matcher rules from them. neg would just be an auxiliary entity to support the creation of the dependency rules; you wouldn't be training it in the NER phase, as there's not really a need for it (and probably not enough examples anyway).
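As a rough illustration of what such annotations make possible, the sketch below reads relation annotations and lists the entities linked to a negation span. The record layout (a "relations" list whose items carry "head_span" and "child_span") reflects my understanding of Prodigy's relation output and should be checked against your own exported data; the labels NEG, NEGATES and CONDITION are hypothetical. It only inspects the annotated links and is not yet a dependency matcher rule.

```python
def negated_entities(example, neg_label="NEG"):
    """Return (text, label) for every entity linked to a negation span."""
    text = example["text"]
    negated = []
    for rel in example.get("relations", []):
        head, child = rel["head_span"], rel["child_span"]
        # Accept either arrow direction: the end that is not NEG is negated.
        if head["label"] == neg_label and child["label"] != neg_label:
            target = child
        elif child["label"] == neg_label and head["label"] != neg_label:
            target = head
        else:
            continue
        negated.append((text[target["start"]:target["end"]], target["label"]))
    return negated


# Hypothetical annotated record for one of the examples in the question.
example = {
    "text": "No bleeding or hematoma",
    "relations": [
        {
            "label": "NEGATES",
            "head_span": {"start": 0, "end": 2, "label": "NEG"},
            "child_span": {"start": 3, "end": 11, "label": "CONDITION"},
        },
        {
            "label": "NEGATES",
            "head_span": {"start": 0, "end": 2, "label": "NEG"},
            "child_span": {"start": 15, "end": 23, "label": "CONDITION"},
        },
    ],
}

print(negated_entities(example))
# [('bleeding', 'CONDITION'), ('hematoma', 'CONDITION')]
```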

Question 3: There are terms that can fall into both entities, and I'm wondering what I should do to avoid the confusion. For example:
balloon, ballooned, angioplasty, balloon angioplasty -> balloon is a device, angioplasty is a procedure using balloon, then which entity should I assign ballooned and balloon angioplasty?

In these cases we should leverage the compound nature of the English language and annotate "atomic" entities. The modifier-modified relation between entities should be captured by dependency rules. This should help to avoid the lexical overlap between categories that could definitely lead to confusion. In the example you provided, I would recommend annotating balloon as device and angioplasty as procedure, and I would link the two with a modifier relation. As for ballooned, it would be ideal to have it as device as well, but I imagine it could be considered a procedure? You would have to judge by looking at an example sentence: if a prediction with ballooned as device would be correct for the downstream task, then it should be fine. If it needs to be a procedure, the context is sufficiently different, and there are enough examples, the model might also learn it, but it would certainly be easier if it could be interpreted as device.

Hope that helps! Do let us know if you have any follow up questions :slight_smile:


Thank you so much for the detailed instructions! I still have a few questions after my discussion with my team members:

Question 1:

I totally agree with the idea of adding a new entity for negation phrases, but how do I exclude the auxiliary entities like negation during prodigy train?

Question 2:
Since manual annotation of clinical notes can be very time-consuming, we're thinking about incorporating an LLM via spacy-llm to help us improve the efficiency and accuracy of annotation. We're cautious about sending sensitive data to third-party servers, so we would like to stick with a local LLM first. Can you recommend any local LLM applicable to clinical notes?

Question 3:
I want to get a sense of how my team members annotate clinical notes, so I would like to get 4 of them to annotate the same 50 clinical notes, and I will review them. What might be the most direct and easiest way to set this up while keeping the data safe? (I know ngrok could work, but the data would be sent to a public link.)

Sorry these are a lot of questions, thanks in advance!

Hi @wjin!

Sure thing, answers inline:

how do I exclude the auxiliary entities like negation during prodigy train?

For that we'd have to do a bit of Python scripting to, essentially, create a new dataset stripped of NEG spans. First, you'd need to export the dataset with the db-out command and store it on disk:

python -m prodigy db-out ner_with_neg ./prodigy_datasets

This will save the dataset ner_with_neg as a .jsonl file at ./prodigy_datasets/ner_with_neg.jsonl.
We'll use it as input to a script that filters out the unwanted spans and saves a new dataset on disk:

import copy
import srsly

# Read the file exported by db-out above.
examples = srsly.read_jsonl("./prodigy_datasets/ner_with_neg.jsonl")

updated_examples = []
for eg in examples:
    # Copy the example so the original record is left untouched.
    new_eg = copy.deepcopy(eg)
    # Keep every span except the auxiliary NEG ones.
    # `or []` covers examples that have no "spans" key (or a null value).
    new_eg["spans"] = [
        span for span in eg.get("spans") or [] if span["label"] != "NEG"
    ]
    updated_examples.append(new_eg)

# Write the filtered examples to a new .jsonl file.
srsly.write_jsonl("ner_dataset_wo_neg.jsonl", updated_examples)

And that will be the dataset to use for training. You can load it to the db with the db-in command:

python -m prodigy db-in ner_dataset_wo_neg ner_dataset_wo_neg.jsonl
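Before training, it may be worth confirming that no NEG spans are left in the filtered data. The following is a small sketch using only the standard library; it runs on a made-up in-memory sample (the texts and the labels CONDITION and DEVICE are assumptions), but the same function could be applied to the examples read from the filtered .jsonl file.

```python
from collections import Counter


def count_span_labels(examples):
    """Count how often each span label occurs across all examples."""
    counts = Counter()
    for eg in examples:
        # Examples without any annotated spans contribute nothing.
        for span in eg.get("spans") or []:
            counts[span["label"]] += 1
    return counts


# Made-up sample standing in for the filtered dataset.
sample = [
    {"text": "No bleeding", "spans": [{"start": 3, "end": 11, "label": "CONDITION"}]},
    {"text": "Stent placed", "spans": [{"start": 0, "end": 5, "label": "DEVICE"}]},
    {"text": "Stable"},
]

counts = count_span_labels(sample)
if "NEG" in counts:
    print("NEG spans are still present:", counts["NEG"])
else:
    print("No NEG spans found. Label counts:", dict(counts))
```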

Do you have any recommended local LLM applicable to clinical notes?

Unfortunately, I don't have hands-on experience with any LLMs on clinical data. One of my colleagues suggests taking a look at the Meditron LLM (paper). It would probably be best to shortlist 2-3 candidates and compare their performance on your particular data. Be mindful, though, that arbitrary open-source/Hugging Face models are not supported by spacy-llm. You might need to try leveraging the langchain support or, worst case, write a custom wrapper for the model.

What might be the most direct the easiest way to set this up but keep the data safe at the same time (I know ngrok could work but the data will be sent to a public link)

With the personal license you could set up basic HTTP authentication via the PRODIGY_BASIC_AUTH_USER and PRODIGY_BASIC_AUTH_PASS environment variables.
These could, for example, be set right before the Prodigy command when you invoke it:

PRODIGY_BASIC_AUTH_USER="my_annotator" PRODIGY_BASIC_AUTH_PASS="my_pass" python -m prodigy ...

Then, when accessing the server, the annotator will be asked to provide these credentials. Additionally, you could limit the allowed names via PRODIGY_ALLOWED_SESSIONS.
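As a small sketch of how the allowed session names could be prepared for four annotators: the snippet below builds the comma-separated value for PRODIGY_ALLOWED_SESSIONS and one link per annotator using the ?session= query parameter. The annotator names are hypothetical, and the host and port are assumptions (localhost:8080 is the usual default; adjust to your server).

```python
from urllib.parse import urlencode

# Hypothetical session names, one per annotator.
ANNOTATORS = ["annotator1", "annotator2", "annotator3", "annotator4"]


def allowed_sessions_value(names):
    """Build the comma-separated value for PRODIGY_ALLOWED_SESSIONS."""
    return ",".join(names)


def session_urls(names, host="localhost", port=8080):
    """Map each annotator to a URL carrying their session name."""
    urls = {}
    for name in names:
        query = urlencode({"session": name})
        urls[name] = f"http://{host}:{port}/?{query}"
    return urls


print("PRODIGY_ALLOWED_SESSIONS=" + allowed_sessions_value(ANNOTATORS))
for name, url in session_urls(ANNOTATORS).items():
    print(name, url)
```

For everyone to see the same 50 notes, the feed_overlap setting in the Prodigy configuration would also need to be enabled; it is worth double-checking this against the docs for your Prodigy version.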

The Prodigy Company license comes with more advanced authentication options; we have a guide on how to set them up in our docs here. (By the way, we are currently offering discounts on upgrades to the Company license if you're interested.)
