Help on annotating negative findings for a new entity


I’ll start with an apology for my lack of knowledge about NLP, ML, spaCy, Prodigy, etc. :slight_smile:

I’ve been trying to create a new entity type, which I’m labeling FINDINGS, along the lines described in the video ‘Training a new entity type on Reddit comments’.
Since we have a database of terms for clinical and lab findings, I created a JSONL file of a few thousand patterns along the lines of:

{"label": "FINDINGS", "pattern": [{"lower": "seizures"}]}
{"label": "FINDINGS", "pattern": [{"lower": "seizing"}]}
{"label": "FINDINGS", "pattern": [{"lower": "diarrhea"}]}
{"label": "FINDINGS", "pattern": [{"lower": "diarrhoea"}]}
{"label": "FINDINGS", "pattern": [{"lower": "scours"}]}
{"label": "FINDINGS", "pattern": [{"lower": "lumps"}]}
{"label": "FINDINGS", "pattern": [{"lower": "masses"}]}
{"label": "FINDINGS", "pattern": [{"lower": "cough"}]}
{"label": "FINDINGS", "pattern": [{"lower": "coughing"}]}
{"label": "FINDINGS", "pattern": [{"lower": "weight"}, {"lower": "loss"}]}
{"label": "FINDINGS", "pattern": [{"lower": "weight"}, {"lower": "reduction"}]}
{"label": "FINDINGS", "pattern": [{"lower": "losing"}, {"lower": "weight"}]}
{"label": "FINDINGS", "pattern": [{"lower": "increased"}, {"lower": "liver"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "elevated"}, {"lower": "hepatic"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "elevated"}, {"lower": "liver"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "increased"}, {"lower": "hepatic"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "alopecia"}]}
{"label": "FINDINGS", "pattern": [{"lower": "baldness"}]}
{"label": "FINDINGS", "pattern": [{"lower": "bald"}]}
{"label": "FINDINGS", "pattern": [{"lower": "hair"}, {"lower": "loss"}]}
{"label": "FINDINGS", "pattern": [{"lower": "pupd"}]}
{"label": "FINDINGS", "pattern": [{"lower": "hypercalcemia"}]}
{"label": "FINDINGS", "pattern": [{"lower": "elevated"}, {"lower": "calcium"}, {"lower": "levels"}]}
{"label": "FINDINGS", "pattern": [{"lower": "increased"}, {"lower": "calcium"}]}
{"label": "FINDINGS", "pattern": [{"lower": "hypercalcaemia"}]}
{"label": "FINDINGS", "pattern": [{"lower": "vomiting"}]}
{"label": "FINDINGS", "pattern": [{"lower": "emesis"}]}
{"label": "FINDINGS", "pattern": [{"lower": "vomit"}]}
{"label": "FINDINGS", "pattern": [{"lower": "vomition"}]}
{"label": "FINDINGS", "pattern": [{"lower": "pruritus"}]}
{"label": "FINDINGS", "pattern": [{"lower": "itch"}]}
{"label": "FINDINGS", "pattern": [{"lower": "itchiness"}]}
{"label": "FINDINGS", "pattern": [{"lower": "pruritis"}]}
{"label": "FINDINGS", "pattern": [{"lower": "swelling"}]}
{"label": "FINDINGS", "pattern": [{"lower": "swellings"}]}
{"label": "FINDINGS", "pattern": [{"lower": "swollen"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urinary"}, {"lower": "incontinence"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urine"}, {"lower": "dribbling"}]}
{"label": "FINDINGS", "pattern": [{"lower": "involuntary"}, {"lower": "urination"}]}
{"label": "FINDINGS", "pattern": [{"lower": "involuntary"}, {"lower": "micturition"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urine"}, {"lower": "leakage"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urinary"}, {"lower": "leakage"}]}
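
For anyone wanting to build a similar file, here is a minimal sketch of how patterns in this shape could be generated from a plain list of terms. The function names and the sample terms are made up for illustration, and it assumes that splitting on whitespace matches how spaCy tokenizes the terms, which will not hold for terms containing hyphens or other punctuation.

```python
import json


def term_to_pattern(term, label="FINDINGS"):
    """Build one token pattern with a {"lower": ...} dict per word."""
    tokens = [{"lower": word} for word in term.lower().split()]
    return {"label": label, "pattern": tokens}


def terms_to_jsonl(terms, label="FINDINGS"):
    """Return JSONL text, one pattern per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for term in terms:
        # Normalize case and whitespace so duplicates are detected reliably.
        key = " ".join(term.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        lines.append(json.dumps(term_to_pattern(key, label)))
    return "\n".join(lines)


# Hypothetical sample; in practice the terms would come from the database.
sample_terms = ["Seizures", "weight loss", "Weight  Loss", "elevated liver enzymes"]
print(terms_to_jsonl(sample_terms))
```

Deduplicating on the normalized term keeps the file small when the database holds the same term with different casing or spacing.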

Initially I created a new dataset and ran ner.teach against 1000 message board posts like the ones I want to use it on. I used the en_core_web_lg model and the FINDINGS patterns above.

As I went through the annotation process, I realized that, in addition to all the ones that matched, I had a lot of phrases like:
excessive thirst and urination
mild increases to his Na and Cl
bleeding from the left nostril
No prior history of bleeding
slight increase in BUN
Liver enzymes were normal
elevated HCT and Platelets
no lameness or pain
no nystagmus or head tilt
heart and lung sounds are normal
no longer tachypneic
no vomiting or diarrhea
normal behavior and appetitie
no apparent neck pain or other neurologic deficits
wasn’t swollen
increased drinking, and licking of objects
lung sounds if anything are muffled

I wasn’t sure how to handle those, or how the model being trained would handle them.
“No lameness” is a completely different finding than “lameness”.
“No” or another term at the beginning of a sentence was often used when it applied to a number of findings, e.g. “No lameness or pain”.

I have watched a few of the videos and did hear the part about trying again and again, but I was still hoping you might have some hints.

I did go back and spent all afternoon manually labeling the “no” together with the finding, but the training numbers were pretty dismal. I am regrouping to clean up the findings patterns and to add more message board posts to teach and train on, to see if that might help.

I appreciate your help and all the work you have done on spaCy and Prodigy.


It sounds to me like your classification scheme might make it quite difficult for the NER system. The NER model was really tuned for named entities, where the boundaries of the entity are very important. It’s a model that’s well-tuned for finding phrases which start with clear features like capital letters.

I think the distinction in your data between a phrase like “no appetite” and “no lameness” is going to be very difficult for the model. If I understand correctly, the classification of a phrase of the form “no X” will depend on how normal “X” is. So in order to solve the classification task, the model needs to get to the word “no”, draw in information from the head word X, and know whether X is normal. That’s really difficult.

Is it possible to use the dependency parse more? Have a look at the docs for it here:

I’m hoping you can use mostly word lists, and write dependency parse rules to get the larger phrase boundaries. I think word sense ambiguity should be fairly low in your domain, so if you have the right word lists, you might not need to train a context-sensitive model.
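
To give a rough idea of what such a dependency rule could look like, here is a small sketch for the “no lameness or pain” case. It assumes spaCy v3, where a Doc can be built directly from words, heads and deps. The parse is written by hand purely for illustration (in a real pipeline the parser would supply it), and the cue list and helper names are made up.

```python
import spacy
from spacy.tokens import Doc

# Illustrative cue list (an assumption, not a complete inventory).
NEGATION_CUES = {"no", "not", "n't", "without"}


def first_conjunct(token):
    """Follow "conj" arcs back to the first item of a coordination."""
    while token.dep_ == "conj" and token.head.i != token.i:
        token = token.head
    return token


def is_negated(token):
    """True if the token, or the first conjunct it is coordinated with,
    has a negation cue as a direct child in the parse."""
    anchor = first_conjunct(token)
    return any(child.text.lower() in NEGATION_CUES for child in anchor.children)


nlp = spacy.blank("en")  # only used for its vocab here

# Hand-written parse of "no lameness or pain" (assumed, not parser output).
words = ["no", "lameness", "or", "pain"]
heads = [1, 1, 1, 1]  # absolute index of each token's head
deps = ["det", "ROOT", "cc", "conj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for token in doc:
    if token.text in {"lameness", "pain"}:
        print(token.text, "negated" if is_negated(token) else "affirmed")
```

With this parse, both “lameness” and “pain” come out as negated, because “pain” inherits the “no” attached to the first conjunct. Real parser output on message board text will be noisier, so rules like this are worth checking against a manually annotated sample.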

One thing you’ll probably want to do is use the ner.manual recipe to get yourself an unbiased, manually annotated evaluation set. You probably also want to keep a pen and paper nearby and just jot down notes about what sort of patterns you’ll need to extract as you go.

To make this a bit more concrete: We recently added a "models and rules" section to the spaCy docs that shows some examples of how to use statistical predictions and rules for more complex information extraction tasks, similar to what you're trying to achieve:

You might also want to look into the new EntityRuler component, which takes patterns in the same format as Prodigy and assigns entities based on those pattern rules. It can also be combined with the statistical named entity recognizer.
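
As a minimal sketch of the EntityRuler with patterns in the format from the original post: this assumes the spaCy v3 API, where the component is added by its string name, and uses a blank English pipeline so no trained model is needed. The example sentence is made up.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "FINDINGS", "pattern": [{"lower": "vomiting"}]},
    {"label": "FINDINGS", "pattern": [{"lower": "diarrhea"}]},
    {"label": "FINDINGS", "pattern": [{"lower": "weight"}, {"lower": "loss"}]},
])

doc = nlp("No vomiting or diarrhea, but some weight loss.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Note that the ruler marks “vomiting” and “diarrhea” as FINDINGS even though the sentence negates them, so the negation still has to be handled in a separate step. In a pipeline that also has a statistical entity recognizer, the ruler can be placed before it with `nlp.add_pipe("entity_ruler", before="ner")`.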

Thank you both for helping me understand better how to look at and think about NER, and to start figuring out what dependency parsing and rule-based matching are and how they fit into the picture. Getting me pointed in a better direction is much appreciated :slight_smile:

So much for a quick proof of concept :slight_smile: Lots to learn, try, and fail at, but Prodigy should be a big help.

Thanks again,