Hi,
I’ll start with an apology for my lack of knowledge about NLP, ML, spaCy, Prodigy, etc.
I’ve been trying to create a new entity type, which I’m labeling FINDINGS, along the lines described in the video ‘Training a new entity type on Reddit comments’.
Since we have a database of terms for clinical and lab findings, I created a JSONL file of a few thousand patterns along the lines of:
{"label": "FINDINGS", "pattern": [{"lower": "seizures"}]}
{"label": "FINDINGS", "pattern": [{"lower": "seizing"}]}
{"label": "FINDINGS", "pattern": [{"lower": "diarrhea"}]}
{"label": "FINDINGS", "pattern": [{"lower": "diarrhoea"}]}
{"label": "FINDINGS", "pattern": [{"lower": "scours"}]}
{"label": "FINDINGS", "pattern": [{"lower": "lumps"}]}
{"label": "FINDINGS", "pattern": [{"lower": "masses"}]}
{"label": "FINDINGS", "pattern": [{"lower": "cough"}]}
{"label": "FINDINGS", "pattern": [{"lower": "coughing"}]}
{"label": "FINDINGS", "pattern": [{"lower": "weight"}, {"lower": "loss"}]}
{"label": "FINDINGS", "pattern": [{"lower": "weight"}, {"lower": "reduction"}]}
{"label": "FINDINGS", "pattern": [{"lower": "losing"}, {"lower": "weight"}]}
{"label": "FINDINGS", "pattern": [{"lower": "increased"}, {"lower": "liver"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "elevated"}, {"lower": "hepatic"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "elevated"}, {"lower": "liver"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "increased"}, {"lower": "hepatic"}, {"lower": "enzymes"}]}
{"label": "FINDINGS", "pattern": [{"lower": "alopecia"}]}
{"label": "FINDINGS", "pattern": [{"lower": "baldness"}]}
{"label": "FINDINGS", "pattern": [{"lower": "bald"}]}
{"label": "FINDINGS", "pattern": [{"lower": "hair"}, {"lower": "loss"}]}
{"label": "FINDINGS", "pattern": [{"lower": "pupd"}]}
{"label": "FINDINGS", "pattern": [{"lower": "hypercalcemia"}]}
{"label": "FINDINGS", "pattern": [{"lower": "elevated"}, {"lower": "calcium"}, {"lower": "levels"}]}
{"label": "FINDINGS", "pattern": [{"lower": "increased"}, {"lower": "calcium"}]}
{"label": "FINDINGS", "pattern": [{"lower": "hypercalcaemia"}]}
{"label": "FINDINGS", "pattern": [{"lower": "vomiting"}]}
{"label": "FINDINGS", "pattern": [{"lower": "emesis"}]}
{"label": "FINDINGS", "pattern": [{"lower": "vomit"}]}
{"label": "FINDINGS", "pattern": [{"lower": "vomition"}]}
{"label": "FINDINGS", "pattern": [{"lower": "pruritus"}]}
{"label": "FINDINGS", "pattern": [{"lower": "itch"}]}
{"label": "FINDINGS", "pattern": [{"lower": "itchiness"}]}
{"label": "FINDINGS", "pattern": [{"lower": "pruritis"}]}
{"label": "FINDINGS", "pattern": [{"lower": "swelling"}]}
{"label": "FINDINGS", "pattern": [{"lower": "swellings"}]}
{"label": "FINDINGS", "pattern": [{"lower": "swollen"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urinary"}, {"lower": "incontinence"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urine"}, {"lower": "dribbling"}]}
{"label": "FINDINGS", "pattern": [{"lower": "involuntary"}, {"lower": "urination"}]}
{"label": "FINDINGS", "pattern": [{"lower": "involuntary"}, {"lower": "micturition"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urine"}, {"lower": "leakage"}]}
{"label": "FINDINGS", "pattern": [{"lower": "urinary"}, {"lower": "leakage"}]}
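For context, this is roughly how a file in that format can be produced from a list of terms. It is only a minimal sketch: the helper names `term_to_pattern` and `terms_to_jsonl` are hypothetical (not part of spaCy or Prodigy), and it assumes that splitting a term on whitespace gives the same tokens as spaCy's tokenizer, which may not hold for terms with hyphens or punctuation.

```python
import json


def term_to_pattern(term, label="FINDINGS"):
    """Build one pattern entry from a plain-text term.

    Assumption: whitespace splitting matches spaCy's tokenization.
    """
    tokens = [{"lower": tok} for tok in term.lower().split()]
    return {"label": label, "pattern": tokens}


def terms_to_jsonl(terms, label="FINDINGS"):
    """Return JSONL text with one pattern per unique, non-empty term."""
    seen = set()
    lines = []
    for term in terms:
        # Normalize case and spacing so near-duplicates collapse to one entry.
        key = " ".join(term.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        lines.append(json.dumps(term_to_pattern(key, label)))
    return "\n".join(lines)


# Illustrative terms only; the real list comes from the findings database.
terms = ["Seizures", "weight loss", "Weight  Loss", "urinary incontinence"]
print(terms_to_jsonl(terms))
```

With the four example terms this prints three lines, because ‘weight loss’ and ‘Weight  Loss’ normalize to the same pattern.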
Initially I created a new dataset and ran ner.teach against 1000 message board posts like the ones I want to use it on. I used the en_core_web_lg model and the findings patterns above.
As I was going through the annotation process I realized that, in addition to all the ones that matched, I had a lot of phrases like:
excessive thirst and urination
mild increases to his Na and Cl
bleeding from the left nostril
No prior history of bleeding
slight increase in BUN
Liver enzymes were normal
elevated HCT and Platelets
no lameness or pain
no nystagmus or head tilt
heart and lung sounds are normal
no longer tachypneic
no vomiting or diarrhea
normal behavior and appetitie
no apparent neck pain or other neurologic deficits
wasn’t swollen
increased drinking, and licking of objects
lung sounds if anything are muffled
I wasn’t sure how to handle those, or how the model being trained would handle them.
‘No lameness’ is a completely different finding from ‘lameness’.
‘No’ or another negating term at the start of a phrase often applied to several findings at once, e.g. ‘No lameness or pain’.
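To make the scope problem concrete, here is a minimal sketch of what I mean by one negating term covering several findings. It is plain Python and not how spaCy or Prodigy handle negation. The cue and scope-breaker lists are hypothetical, the function name `tag_findings` is made up for illustration, and it only handles single-word findings.

```python
import re

# Hypothetical cue lists for illustration; a real list would be much longer.
NEGATION_CUES = {"no", "not", "without", "wasn't", "isn't", "weren't"}
SCOPE_BREAKERS = {"but", "however", "although", ".", ";"}


def tag_findings(sentence, findings):
    """Return (finding, is_negated) pairs in order of appearance.

    A negation cue stays active across 'and'/'or' until a scope breaker
    is reached, so 'No lameness or pain' negates both findings.
    """
    # Normalize curly apostrophes so "wasn’t" matches the cue list.
    text = sentence.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+|[.;]", text)
    negated = False
    results = []
    for tok in tokens:
        if tok in NEGATION_CUES:
            negated = True
        elif tok in SCOPE_BREAKERS:
            negated = False
        elif tok in findings:
            results.append((tok, negated))
    return results


print(tag_findings("No lameness or pain", {"lameness", "pain"}))
print(tag_findings("Vomiting but no diarrhea", {"vomiting", "diarrhea"}))
```

In the first example both findings come back flagged as negated; in the second only ‘diarrhea’ does. What I don’t know is whether the entity span itself should include the ‘no’, or whether the negation should be tracked separately like this.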
I have watched a few of the videos and did hear the part about trying and trying again, but I was still hoping you might have some hints.
I did go back and spend all afternoon manually labeling each ‘no’ together with its finding, but the training numbers were pretty dismal. I am regrouping to clean up the findings patterns, and I will try adding more message board posts to teach and train on to see if that helps.
I appreciate your help and all the work you have done on spaCy and Prodigy.
Thanks,
Nicky