I have a problem where I'm trying to train for a clinical testing status that's very easily detected with a token based rule. The only problem is often a sentence will be referring to the status rather than an actual result. In this case the writer is saying they will test or are awaiting a test result. The semantic structure of these is very different to each other. My question would be if this is a good use of NER? I see the problem as similar to detecting Google the organization and google used as a verb.
Do you have an example of a text and the entities you're currently detecting with rules and want to disambiguate?
I'm not sure NER would be the best way to approach this problem. I remember a similar(ish?) case where the goal was to detect symptoms and whether the patient had them or not. So the entity might be "abdominal pain", but the text could talk about the patient reporting it or not reporting it. The problem here is that the concept of "abdominal pain" is the same in both cases – it's just that the semantic relationships are different. Or, to use the Google example: it's more like the company "Google" being used in different contexts and trying to extract only one of them.
Here are some ideas for how to approach this:
- Inspect the constructions and trigger words/phrases that indicate the status vs. the actual result. Here's an example of coming up with and writing these kinds of rules. In some cases, this could be very straightforward, like "lemma: await" → attached to "lemma: result" → preposition → entity. In other cases, it might be a little trickier. You may not be able to solve your entire problem this way, but it'll give you a rule-based baseline that you can evaluate any other approaches against.
- If needed: fine-tune the generic components like the tagger and parser on your data (e.g. using pos.teach and dep.teach). If those components are accurate, it will make it much easier to use the syntax to extract information.
- If you're mostly dealing with the same context per sentence, try to encode the problem as a text classification task. This can often work very well. For instance, for all sentences containing your entities, label and predict whether it's about a result or not (or which of N statuses applies). Of course, this will only work if sentences typically only talk about one status.
- Try to encode it as a pure NER task and see how you go. Should be quick to annotate if you already have the rules in place. Then compare the results to your rule-based baseline (see above) and the text classification approach.
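To make the "quick to annotate if you already have the rules in place" point concrete, here is a minimal sketch of turning rule matches into pre-highlighted spans. It assumes a task format with a "text" key and a "spans" list of character offsets ("start", "end", "label"), which is how I understand Prodigy represents pre-annotated spans; check the README for the exact format. The regex, the SYMPTOM label and the example sentences are hypothetical stand-ins for your own rule and data.

```python
import json
import re

# Hypothetical token-based rule: a stand-in for whatever rule already finds your entity.
RULE = re.compile(r"\babdominal pain\b", re.IGNORECASE)


def to_span_task(text, label="SYMPTOM"):
    """Wrap every rule match in a span dict with character offsets."""
    spans = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in RULE.finditer(text)
    ]
    return {"text": text, "spans": spans}


# Hypothetical sentences, for illustration only.
examples = [
    "Patient reports abdominal pain since Monday.",
    "No abdominal pain reported.",
]

# One JSON object per line (JSONL), ready to be reviewed and corrected by hand.
for text in examples:
    print(json.dumps(to_span_task(text)))
```

Both sentences get the same span, which is exactly the limitation described above: the rule finds the concept but says nothing about whether it was reported or denied.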
Hi @ines, thanks for your response. Here are a couple of examples.
PDL1 References:
"check point inhibition with an anti PD 1 mab could be utilized in the future if"
"Requested PD L1 stain to tumor bx sample from Aug"
"Will request PDL 1 testing and if >1% plan to start single"
Here are example statuses:
"The only test result on her extended testing was PDL 1 which was 0"
"PDL1 testing negative, 0%"
"PDL1 testing negative, <1%"
Can NER be trained to be sensitive to the sentence semantics? Many of the testing references are future tense or don't have an actual score. What scenarios is it better to employ Text Classification over NER?
I tried to get dep.teach working, but it wouldn't accept my input annotations. Is there detailed documentation on what it expects?
Those are definitely things you can use in your rule-based baseline. So essentially, see how far you get if you just extract the information based on the tense and whether the Doc contains scores. This gives you an accuracy number that you need to beat (because without a baseline, you're often just shooting randomly in the dark).
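As a rough illustration of such a baseline, here is a sketch in plain Python that uses regular expressions instead of a parsed Doc. The cue lists and the RESULT/REFERENCE label names are my own assumptions, not anything built into spaCy or Prodigy, and the sentences are the ones you posted above.

```python
import re

# Hypothetical cue lists, for illustration only; extend them from your own notes.
PENDING_CUES = re.compile(
    r"\b(will|would|could|plan(?:s|ned)? to|request(?:ed|ing)?|await(?:ing|ed)?|pending)\b",
    re.IGNORECASE,
)
# A "score" here means a percentage, the word negative/positive, or "was/is <number>".
SCORE = re.compile(
    r"\d+(?:\.\d+)?\s*%|\b(?:negative|positive)\b|\b(?:was|is)\s+\d+(?:\.\d+)?\b",
    re.IGNORECASE,
)


def classify(sentence):
    """Label a sentence RESULT only if it has a score and no future/request cue."""
    has_pending = PENDING_CUES.search(sentence) is not None
    has_score = SCORE.search(sentence) is not None
    return "RESULT" if has_score and not has_pending else "REFERENCE"


def accuracy(labelled):
    """Fraction of (text, gold_label) pairs the rules get right."""
    correct = sum(classify(text) == gold for text, gold in labelled)
    return correct / len(labelled)


EXAMPLES = [
    ("check point inhibition with an anti PD 1 mab could be utilized in the future if", "REFERENCE"),
    ("Requested PD L1 stain to tumor bx sample from Aug", "REFERENCE"),
    ("Will request PDL 1 testing and if >1% plan to start single", "REFERENCE"),
    ("The only test result on her extended testing was PDL 1 which was 0", "RESULT"),
    ("PDL1 testing negative, 0%", "RESULT"),
    ("PDL1 testing negative, <1%", "RESULT"),
]

print(accuracy(EXAMPLES))
```

The cues were written with these six sentences in view, so the score on them says nothing about real performance. The number to beat is the one you get on a held-out set of annotated sentences.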
NER typically works well for context-dependent "categories of things", proper nouns etc. For example, the concept of "PDL1". But it's not always a good idea to also try and encode another dimension here – e.g. if the concept of "PDL1" is good or bad, or whatever else it means. This dimension is something you could try to predict on the sentence level – e.g. whether the sentence discusses something positive or something with the meaning X.
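Here is a small sketch of what the sentence-level framing could look like in practice: use the token rule only to select candidate sentences, then attach a label to each one for annotation. The regex, the RESULT label and the "text"/"label" task keys are assumptions on my part (the keys follow what I understand Prodigy's text classification tasks to look like), so treat this as a starting point rather than a reference.

```python
import json
import re

# Hypothetical rule for the mention itself: PDL1, PD L1, PDL 1, PD-L1, ...
ENTITY = re.compile(r"\bPD[\s-]?L[\s-]?1\b", re.IGNORECASE)


def to_textcat_tasks(sentences, label="RESULT"):
    """Keep only sentences that mention the entity and attach one label to each."""
    return [{"text": s, "label": label} for s in sentences if ENTITY.search(s)]


sentences = [
    "Requested PD L1 stain to tumor bx sample from Aug",
    "PDL1 testing negative, 0%",
    "Patient seen in clinic today",  # hypothetical sentence without a mention
]

# Each task asks one question: is this sentence about an actual result?
for task in to_textcat_tasks(sentences):
    print(json.dumps(task))
```

The model then only has to learn the result vs. reference distinction, while the rule stays responsible for finding the mention.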
You might find @honnibal's talk on solving different NLP problems helpful. It also has a few good examples:
Ultimately, these are things you need to try out on your specific problem and iterate on until you find an approach (or combination of approaches) that works best.
You can find the detailed format in the "Input formats" or "Annotation task formats" section of your PRODIGY_README.html. dep.teach should just need a raw input file (txt, JSONL etc.) and a pretrained model with a parser.
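If it helps, this is roughly what I mean by a raw JSONL input: one JSON object per line with only a "text" key and no annotations. The helper name is made up for this sketch, and the sentences are taken from your examples.

```python
import json


def to_jsonl(texts):
    """Serialize raw texts as JSONL: one {"text": ...} object per line."""
    return "\n".join(json.dumps({"text": text}) for text in texts)


raw_sentences = [
    "Requested PD L1 stain to tumor bx sample from Aug",
    "PDL1 testing negative, 0%",
]

# Print the JSONL; save this output to a file to use it as the raw input source.
print(to_jsonl(raw_sentences))
```

If the input already contains other annotations, that may be what the recipe is tripping over, so it is worth trying with plain text first.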