I have a problem where I'm trying to train a model to detect a clinical testing status that's very easily detected with a token-based rule. The only problem is that a sentence will often refer to the status of a test rather than an actual result: the writer is saying they will test, or are awaiting a test result. The semantic structure of these two cases is very different. My question is whether this is a good use of NER. I see the problem as similar to distinguishing Google the organization from "google" used as a verb.
Do you have an example of a text and the entities you're currently detecting with rules and want to disambiguate?
I'm not sure NER would be the best way to approach this problem. I remember a similar(ish?) case where the goal was to detect symptoms and whether the patient had them or not. So the entity might be "abdominal pain", but the text could talk about the patient reporting it or not reporting it. The problem here is that the concept of "abdominal pain" is the same in both cases – it's just that the semantic relationships are different. Or, to use the Google example: it's more like the company "Google" being used in different contexts and trying to extract only one of them.
Here are some ideas for how to approach this:
- Inspect the constructions and trigger words/phrases that indicate the status vs. the actual result. Here's an example of coming up with and writing these kinds of rules. In some cases, this could be very straightforward, like "lemma: await" → attached to "lemma: result" → preposition → entity. In other cases, it might be a little trickier. You may not be able to solve your entire problem this way, but it'll give you a rule-based baseline that you can evaluate any other approaches against.
- If needed: fine-tune the generic components like the tagger and parser on your data (e.g. using dep.teach). If those components are accurate, it will make it much easier to use the syntax to extract information.
- If you're mostly dealing with the same context per sentence, try to encode the problem as a text classification task. This can often work very well. For instance, for all sentences containing your entities, label and predict whether it's about a result or not (or which of N statuses applies). Of course, this will only work if sentences typically only talk about one status.
- Try to encode it as a pure NER task and see how you go. Should be quick to annotate if you already have the rules in place. Then compare the results to your rule-based baseline (see above) and the text classification approach.
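To make the baseline idea a bit more concrete, here's a minimal sketch in plain Python. Everything specific in it is an assumption for illustration: the entity terms, the trigger phrases, the label names "pending" and "result", and the example sentences are all made up, and the helper names are hypothetical. In practice you'd probably write the rules over lemmas and dependencies (e.g. with spaCy's Matcher or DependencyMatcher) rather than regular expressions, but the shape of the baseline and its evaluation would be roughly the same.

```python
import re

# Hypothetical token-based rule for the entity itself; swap in your own.
ENTITY_RE = re.compile(r"\b(?:covid|hiv)\b", re.IGNORECASE)

# Hypothetical trigger phrases suggesting the test is planned or pending.
PENDING_RE = re.compile(
    r"\b(?:await\w*|pending|will\s+(?:be\s+)?test(?:ed)?|to\s+be\s+tested|sent\s+for)\b",
    re.IGNORECASE,
)


def classify_sentence(sentence):
    """Return None if no entity is mentioned, else 'pending' or 'result'."""
    if not ENTITY_RE.search(sentence):
        return None
    return "pending" if PENDING_RE.search(sentence) else "result"


def accuracy(examples):
    """Share of (sentence, gold_label) pairs the rules get right."""
    correct = sum(classify_sentence(text) == gold for text, gold in examples)
    return correct / len(examples)


# Invented sentences with hand-assigned labels, standing in for real annotations.
EXAMPLES = [
    ("Patient tested positive for COVID.", "result"),
    ("Awaiting COVID test result.", "pending"),
    ("We will test for HIV next week.", "pending"),
    ("HIV test was negative.", "result"),
    ("Patient reports abdominal pain.", None),
]

if __name__ == "__main__":
    for text, gold in EXAMPLES:
        print(f"{classify_sentence(text)!s:8} (gold: {gold!s:8}) {text}")
    print("baseline accuracy:", accuracy(EXAMPLES))
```

Once you have a held-out set of annotated sentences, the same accuracy function gives you a number to compare the text classification and NER approaches against. The sentence-level labels the rules produce could also serve as a starting point for annotation, with the caveat from above that this only works if a sentence talks about one status.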