I am wondering whether there is a simple way to learn the polarity of a word? By polarity I mean the context in which certain keywords appear. For example, working with medical notes, some drugs may have been prescribed while others are only mentioned (but not actually prescribed). The question is how to discriminate between drugs that were actually prescribed and those that were only mentioned.
For example:
['Aspirin has been prescribed to a patient'] -> {[('key_word': 'aspirin', 'prescribed': TRUE)]}
['If symptoms continue, the patient should consider taking Omeprazol'] -> {[('key_word': 'omeprazol', 'prescribed': FALSE)]}
['The plan is for him to commence 25mg of Trazodone as soon as he gets better.'] -> {[('key_word': 'trazodone', 'prescribed': FALSE)]}
['her current meds are: sertraline 200 mg and olanzapine 5 mg'] -> {[('key_word': 'sertraline', 'prescribed': TRUE), ('key_word': 'olanzapine', 'prescribed': TRUE)]}
['if she continues to be depressed, then she needs to be started on Risperidone'] -> {[('key_word': 'risperidone', 'prescribed': FALSE)]}
So, basically, I need to train a model to recognize the intent of certain keywords (such as meds), or correctly classify them. I am exploring Ines's tutorial on insult classification, but it doesn't exactly match my case (though it's similar).
I think it might be misleading to refer to this task as 'polarity', as that's usually applied to sentiment, which isn't quite what you're doing here. 'Intent' isn't the best terminology either, because that's usually applied to parsing commands.
I think the meanings you're trying to classify here are pretty subtle, so a classifier is probably just going to latch onto keywords. You might find that you're better off with a rule-based approach, with the rules referring to the dependency parse. This at least gives you a bit more control over what the system outputs.
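As a rough sketch of what such rules could look like (the drug list and cue words below are made-up examples for your sentences, not a real rule set, and en_core_web_sm is assumed to be installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative lists only -- in practice you'd build these from your data.
DRUGS = {"aspirin", "omeprazol", "trazodone", "sertraline", "olanzapine", "risperidone"}
CUES = {"if", "should", "would", "consider", "plan", "needs"}  # hedging / conditional signals

def classify_mentions(text):
    doc = nlp(text)
    results = []
    for token in doc:
        if token.lower_ not in DRUGS:
            continue
        # Walk up the dependency tree; if any governing word (or one of its
        # immediate children) carries a hedging cue, treat the mention as
        # not actually prescribed.
        hedged = any(
            t.lower_ in CUES
            for anc in token.ancestors
            for t in [anc, *anc.children]
        )
        results.append({"key_word": token.lower_, "prescribed": not hedged})
    return results

print(classify_mentions("Aspirin has been prescribed to a patient"))
# e.g. [{'key_word': 'aspirin', 'prescribed': True}]
print(classify_mentions("if she continues to be depressed, then she needs to be started on Risperidone"))
# e.g. [{'key_word': 'risperidone', 'prescribed': False}]
```

In practice you'd want to derive the cue rules from your annotated data rather than guess them, and spaCy's DependencyMatcher can express much more precise patterns than the loose ancestor walk above.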
If you do want a classifier to do the task, the best way I can think of to structure the model is as a sequence tagging model with the labels PRESCRIBED_DRUG and NON_PRESCRIBED_DRUG. spaCy's default NER model might not be great at this task, though; you'd probably want a BiLSTM model instead of spaCy's CNN.
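To make the framing concrete, here's a minimal sketch of how the annotations could be structured for a spaCy entity recognizer (assuming spaCy v3; the two toy examples and character offsets are illustrative only, and this doesn't change the architecture to a BiLSTM):

```python
import random
import spacy
from spacy.training import Example

# Toy data framed as sequence tagging: each drug mention span is labelled
# PRESCRIBED_DRUG or NON_PRESCRIBED_DRUG (character offsets).
TRAIN_DATA = [
    ("Aspirin has been prescribed to a patient",
     {"entities": [(0, 7, "PRESCRIBED_DRUG")]}),
    ("If symptoms continue, the patient should consider taking Omeprazol",
     {"entities": [(57, 66, "NON_PRESCRIBED_DRUG")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRESCRIBED_DRUG")
ner.add_label("NON_PRESCRIBED_DRUG")

examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)

for _ in range(20):  # a few passes over the toy data
    random.shuffle(examples)
    nlp.update(examples, sgd=optimizer)

doc = nlp("Aspirin has been prescribed to a patient")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With real data you'd train through the spacy train CLI and a config file, which is also where you'd experiment with different model architectures.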
Whether you use a rule-based approach or try to train a model for whether the drugs are prescribed or not, you should definitely have a good evaluation set annotated, with at least 1000 examples. This way you can compare different approaches and refine your rules. If you're doing a rule-based approach, try not to overfit your rules to your annotated data too much. It can help to have separate training and test sets, where you're allowed to look at the training set but not at the test set.
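A sketch of what that comparison could look like, assuming the gold annotations are stored as one JSON record per line with "text" and "mentions" fields (the file format and field names here are just assumptions):

```python
import json

def evaluate(predict, gold_path):
    # `predict` is any callable (rules or a model wrapper) returning
    # [{"key_word": ..., "prescribed": ...}, ...] for a piece of text.
    tp = fp = fn = 0
    with open(gold_path) as f:
        for line in f:  # one JSON record per line
            record = json.loads(line)
            gold = {(m["key_word"], m["prescribed"]) for m in record["mentions"]}
            pred = {(m["key_word"], m["prescribed"]) for m in predict(record["text"])}
            tp += len(gold & pred)
            fp += len(pred - gold)
            fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```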