Hi, I am a newbie in NLP with Python.
I want to find a solution to extract information from a medical dataset using a rule-based approach. The important information is located after specific negation phrases like "no sign of", "denies", "does not contain"… I have a large list of these negation phrases.
What method should I use to get the best performance on this extraction task?
Thank you!
Hey,
Negation can actually be surprisingly subtle. There are the main ways of expressing it with "no" and "not", but there are also lots of words and larger patterns that indicate negation (e.g. "lack of"). There are also scoping questions that can come up, like "no absence".
You might try training a binary per-sentence model using the textcat.teach recipe. A patterns file might be useful as well (see the sketch below). Your classes will be quite imbalanced, as most sentences aren't negations, so you'll probably need a few thousand sentences to train the model accurately. However, I would expect the data to be pretty quick to annotate, maybe around 500 sentences per hour? Give it a try and see how you go.
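For example, the patterns file could seed the annotation with some of your negation cues. Here's a rough, untested sketch (the file and dataset names are just placeholders):

```
{"label": "NEGATION", "pattern": [{"lower": "no"}, {"lower": "sign"}, {"lower": "of"}]}
{"label": "NEGATION", "pattern": [{"lower": "denies"}]}
{"label": "NEGATION", "pattern": [{"lower": "does"}, {"lower": "not"}, {"lower": "contain"}]}
```

You'd then pass it to the recipe with something like `prodigy textcat.teach negation_sentences en_core_web_sm your_data.jsonl --label NEGATION --patterns negation_patterns.jsonl`.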
I'm in the same situation. I'm working with a medical journal, and I have three categories: confirmed diagnosis, uncertain diagnosis and negated diagnosis. So I've created these three labels, but the model learns slowly. Then I thought it might be better to do as Matthew proposes, but consider the following: "The patient has infarct but no ventricular abnormalities". There are two disorders in one sentence, but one is confirmed and the other is negated. Any good advice?
Maybe you could experiment with detecting disorders first, and then use the syntax to resolve the diagnosis type? For example, with a combination of statistical NER and a dictionary, you could probably build a decent system to detect "infarct" and "ventricular abnormalities". You can then write rules that take advantage of the dependency parse and how the disorders and similar entities are connected to the rest of the sentence to determine whether it's negated, uncertain and so on.
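Here's a rough sketch of what that could look like in spaCy (assuming the v3 API; the tiny dictionary and the negation rules are just illustrative, not a full negation-scope algorithm):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Tiny illustrative dictionary of disorder terms – in practice this would
# come from your terminology list and/or a statistical NER model
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DISORDER", [nlp("infarct"), nlp("ventricular abnormalities")])

def diagnosis_status(span):
    # Look at the syntactic context of the matched disorder: a negating
    # determiner attached to it, or a "neg" dependency on one of its heads,
    # is a strong signal that the diagnosis is negated.
    root = span.root
    if any(child.dep_ in ("det", "neg") and child.lower_ in ("no", "not")
           for child in root.children):
        return "negated"
    if any(c.dep_ == "neg" for ancestor in root.ancestors for c in ancestor.children):
        return "negated"
    return "confirmed"

doc = nlp("The patient has infarct but no ventricular abnormalities")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, "->", diagnosis_status(span))
```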
It might sound trivial, but depending on your data, you might be able to resolve a large number of "negated diagnosis" relationships by simply looking at the previous token and whether it's "no". You might also be able to determine other grammatical constructs that are very strong indicators of certain types of diagnoses. Even out-of-the-box, you should get a pretty decent accuracy for part-of-speech tags and syntactic dependencies – and if not, you can use Prodigy to tweak the model on your data and improve the results of your rules. You could also combine this with your existing text classifier – maybe it turns out that it's pretty good for some cases, so you can rely on it for the very high confidence predictions and handle the rest with your rule-based system.
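Just to make the previous-token idea concrete, here's a minimal sketch (it only covers the simplest case, where the negation cue is directly adjacent to the disorder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The patient has infarct but no ventricular abnormalities")

def negated_by_previous_token(span):
    # Simple heuristic: treat the diagnosis as negated if the token
    # right before the disorder span is "no".
    return span.start > 0 and span.doc[span.start - 1].lower_ == "no"

# Assuming the disorder spans come out of your NER / dictionary step;
# hard-coded here just for the example:
for span in (doc[3:4], doc[6:8]):   # "infarct", "ventricular abnormalities"
    print(span.text, "->", negated_by_previous_token(span))
```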
See here for more details on how to extract dependency relationships with spaCy, and this page for how to write token-based match patterns to extract specific information. There are also some other threads on this forum that discuss similar problems and approaches.
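For instance, a few of the negation cues from the first post written as token-based match patterns (a small sketch, assuming spaCy v3's Matcher API):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Token-based patterns for a couple of the negation cues mentioned above
matcher.add("NEGATION_CUE", [
    [{"LOWER": "no"}, {"LOWER": "sign"}, {"LOWER": "of"}],
    [{"LOWER": "denies"}],
    [{"LOWER": "does"}, {"LOWER": "not"}, {"LOWER": "contain"}],
])

doc = nlp("Patient denies chest pain. No sign of infarct.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```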
Thank you for your response!
I will give the dependency parser a try (actually, I already tried the demo on the Explosion website with some examples, and I think it could work).