Prodigy + spaCy for negation extraction and linking related entities

First, many thanks for your hard work on Prodigy and spaCy; they are absolutely amazing products for rapid text mining and NLP modelling!

I have two questions:

I am wondering whether you can advise on a more efficient implementation with Prodigy and spaCy for identifying negations in text. More specifically, I work on medical text mining and am interested in information extraction (drug names, doses, diagnoses, etc.). I have a large list (a vocabulary) of drug names and use simple rules (with regex) to extract drugs from patients' notes as prescribed drugs. This works well so far. However, some drugs are mentioned in the texts as not tolerated and thus should not be extracted. For example, consider this short text snippet:

text_0 = 'John has developed side effects while using trazodone 75 mg, therefore I will prescribe a combination of Citalopram 50mg and 25 mg of Fluoxetine'
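For context, here is a minimal sketch of the kind of vocabulary lookup I mean, using spaCy's PhraseMatcher instead of raw regex (the drug list is a tiny stand-in for the real vocabulary):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Tiny stand-in for the real drug vocabulary
drug_names = ["trazodone", "citalopram", "fluoxetine"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("DRUG", [nlp.make_doc(name) for name in drug_names])

doc = nlp(text_0)
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # trazodone, Citalopram, Fluoxetine
```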

Question 1: I need to extract only the actually prescribed drugs -> {'Citalopram', 'Fluoxetine'} and to label 'Trazodone' as 'problematic'. Is it possible to annotate these drugs directly with Prodigy and train a model ('prescribed' and 'problematic' tags)?

Question 2: I also need to extract the doses of the drugs; however, the pattern is not standardised. A dose may appear right after the drug name, or it may be mentioned anywhere near it.

While I'm very excited about NER with Prodigy, I've been thinking about tagging drug names and their doses. It is unclear to me, however, how to specify a link between doses and drug names (for example, that 25 mg corresponds to Fluoxetine and 50 mg to Citalopram, and not vice versa).

Thanks in advance!


Hi Andrey,

I think the dependency parser will probably be useful for your task. Have a look at the analysis produced for the example you gave: https://explosion.ai/demos/displacy?text=John%20has%20developed%20sides%20effects%20while%20using%20trazodone%2075%20mg%2C%20therefore%20I%20will%20prescribe%20a%20combination%20of%20Citalopram%2050mg%20and%2025%20mg%20of%20Fluoxetine&model=en_core_web_sm&cpu=0&cph=0

The parse has some errors: in particular, the dosage "75 mg" is attached incorrectly. The parser has attached it to "using", when the correct attachment is to "trazodone". These errors make extracting the correct relations a little more complicated, as the rules become a bit more hacky. It's still useful, though, and generally better than working from the raw text. You can also try to improve the parser on your domain using the dep.teach recipe, although this is still a bit experimental.
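If it helps, you can inspect the same parse locally with plain spaCy (no Prodigy involved) to see exactly where the attachments go wrong:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John has developed side effects while using trazodone 75 mg, "
          "therefore I will prescribe a combination of Citalopram 50mg "
          "and 25 mg of Fluoxetine")

# Print each token with its dependency label and its head
for token in doc:
    print(f"{token.text:<12} {token.dep_:<10} {token.head.text}")
```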

Often a good solution is to develop a list of template rules that activate a relation under some construction. For instance, you might have a rule for trigger verbs like "prescribe" that says: "If there are any DRUG entities in a dobj relation to this verb, add them as actual drugs". Another rule could negate trigger verbs. You would attach it to a word like "not", and it would say: "If this word attaches to a trigger verb, block extraction of its arguments". You'd then have another template for noun phrase cases, like "prescription of Trazodone".
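As a rough illustration (not a fixed API, and assuming DRUG entities have already been set on doc.ents by your matcher), a trigger-verb template with a negation check might look something like this:

```python
TRIGGER_VERBS = {"prescribe"}

def prescribed_drugs(doc):
    """Collect DRUG entities governed by a non-negated trigger verb."""
    results = []
    for ent in doc.ents:
        if ent.label_ != "DRUG":
            continue
        # Walk up the dependency tree from the entity's root,
        # looking for a trigger verb like "prescribe"
        head = ent.root.head
        while head.head is not head and head.lemma_ not in TRIGGER_VERBS:
            head = head.head
        if head.lemma_ in TRIGGER_VERBS:
            # Negation template: a "neg" child like "not" blocks extraction
            if not any(child.dep_ == "neg" for child in head.children):
                results.append(ent)
    return results
```

Walking up the tree, rather than requiring a direct dobj attachment, handles cases like "a combination of Citalopram", where the drug is buried inside the verb's object.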

There will probably only be a few of these templates, each triggered by only a handful of words. If a template is too ambiguous, you can learn whether it should apply in a given context. For instance, let's say you have an example like "Trazodone 75mg. Symptoms persist.". It's hard to get this right with rules, so you might want an additional model that learns a tag like PRESCRIBED here, trained only on DRUG examples.
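If you go that route, one option would be Prodigy's textcat.teach recipe to collect the PRESCRIBED annotations over your matched examples; the dataset and input file names here are just placeholders:

```
prodigy textcat.teach drug_status en_core_web_sm notes.jsonl --label PRESCRIBED
```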

The general strategy is to use the machine learning to add annotations that give you better properties to write rules against. There's a balance between how easy the rules are to write and how easy the annotations are to learn. If you have a simple rule that depends on long-range relations, that's probably difficult for a model to learn, while the rule itself is easy to write. On the other hand, statistical models are good at accumulating lots of small pieces of evidence, which are difficult to combine into effective rules.

To do this effectively, I would definitely recommend making an evaluation set you can test the whole process against. The ner.manual recipe is probably best for this. For relations, one notation you could use is to annotate the whole relation, starting from the beginning of the first entity and ending at the last word of the second entity. You would then make a quick second pass to annotate the DRUG and DOSE spans within this parent span.
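Concretely, that could be two ner.manual passes; the label and dataset names below are just examples:

```
# First pass: mark the whole relation span
prodigy ner.manual drug_rel_eval en_core_web_sm notes.jsonl --label DRUG_DOSE

# Second pass: mark the entity spans within those relations
prodigy ner.manual drug_ent_eval en_core_web_sm notes.jsonl --label "DRUG,DOSE"
```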

Hi Matt,

Brilliant, a massive thank you for your detailed answer and the ideas! I am reading the Prodigy manual and experimenting with various recipes. I've been thinking about the dependency parser but wasn't sure how best to apply it in this context. I will experiment with your suggestions.