Extracting pharmaceutical active ingredients from clinical notes

I’m getting started with spaCy + Prodigy and natural language processing. For now I need to solve what should be a very easy task but, to be honest, it is taking too much time. Here is the situation:

  • I have a list of ~3000 pharmaceutical active ingredients.
  • I have a lot of clinical notes from several hospitals.
  • I must build a report of the pharmaceutical active ingredients included in the clinical notes.

At the moment, I’m trying to create a new entity type, “Pharmaceutical Active Ingredient”, and train spaCy to learn all of them. But I’m not sure this is the right way: what I need to detect is the exact name of each active ingredient, so a matching process might be the better approach.

On the other hand, the clinical note texts are the output of an OCR process over real scanned clinical notes, so the NLP process must tolerate, for example, misrecognized characters in the name of an active ingredient.

I bought a Prodigy license because I thought this software was the right way to train spaCy to detect the active ingredients, but now I’m a bit lost.

I would really appreciate your help with this issue.

Thanks in advance and best regards,

Javier Movilla


I think the right way to proceed depends on whether this list of 3000 compounds is exhaustive or not. If it’s an exhaustive list of the compounds you’re looking for, you may be better off doing a simple text matching process rather than building an NER model. (You’d want to incorporate something like an edit-distance score to handle OCR errors, or train your own embeddings if you have enough text.)
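To make the edit-distance idea concrete, here is a minimal sketch using only Python’s standard library. It uses difflib’s similarity ratio as a stand-in for a true edit distance; the function name fuzzy_find, the 0.85 cutoff, and the sample text are all made up for illustration, and it only handles single-word ingredient names (multi-word names would need n-gram windows).

```python
import difflib
import re


def fuzzy_find(text, ingredients, cutoff=0.85):
    """Return the known ingredients that approximately match a token in `text`.

    Hypothetical helper: uses difflib's similarity ratio (not a true
    Levenshtein distance) and only compares single tokens.
    """
    # Map lowercase names back to their canonical spelling.
    vocab = {name.lower(): name for name in ingredients}
    candidates = list(vocab)
    found = set()
    for token in re.findall(r"[A-Za-z0-9]+", text.lower()):
        # Best candidate whose similarity ratio is at least `cutoff`, if any.
        close = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        if close:
            found.add(vocab[close[0]])
    return sorted(found)


# Illustrative OCR-damaged text ("e" read as "c", "o" read as "0").
known = ["ibuprofen", "paracetamol", "amoxicillin"]
note = "Patient took ibuprofcn 400mg and paracetam0l."
print(fuzzy_find(note, known))
```

The cutoff is the knob to tune against an evaluation set: lower values tolerate more OCR damage but start to match unrelated words.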

If, on the other hand, this is not an exhaustive list of compounds, then you probably do want to train an NER model. The example video on recognizing drugs on Reddit would be almost exactly your situation.

Thanks for your reply, Andy. We can say the compound list is exhaustive. I tried using FuzzyWuzzy for the text matching, but the results were not what I expected. It’s a pity, because FuzzyWuzzy is supposed to do exactly what I need.

Do you know if there is another library or approach I can use that fits this task?

On the other hand, if I create a new entity for the compounds and train the system with the compound list, will spaCy find exactly these compounds in a text? I mean, if the text contains “ibuprofen dihydrate”, will the NLP process return only “ibuprofen dihydrate”, or will it return both “ibuprofen dihydrate” and “ibuprofen” (which is also a valid compound)?

Thanks and BR

Javier Movilla

What went wrong when you used FuzzyWuzzy? I was thinking the same thing, that this would be a good solution.

I think you should try using spaCy’s rule-based components, specifically the Matcher and PhraseMatcher. Have a look at the docs here: https://spacy.io/usage/rule-based-matching

Ignoring the OCR issues, the Matcher should get you exactly what you’re looking for. Once you’ve got that working, you could then try to work around the OCR. You might be able to find an additional library that can make useful corrections. Another option is to train a sentence classifier with a HAS_OCR_ERROR label. You could then apply special processing only to sentences that have OCR errors, for instance using the FuzzyWuzzy library, or using a looser set of match patterns. You could also try training an NER model to detect exactly which words are OCR errors. That would make your rules easier to write, but may take too long to annotate. I’m not sure.
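As a rough sketch of the exact-matching step, assuming spaCy v3 and a blank English pipeline (no trained model needed): the compound list, the INGREDIENT label, and the find_ingredients helper below are illustrative names, not part of any existing project. The PhraseMatcher on its own reports both “ibuprofen” and “ibuprofen dihydrate” when they overlap; spacy.util.filter_spans is one way to keep only the longest match.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, which is all the PhraseMatcher needs.
nlp = spacy.blank("en")

# attr="LOWER" makes matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Illustrative stand-in for the real list of ~3000 ingredients.
compounds = ["ibuprofen", "ibuprofen dihydrate", "paracetamol"]
matcher.add("INGREDIENT", [nlp.make_doc(name) for name in compounds])


def find_ingredients(text):
    """Return matched ingredient strings, preferring the longest overlap."""
    doc = nlp.make_doc(text)
    spans = matcher(doc, as_spans=True)
    # filter_spans drops overlapping spans, keeping the longest one.
    return [span.text for span in filter_spans(spans)]


print(find_ingredients("Patient was given Ibuprofen dihydrate and paracetamol."))
```

If you do want the shorter compound reported as well, skip the filter_spans call and use the raw matches.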

Either way, I’d say the first step is to get yourself a first draft system and set up an evaluation on part of your data. Then you can run the evaluation as you change things, and see how your score changes.

An evaluation set is also crucial in explaining your decisions and presenting trade-offs when you present your results. For instance, you might find that your first draft system that does the simplest thing only got 30% of the entities. Then you did something a bit more clever with some other library, and got to 85% accuracy. Then you trained some model and managed to resolve the majority of the remaining cases. It’s always good to be able to say you took accuracy from X% to Y%, and here’s what’s going on with the remaining errors, and why they’ll probably be difficult to fix.
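A minimal sketch of what such an evaluation could look like, assuming you hand-label a sample of notes as (note id, ingredient) pairs and compare them with what your system outputs. The evaluate function and all of the data below are invented purely for illustration.

```python
def evaluate(gold, predicted):
    """Compute precision, recall and F1 over sets of (note_id, ingredient) pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical hand-labelled annotations for two notes.
gold = {
    ("note1", "ibuprofen"),
    ("note1", "paracetamol"),
    ("note2", "amoxicillin"),
    ("note2", "ibuprofen"),
}
# Hypothetical system output: two correct, one spurious, two missed.
predicted = {
    ("note1", "ibuprofen"),
    ("note2", "amoxicillin"),
    ("note2", "aspirin"),
}

precision, recall, f1 = evaluate(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Re-running this after each change (exact matching, fuzzy matching, a trained model) gives you the X% to Y% numbers to report, and the missed pairs in gold - predicted are the remaining errors to inspect.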

Hi Matthew,

Thanks, I’ll follow your suggestion.


Javier Movilla