Train Snomed medical concepts


I’m using spaCy with Prodigy to extract relevant information from medical notes. So far, I have created a JSONL file with the ~1.5 million Snomed concepts in Spanish. The file has ~1.5M lines and looks like this:

{"label":producto","pattern":[{"lower":"panecillo sin gluten (producto)"}]}
{"label":producto","pattern":[{"lower":"furosemida, 10 mg/ml, solucion inyectable, ampolla de 5 ml (producto)"}]}
{"label":sustancia","pattern":[{"lower":"anticuerpo de grupo sanguineo Westerlund (sustancia)"}]}
{"label":organismo","pattern":[{"lower":"larva del genero Schistosoma (organismo)"}]}

There are 75 different labels in the file. Each one is a different entity type in the NLP process.

In order to start training the system, what are the next steps? Is it possible to train the 75 entities using just this JSON file?

I already have clinical health records on which I could train the system. Should I use them to train the medical entities?

I would really appreciate a step by step answer, as I am no spaCy / Prodigy expert.

Thanks a lot and best regards,

Javier Movilla

Hi! There are a few things here – first, there's a small problem with your patterns:

{"label":producto","pattern":[{"lower":"panecillo sin gluten (producto)"}]}

There's a " missing before producto, which would make the line invalid JSON. Also, each dict in the pattern is supposed to describe one token. So "panecillo sin gluten (producto)" would try to match a token whose exact value is that string, which would likely never be true. That phrase would probably be split into several tokens. You can find more examples and background on token-based patterns here.
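To make the one-dict-per-token rule concrete, here's a small sketch (not from the original thread, using the current spaCy v3 API): the phrase is split into six tokens, parentheses included, so a matching pattern needs six dicts.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("es")

# "panecillo sin gluten (producto)" is split into six tokens,
# with the parentheses as separate tokens:
doc = nlp("panecillo sin gluten (producto)")
print([t.text for t in doc])
# ['panecillo', 'sin', 'gluten', '(', 'producto', ')']

# A single {"LOWER": "panecillo sin gluten (producto)"} dict can never
# match; a token-based pattern needs one dict per token:
matcher = Matcher(nlp.vocab)
matcher.add("PRODUCTO", [[
    {"LOWER": "panecillo"}, {"LOWER": "sin"}, {"LOWER": "gluten"},
    {"ORTH": "("}, {"LOWER": "producto"}, {"ORTH": ")"},
]])
print(len(matcher(doc)))  # 1 match
```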

Another thing to consider: The phrases you have in your patterns are incredibly specific. I'm not sure how helpful they'd be if you're using them directly to find examples and candidates in your data. "furosemida, 10 mg/ml, solucion inyectable, ampolla de 5 ml (producto)" will find you this exact phrase. I'm no domain expert, so I don't know – but how common is this exact phrase really going to be in your data?

The idea of patterns is to help you find relevant examples for annotation. Something like [{"is_digit": True}, {"lower": "ml"}] for instance could be a useful pattern to find quantities like "10 ml" or "5ML". "Schistosoma" could be a pattern, too, if you're looking for medical terminology.
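A minimal sketch of such a generic pattern with the current spaCy v3 Matcher API (the label name and example sentence are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("es")
matcher = Matcher(nlp.vocab)

# A digit token followed by a token whose lowercase form is "ml"
matcher.add("CANTIDAD", [[{"IS_DIGIT": True}, {"LOWER": "ml"}]])

doc = nlp("furosemida, solucion inyectable, ampolla de 5 ml")
print([doc[start:end].text for _, start, end in matcher(doc)])
# ['5 ml']
```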

None of the strings in your patterns are things that would typically be considered "named entities", so if you tried to train a model on those types of phrases in context, you probably wouldn't see very good results.

It might make sense to take a step back here and ask yourself: What exactly are you trying to achieve? What do you want your system to produce? Do you want to extract medical terminology? Do you want to map incoming texts to unique Snomed identifiers?

Training a statistical model can be useful if you want to be able to generalise based on examples of mentions in context. For example, if you wanted to train a system to recognise drug names, you could show it lots of examples of drug names mentioned in different contexts. Highlighting those drug names in your data is tedious, so if you already have a dictionary of drug names, you can convert it to patterns and use them to pre-select those names so you don't have to do it all by hand. At the end of it, you'll have a large dataset of texts and the entity spans they contain.

If you have large dictionaries of terms and you want to extract them from text, you might find that a rule-based approach actually works much better. It's more predictable and can achieve very comparable accuracy. You might want to check out spaCy's new EntityRuler for this.

Especially if you're new to NLP, starting with a rule-based approach might really be a good idea. It'll give you the quickest results and you'll be able to get a good feeling for your data and for what's easy vs. what's difficult. Once you have a set of rules in place that works, you can always experiment with a statistical model later on, and use your existing rules to bootstrap it.
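The rule-based approach described above can be sketched like this with the current spaCy v3 EntityRuler API (the patterns and sentence are illustrative; at the time of this thread, the v2 `EntityRuler` class was used instead):

```python
import spacy

nlp = spacy.blank("es")
ruler = nlp.add_pipe("entity_ruler")

# Multi-word terms need one dict per token
ruler.add_patterns([
    {"label": "sustancia", "pattern": [{"LOWER": "ibuprofeno"}]},
    {"label": "sustancia", "pattern": [
        {"LOWER": "tetrahidruro"}, {"LOWER": "de"}, {"LOWER": "germanio"},
    ]},
])

doc = nlp("el medico me receto ibuprofeno y tetrahidruro de germanio")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('ibuprofeno', 'sustancia'), ('tetrahidruro de germanio', 'sustancia')]
```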

Thanks Ines for your fast answer. I started with the EntityRuler approach and yes, I think this is the right way to start.

My problem now is the result of the nlp process, as it is not what I expect. The detection of single-word patterns works fine. The problem is the detection of multi-word patterns: it doesn't work. I wrote this script:

import srsly
from spacy.lang.es import Spanish
from spacy.pipeline import EntityRuler
from datetime import datetime

nlp = Spanish()
ruler = EntityRuler(nlp)

print(str(datetime.now()) + " - Starting Snomed patterns import")
# load the JSONL patterns file (path illustrative)
patterns = srsly.read_jsonl("snomed_patterns.jsonl")
ruler.add_patterns(patterns)
print(str(datetime.now()) + " - Snomed patterns imported")

print(str(datetime.now()) + " - " + str(len(ruler.labels)) + " labels were imported")
print(str(datetime.now()) + " - " + str(len(ruler.patterns)) + " patterns were imported")

print(str(datetime.now()) + " - Adding Snomed entity ruler to pipe")
nlp.add_pipe(ruler)
print(str(datetime.now()) + " - Snomed entity ruler added to pipe")

sentence = "El medico me receto ibuprofeno, paracetamol, tetrahidruro de germanio y trastuzumab emtansina"
doc = nlp(sentence.lower())

print("Sentence: " + sentence)

print("Entities:")
print([(ent.text, ent.label_) for ent in doc.ents])

The entities detected by the nlp process are:

2019-03-28 17:00:00.089556 - Starting Snomed patterns import
2019-03-28 17:00:03.144011 - Snomed patterns imported
2019-03-28 17:00:03.144068 - 66 labels were imported
2019-03-28 17:00:03.144115 - 243723 patterns were imported
2019-03-28 17:00:03.441433 - Adding Snomed entity ruler to pipe
2019-03-28 17:00:03.441487 - Snomed entity ruler added to pipe
Sentence: El medico me receto ibuprofeno, paracetamol, tetrahidruro de germanio y trastuzumab emtansina
[('medico', 'ocupacion'), ('ibuprofeno', 'sustancia'), ('germanio', 'sustancia'), ('y', 'estadificacion tumoral')]

“medico”, “ibuprofeno” and “y” are right. It doesn’t detect “tetrahidruro de germanio” (detects the isolated word “germanio” as it exists in the patterns) nor “trastuzumab emtansina”. Both “tetrahidruro de germanio” and “trastuzumab emtansina” exist in the patterns file. I don’t understand why they are not detected. What am I doing wrong?

Thanks a lot and BR,

Javier Movilla

Could you share the patterns for these particular examples? If multi-word patterns don't match, the most common explanation is that they actually don't describe separate tokens. For example, if your pattern looks like this:

[{"lower": "trastuzumab emtansina"}]  # this is wrong and won't match

This pattern will never match, because you're telling spaCy to look for one token whose lowercase text equals "trastuzumab emtansina". This will never be true, because spaCy splits "trastuzumab emtansina" into two tokens. So instead, the pattern would have to look like this:

[{"lower": "trastuzumab"}, {"lower": "emtansina"}]

This describes two tokens – "trastuzumab" and "emtansina". If you're unsure about how spaCy will tokenize a string, you can always check by processing it and printing [token.text for token in doc].
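For example, a quick check with a blank Spanish pipeline:

```python
import spacy

nlp = spacy.blank("es")

# Print how spaCy tokenizes the phrase
doc = nlp("trastuzumab emtansina")
print([token.text for token in doc])
# ['trastuzumab', 'emtansina']
```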


Sorry for replying so late. I followed your instructions and everything went fine. Now active ingredients are detected without problems.

My challenge now is to preprocess the texts to handle transcription errors. I'll give FuzzyWuzzy a try again, let's see how it goes.
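That kind of fuzzy lookup can be roughly sketched with Python's stdlib difflib standing in for FuzzyWuzzy (the term list and typo below are made up):

```python
import difflib

# Hypothetical dictionary of known terms
terms = ["ibuprofeno", "paracetamol", "trastuzumab emtansina"]

# A transcription error: "ibuprofino" instead of "ibuprofeno".
# cutoff=0.8 keeps only candidates with a similarity ratio >= 0.8.
matches = difflib.get_close_matches("ibuprofino", terms, n=1, cutoff=0.8)
print(matches)
# ['ibuprofeno']
```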

Thanks a lot and best regards,

Javier Movilla
