Nested labels for NER

Is it possible to create a nested label structure for NER? For example, I have a new entity - DRUG, but also there are several subtypes: ‘antidepressants’, ‘sedative’, ‘cardiac’ etc. Something like:

drug_list.jsonl ->

{"label":"DRUG","subtype":"antidepressant","pattern":[{"lower":"citalopram"}]}
{"label":"DRUG","subtype":"sedative","pattern":[{"lower":"clonazepam"}]}

Or is there any other way to get access to entities and their subcategories?

Thanks.

I would do this as a multiple-pass annotation procedure. First label the top-most category, DRUG, and then create a recipe that enqueues the examples of DRUG you’ve annotated, for annotation into the subcategories.

When you do the second pass, you might want to group up the examples by type. If the token “citalopram” is a DRUG in a particular context, it’s probably always going to have the same subtype. So you can save yourself a lot of work by making that decision once, rather than for every occurrence of the phrase.
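A minimal sketch of that grouping step, assuming the first pass produced character spans for each accepted DRUG mention. The example texts and the (text, start, end) layout are made up for illustration and are not Prodigy's actual task format:

```python
from collections import defaultdict

# Hypothetical first-pass output: (text, start, end) character spans
# that were accepted as DRUG. Not Prodigy's real task format.
drug_annotations = [
    ("She was started on Citalopram last year.", 19, 29),
    ("Clonazepam was added for sleep.", 0, 10),
    ("He stopped citalopram after two weeks.", 11, 21),
]

def group_by_surface_form(annotations):
    """Collect the contexts of every DRUG mention under its lowercased text."""
    groups = defaultdict(list)
    for text, start, end in annotations:
        groups[text[start:end].lower()].append(text)
    return dict(groups)

groups = group_by_surface_form(drug_annotations)

# Most frequent forms first, so one subtype decision covers the most occurrences.
queue = sorted(groups, key=lambda form: len(groups[form]), reverse=True)
```

Each entry in queue would then be shown once for a subtype decision, with its collected contexts available for a sanity check.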

If you can satisfy your objectives by having subtype schemes that are unambiguous, that will make both the annotation and the machine learning much easier: you can deal with ambiguity once, at the top-most category that the NER model deals with. Then you have manually vetted dictionaries that map common entities to your subtypes.

Finally, you might use the word vectors to resolve any entities the model has recognised that aren’t in your dictionaries. You would have a prototype vector for “antidepressant” made by averaging the vectors of your antidepressant terms, and another prototype vector for “sedative”. Creating the prototype is as simple as making a Doc object with all the terms in that category. So you would have something like antidepressants = Doc(nlp.vocab, words=['citalopram', 'lexapro', ...]), and then you’d ask antidepressants.similarity(new_entity).
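To make the prototype idea concrete without depending on a particular model, here is a small sketch in plain Python. The three-dimensional vectors are invented for illustration; real ones would come from the model's word-vector table. The helper names (average, cosine, closest_subtype) are hypothetical. It averages the vectors per category and compares by cosine similarity, which is what Doc.similarity does with averaged word vectors:

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "citalopram": [0.9, 0.1, 0.0],
    "lexapro": [0.8, 0.2, 0.1],
    "clonazepam": [0.1, 0.9, 0.2],
    "diazepam": [0.2, 0.8, 0.1],
    "sertraline": [0.85, 0.15, 0.05],
}

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# One prototype per subtype, built from the vetted dictionary terms.
prototypes = {
    "antidepressant": average([TOY_VECTORS[t] for t in ("citalopram", "lexapro")]),
    "sedative": average([TOY_VECTORS[t] for t in ("clonazepam", "diazepam")]),
}

def closest_subtype(term):
    """Return the subtype whose prototype is most similar to the term's vector."""
    vec = TOY_VECTORS[term]
    return max(prototypes, key=lambda name: cosine(prototypes[name], vec))
```

With these toy numbers, closest_subtype("sertraline") picks "antidepressant". With real vectors you would probably also want a similarity threshold below which the entity is left unassigned.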

Thanks Matt, I will try this approach.

Andrey,

I have a similar case to yours. However, some (many) drugs are used for different purposes, especially neuro-active drugs. How did you solve that? Did you declare a new type like “neuro-active”?

Thanks

One thing you could do is add a second pipeline component after the entity recognizer that looks for DRUG entities and then sets a custom attribute on those entities specifying a list of subtypes (e.g. ent._.entity_subtypes).

Here’s an example of a rule-based solution – but you could obviously also swap out the dictionary for a statistical solution, or a combination.

import spacy
from spacy.tokens import Span

# dictionary of lowercase entities mapped to subtypes
DRUG_SUBTYPES = {
    'citalopram': ['ANTIDEPRESSANT', 'SOMETHING_ELSE'],
    'lexapro': ['ANTIDEPRESSANT'],
    # etc.
}

# register global span._.entity_subtypes extension
Span.set_extension('entity_subtypes', default=None)

def assign_subtypes(doc):
    # this function will be added after the NER in the pipeline
    for ent in doc.ents:
        if ent.label_ == 'DRUG':
            # look up lowercased entity text and set custom attribute
            ent._.entity_subtypes = DRUG_SUBTYPES.get(ent.text.lower())
    return doc

You could then use the component like this:

nlp = spacy.load('/path/to/your/drugs/model')
nlp.add_pipe(assign_subtypes, after='ner')
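If the subtypes already live in a patterns file like drug_list.jsonl from the first post, the DRUG_SUBTYPES dictionary could be derived from it rather than typed by hand. This is only a sketch: load_subtypes is a hypothetical helper, the "subtype" key is the first post's own addition to the usual label/pattern keys, and the code assumes every token pattern has a "lower" key.

```python
import json
from collections import defaultdict

# Two lines in the format proposed in the original question.
JSONL = '''{"label": "DRUG", "subtype": "antidepressant", "pattern": [{"lower": "citalopram"}]}
{"label": "DRUG", "subtype": "sedative", "pattern": [{"lower": "clonazepam"}]}'''

def load_subtypes(lines):
    """Map each lowercase pattern text to the list of subtypes seen for it."""
    subtypes = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        # Join the token patterns into one lowercase lookup key
        # (assumes every token dict has a "lower" key).
        key = " ".join(token["lower"] for token in entry["pattern"])
        subtype = entry["subtype"].upper()
        if subtype not in subtypes[key]:
            subtypes[key].append(subtype)
    return dict(subtypes)

DRUG_SUBTYPES = load_subtypes(JSONL.splitlines())
```

A drug listed on several lines with different subtypes ends up with all of them in its list, which fits the case where one drug serves more than one purpose.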

Thanks Ines,

I’ll start working on it. But in the end I won’t get around a lot of typing to get all (or most) drugs into the model. And the same will hold for all diseases :frowning:

Andreas

Hi Andreas,

I haven’t solved it completely yet (I had a couple of other challenges on top of that), but I found a similar thread and have been thinking along the lines Ines presented here (setting an attribute on DRUG ents).

BTW, it is possible to create a large gazetteer for all drugs, though it is a bit time-consuming. For example, you can use https://www.drugbank.ca/ to download a comprehensive database (for free) on drugs, including types, generic/compound/trade names, etc. Depending on your country, you should also look at your local pharma sources, as trade names and dosages may be different. ChEMBL (https://www.ebi.ac.uk/chembl/) is also very useful.

In my experience, combining a new statistical model for NER (based on Prodigy) with PhraseMatcher works quite well. It does take time (a lot!) to create a vocabulary of terms, but it can pay off considerably.

Perhaps we can share our solutions (if that is acceptable, of course, given your project etc.).

Best wishes,
Andrey

Hi Andrey,

Sorry for the slow reaction. I’ve worked on scraping medical info for consumers from a variety of sites for the training step in Prodigy (medline, nhlbi etc. are great!) and I’m now back at the NER seeding problem.
I’ve given the principle some thought, but a (drug, drug-subclass) or (drug, purpose) classification probably limits the search space. The idea of word2vec and similar methods is that the meaning is determined by the context, and the subclass or purpose assignment of the drug should follow from the context.
My use case also requires a “disease” category, which is basically the same as the “purpose” concept (disease = purpose = medical_condition). My idea was to take the labels MEDICAL_CONDITION and DRUG as the gold standard and learn which drug is used to treat which condition. So you have the gold standard (label:DRUG, pattern:citalopram) and (label:MEDICAL_CONDITION, pattern:depression), or, in more detail, (label:MEDICAL_CONDITION, sub_label:NEUROLOGY_PSYCHIATRY, pattern:depression). The system would learn that citalopram (label:DRUG) is linked to “depression”, which is labeled as (label:MEDICAL_CONDITION) or (label:MEDICAL_CONDITION, sub_label:NEUROLOGY_PSYCHIATRY).

But maybe the solution is very use-case-specific, and you need a hierarchical architecture. Feedback welcome.
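As a rough illustration of the linking idea, the simplest baseline would be to count how often a DRUG entity and a MEDICAL_CONDITION entity appear in the same sentence. This swaps in plain co-occurrence counting for any learned model, and co-occurrence alone does not establish that a drug treats a condition. The sentence data and the count_links helper are hypothetical:

```python
from collections import Counter

# Hypothetical NER output: one list of (text, label) pairs per sentence.
sentences = [
    [("citalopram", "DRUG"), ("depression", "MEDICAL_CONDITION")],
    [("clonazepam", "DRUG"), ("panic disorder", "MEDICAL_CONDITION")],
    [("citalopram", "DRUG"), ("depression", "MEDICAL_CONDITION"),
     ("insomnia", "MEDICAL_CONDITION")],
]

def count_links(sentences):
    """Count how often each DRUG shares a sentence with each MEDICAL_CONDITION."""
    links = Counter()
    for ents in sentences:
        drugs = {text for text, label in ents if label == "DRUG"}
        conditions = {text for text, label in ents if label == "MEDICAL_CONDITION"}
        for drug in drugs:
            for condition in conditions:
                links[(drug, condition)] += 1
    return links

links = count_links(sentences)
```

Pairs with high counts would be candidates for manual review, not confirmed drug-to-condition links.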

Andreas

(Now I’m still figuring out how to get that into a nice NER table :wink:)

Hi Ines,

I am new to spaCy. I am training NER on my data and I am using the code that you mentioned for assigning a super_category to the entities returned by my model. However, it returns the error below while loading the model:

KeyError: "[E002] Can't find factory for 'assign_category'. This usually happens when spaCy calls nlp.create_pipe with a component name that's not built in"

Below is my code. Could you please help me if I am missing anything here?

category_type = {
    'PERSON': ['OT'],
    'ORG': ['OT'],
    'QUANTITY': ['NUMBER'],
    'AMOUNT': ['NUMBER']
}

def set_extension():
    # register global span._.entity_category extension
    Span.set_extension('entity_category', default=None)

def assign_category(doc):
    # map each entity's label to its super-category
    for ent in doc.ents:
        # dict.get never raises, so use its default for unknown labels
        ent._.entity_category = category_type.get(ent.label_, '')
    return doc

In another function, where I am training my model:

with nlp.disable_pipes(*other_pipes):  # only train NER
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.entity.create_optimizer()
    for itn in range(10):
        print("Starting iteration " + str(itn))
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            nlp.update(
                [text],  
                [annotations],  
                drop=0.25,  
                sgd=optimizer, 
                losses=losses)
        print(losses)


set_extension()
nlp.add_pipe(assign_category,after='ner')

Check out the following section in the spaCy docs about pipeline components and factories: https://spacy.io/usage/processing-pipelines#custom-components-factories

The problem here is that when you save out the model, spaCy will save out the name of your pipeline component, assign_category. But when you load the model back in, it doesn't know how to initialize that component.
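The mechanism can be illustrated with a toy registry in plain Python. This is a simplified stand-in, not spaCy's actual loading code; FACTORIES and load_pipeline are hypothetical names. The saved model only stores component names, and loading looks each name up in a table of factories:

```python
# Toy stand-in for a name -> factory registry; not spaCy's actual code.
FACTORIES = {"ner": lambda: (lambda doc: doc)}

def assign_category(doc):
    # Placeholder for the custom component from the question.
    return doc

def load_pipeline(component_names):
    """Rebuild a pipeline from the component names stored with a saved model."""
    pipeline = []
    for name in component_names:
        if name not in FACTORIES:
            raise KeyError("Can't find factory for %r" % name)
        pipeline.append((name, FACTORIES[name]()))
    return pipeline

saved_names = ["ner", "assign_category"]  # only the names are written to disk

try:
    load_pipeline(saved_names)
    failed = False
except KeyError:
    failed = True  # the custom name is unknown at load time

# Registering a factory under the saved name before loading fixes it.
FACTORIES["assign_category"] = lambda: assign_category
pipeline = load_pipeline(saved_names)
```

In spaCy v2, the corresponding step is, as far as I understand the linked docs, adding an entry for 'assign_category' to Language.factories before calling spacy.load. Another option is to remove the component before saving the model and add it again with nlp.add_pipe after loading.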

Thanks Ines.