Is it possible to create a nested label structure for NER? For example, I have a new entity, DRUG, but there are also several subtypes: ‘antidepressant’, ‘sedative’, ‘cardiac’, etc. Something like:
I would do this as a multiple-pass annotation procedure. First label the top-most category, DRUG, and then create a recipe that enqueues the examples of DRUG you’ve annotated, for annotation into the subcategories.
When you do the second pass, you might want to group up the examples by type. If the token “citalopram” is a DRUG in a particular context, it’s probably always going to have the same subtype. So you can save yourself a lot of work by making that decision once, rather than for every occurrence of the phrase.
If you can satisfy your objectives by having subtype schemes that are unambiguous, that will make both the annotation and the machine learning much easier: you can deal with ambiguity once, at the top-most category that the NER model deals with. Then you have manually vetted dictionaries that map common entities to your subtypes.
Finally, you might use the word vectors to resolve any entities the model has recognised that aren’t in your dictionaries. You would have a prototype vector for “antidepressant” made by averaging the vectors of your antidepressant terms, and another prototype vector for “sedative”. Creating the prototype is as simple as making a `Doc` object with all the terms in that category. So you would have something like `antidepressants = Doc(nlp.vocab, words=['citalopram', 'lexapro', ...])`, and then you’d ask `antidepressants.similarity(new_entity)`.
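The prototype trick can be sketched without spaCy: average the vectors of the known terms in a category, then compare a new entity’s vector to each prototype with cosine similarity. The 3-d vectors below are toy stand-ins for real word vectors, and the category names are just illustrative.

```python
from math import sqrt

def mean_vector(vectors):
    # element-wise average of a list of equal-length vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# toy vectors standing in for real word vectors
VECTORS = {
    "citalopram": [0.9, 0.1, 0.0],
    "lexapro":    [0.8, 0.2, 0.0],
    "diazepam":   [0.1, 0.9, 0.1],
}

antidepressant_proto = mean_vector([VECTORS["citalopram"], VECTORS["lexapro"]])
sedative_proto = VECTORS["diazepam"]

new_entity = [0.85, 0.15, 0.0]  # vector of an unseen drug
best = max(
    [("ANTIDEPRESSANT", antidepressant_proto), ("SEDATIVE", sedative_proto)],
    key=lambda kv: cosine(new_entity, kv[1]),
)
# best[0] is "ANTIDEPRESSANT" for this toy data
```

With real vectors you would use spaCy’s `Doc.similarity` as described above, which computes the same averaged-vector comparison for you.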
I have a similar case to yours. However, some (many) drugs are used for different purposes, especially neuro-active drugs. How did you solve that? Did you declare a new type like “neuro-active”?
One thing you could do is add a second pipeline component after the entity recognizer that looks for DRUG entities and then sets a custom attribute on those entities specifying a list of subtypes (e.g. ent._.entity_subtypes).
Here’s an example of a rule-based solution – but you could obviously also swap out the dictionary for a statistical solution, or a combination.
```python
from spacy.tokens import Span

# dictionary of lowercase entities mapped to subtypes
DRUG_SUBTYPES = {
    'citalopram': ['ANTIDEPRESSANT', 'SOMETHING_ELSE'],
    'lexapro': ['ANTIDEPRESSANT'],
    # etc.
}

# register global span._.entity_subtypes extension
Span.set_extension('entity_subtypes', default=None)

def assign_subtypes(doc):
    # this function will be added after the NER in the pipeline
    for ent in doc.ents:
        if ent.label_ == 'DRUG':
            # look up lowercased entity text and set custom attribute
            ent._.entity_subtypes = DRUG_SUBTYPES.get(ent.text.lower())
    return doc
```
I’ll start working on it. But in the end I won’t get around a lot of typing to get all (or most) drugs into the model. And the same will hold for all diseases.
I haven’t solved it completely yet (I had a couple of other challenges on top of that), but I found a similar thread here and followed the approach Ines presented (setting an attribute on DRUG ents).
BTW, it is possible to create a large gazetteer of all drugs, though it is a bit time-consuming. For example (depending on your country, since trade names and dosages in your local pharma market may differ), you can use https://www.drugbank.ca/ to download a comprehensive database (for free) on drugs, including types, generic/compound/trade names, etc. ChEMBL (https://www.ebi.ac.uk/chembl/) is also very useful.
From my experience, combining a new statistical model for NER (based on Prodigy) with the PhraseMatcher works quite well. It does take time (a lot!) to create a vocabulary of terms, but it can pay off significantly.
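As a rough sketch of that combination (assuming spaCy 2.2+; the term list is a hypothetical stand-in for a DrugBank/ChEMBL export), the gazetteer terms can be fed straight into spaCy’s `PhraseMatcher`:

```python
import spacy
from spacy.matcher import PhraseMatcher

# hypothetical gazetteer terms, e.g. exported from DrugBank
DRUG_TERMS = ["citalopram", "lexapro", "metoprolol"]

nlp = spacy.blank("en")  # a blank pipeline is enough for pure matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("DRUG", [nlp.make_doc(term) for term in DRUG_TERMS])

doc = nlp("She was prescribed Citalopram for depression.")
matches = [(nlp.vocab.strings[match_id], doc[start:end].text)
           for match_id, start, end in matcher(doc)]
# matches should contain ("DRUG", "Citalopram")
```

In practice you would run this alongside the statistical NER model, using the matcher for known terms and the model for unseen ones.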
Perhaps we can share our solutions (if it’s acceptable, of course, given your project etc.).
Sorry for the slow reaction. I’ve worked on scraping medical info for consumers from a variety of sites for the training step in Prodigy (Medline, NHLBI etc. are great!) and I’m now back at the NER seeding problem.
I’ve given the principle some thought, but a (drug, drug-subclass) or (drug, purpose) classification probably limits the search space. The idea of word2vec and similar methods is that meaning is determined by context, so the subclass or purpose assignment of the drug should follow from the context.
My use case also requires a “disease” category, basically the same as the “purpose” concept (disease = purpose = medical_condition). I had the idea of taking the labels MEDICAL_CONDITION and DRUG as the gold standard, and learning which drug is used to treat which condition. So you have the gold standard (label: DRUG, pattern: citalopram) and (label: MEDICAL_CONDITION, pattern: depression), or in more detail (label: MEDICAL_CONDITION, sub_label: NEUROLOGY_PSYCHIATRY, pattern: depression). The system would learn that citalopram (label: DRUG) is linked to “depression”, which is labelled (label: MEDICAL_CONDITION) or (label: MEDICAL_CONDITION, sub_label: NEUROLOGY_PSYCHIATRY).
But maybe the solution is very use-case-specific, and you need a hierarchical architecture. Feedback welcome!
Andreas
(Now still figuring out how to get that into a nice NER table.)
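The drug-to-condition linking idea above can be sketched library-free: over gold-annotated examples, count which DRUG surface forms co-occur with which MEDICAL_CONDITION surface forms. The toy examples and labels below are illustrative, not real training data.

```python
from collections import Counter, defaultdict

# toy gold-standard examples: each holds (label, surface form) entity pairs
EXAMPLES = [
    {"entities": [("DRUG", "citalopram"), ("MEDICAL_CONDITION", "depression")]},
    {"entities": [("DRUG", "citalopram"), ("MEDICAL_CONDITION", "anxiety")]},
    {"entities": [("DRUG", "metoprolol"), ("MEDICAL_CONDITION", "hypertension")]},
]

def cooccurrence(examples):
    # counts[drug][condition] = number of examples where both appear
    counts = defaultdict(Counter)
    for ex in examples:
        drugs = [text for label, text in ex["entities"] if label == "DRUG"]
        conditions = [text for label, text in ex["entities"]
                      if label == "MEDICAL_CONDITION"]
        for drug in drugs:
            for condition in conditions:
                counts[drug][condition] += 1
    return counts

counts = cooccurrence(EXAMPLES)
# counts["citalopram"] now holds {"depression": 1, "anxiety": 1}
```

Over a large corpus, the `most_common()` conditions per drug would give a first approximation of the drug-to-purpose links described above.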
I am new to spaCy. I am training NER on my data, and I am using the code you mentioned for assigning a category to the entities returned by my model. However, it raises the error below while loading the model:

```
KeyError: "[E002] Can't find factory for 'assign_category'. This usually happens when spaCy calls nlp.create_pipe with a component name that's not built in"
```

Below is my code. Could you please help me figure out if I am missing anything here?
```python
from spacy.tokens import Span

category_type = {
    'PERSON': ['OT'],
    'ORG': ['OT'],
    'QUANTITY': ['NUMBER'],
    'AMOUNT': ['NUMBER'],
}

def set_extension():
    # register global span._.entity_category extension
    Span.set_extension('entity_category', default=None)

def assign_category(doc):
    # map each entity label to its category
    for ent in doc.ents:
        try:
            # look up entity label and set custom attribute
            ent._.entity_category = category_type.get(ent.label_)
        except:
            ent._.entity_category = ''
    return doc
```
In another function, where I am training my model:

```python
with nlp.disable_pipes(*other_pipes):  # only train NER
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.entity.create_optimizer()
    for itn in range(10):
        print("Starting iteration " + str(itn))
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            nlp.update(
                [text],
                [annotations],
                drop=0.25,
                sgd=optimizer,
                losses=losses)
        print(losses)

set_extension()
nlp.add_pipe(assign_category, after='ner')
```
The problem here is that when you save out the model, spaCy will save out the name of your pipeline component, assign_category. But when you load the model back in, it doesn't know how to initialize that component.
Excuse me, but could you please specify where exactly I need to insert this code? Do I have to create a custom recipe and use it as an argument in the console instead of ner.manual?
Hi! Which code are you referring to, exactly? Most of the snippets in this thread show code that you can use in your application / script on top of an existing model to attach more metadata to the entity predictions.
Hi! Thank you, I think I got the part I was asking about, but now it’s not clear where this subtype should appear.
To clarify, I am writing a custom recipe for price extraction (value and currency separately):
```python
# dictionary of currency codes mapped to their surface forms
CURRENCY_SUBTYPES = {
    'EUR': ['€', 'евр', 'euro'],
    'USD': ['US$', '$', 'dollars'],
}

# register global span._.entity_subtypes extension
Span.set_extension('entity_subtypes', default=None)

def assign_subtypes(doc):
    # this function will be added after the NER in the pipeline
    for ent in doc.ents:
        if ent.label_ == 'CURRENCY':
            # look up entity text and set custom attribute
            ent._.entity_subtypes = next(
                (k for k, v in CURRENCY_SUBTYPES.items() if ent.text in v), None)
            print(ent.text)
    return doc

nlp = spacy.load(spacy_model)
nlp.add_pipe(assign_subtypes, after='ner')
stream = CSV(source)
# Tokenize the incoming examples and add a "tokens" property to each
# example. Also handles pre-defined selected spans. Tokenization allows
# faster highlighting, because the selection can "snap" to token boundaries.
stream = add_tokens(nlp, stream)
return {
    "view_id": "ner_manual",  # Annotation interface to use
    "dataset": dataset,       # Name of dataset to save annotations
    "stream": stream,         # Incoming stream of examples
    "exclude": exclude,       # List of dataset names to exclude
    "config": {               # Additional config settings, mostly for app UI
        "lang": nlp.lang,
        "labels": label,      # Selectable label options
    },
}
```
However, the function assign_subtypes seems to do nothing. At least, nothing changes in the output, and the print call inside the function never shows up.
assign_subtypes adds custom extension attributes to spaCy Span objects when you process a text with the nlp object, and only if the entity recognizer has recognised a span as CURRENCY (assuming you have trained this custom entity label).
This is completely independent of Prodigy and something you'd use in your application later on to classify the entities into subtypes (e.g. after you've labelled and trained your entity recognizer).
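A small side note on the lookup in `assign_subtypes` above: the list comprehension scans the whole `CURRENCY_SUBTYPES` dictionary for every entity. Inverting the mapping once makes each lookup a direct dict access (a sketch using the same data):

```python
CURRENCY_SUBTYPES = {
    'EUR': ['€', 'евр', 'euro'],
    'USD': ['US$', '$', 'dollars'],
}

# invert once: surface form -> currency code
SURFACE_TO_CODE = {
    surface: code
    for code, surfaces in CURRENCY_SUBTYPES.items()
    for surface in surfaces
}

SURFACE_TO_CODE.get('$')       # 'USD'
SURFACE_TO_CODE.get('pounds')  # None instead of an error for unknown forms
```

Inside the pipeline component this becomes `ent._.entity_subtypes = SURFACE_TO_CODE.get(ent.text)`.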