NER Standardized Name Output

I’m loving NER training and getting good results on NER identification. However, I’m struggling with mapping synonyms/abbreviations to a single standardized name. Here are a couple of examples:

  • Machine learning, ML -> Machine Learning
  • RNN, rnn, Recurrent neural network, recurrant neural network -> Recurrent Neural Network (spelling mistake on purpose)

I’m building a SKILL extractor and need to be able to extract skills (mainly data science related) but also map the extracted entities to a single name. (I have a list of the standardized names).

I saw this on the spacy.io website: https://github.com/explosion/spacy/blob/master/examples/pipeline/custom_component_entities.py

Would this be the right way to achieve this behavior?
Thanks for the awesome work!

Thanks! Nice to hear that NER training has been working well for you so far 🙂

In general, this can be achieved by training another statistical model to predict only those mappings and adding it after the entity recognizer in the pipeline. The terms that are commonly used for this are "entity linking" or "named entity disambiguation".

However, you might find that a more straightforward, rule-based approach works just as well. "Data science skills" is a pretty well-defined category. So just by looking at the top 100 entities, you should be able to bootstrap a dictionary for this pretty quickly. You should also be able to cover most misspellings by looking at the edit distance.
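
As a rough illustration of the bootstrapping step, here's a minimal sketch that counts the most frequent entity strings, so you can see which ones need a dictionary entry first. The `extracted` list is made-up sample data standing in for the SKILL entities your model predicts, and `top_entities` is a hypothetical helper name, not part of spaCy or Prodigy.

```python
from collections import Counter


def top_entities(entity_texts, n=100):
    """Return the n most common entity strings, ignoring case and outer whitespace."""
    counts = Counter(text.strip().lower() for text in entity_texts)
    return counts.most_common(n)


# made-up sample data: in practice, collect ent.text for every SKILL entity
extracted = ["ML", "machine learning", "Machine Learning", "RNN", "rnn", "ml"]
for text, count in top_entities(extracted, n=3):
    print(text, count)
```

Counting on the lowercased text groups variants like "RNN" and "rnn" together, which usually makes the list of names to map much shorter.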

Here's a simple, semi-pseudocode example of how you could set this up via a custom pipeline component. To store the "standardised" entity label, you could add a custom attribute entity_norm to all Spans (which includes entity spans).

from spacy.tokens import Span
# register extension and make it default to None
Span.set_extension('entity_norm', default=None)

Here's a function that takes a text and returns the "normalised form". You could also experiment with lowercasing the text, calculating the edit distance or computing other things:

def get_entity_norm(text):
    norm_dict = {
        'machine learning': 'Machine Learning',
        'ML': 'Machine Learning'
        # etc.
    }
    # get mapping from dict, otherwise return original text
    return norm_dict.get(text, text)
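
To make the lowercasing and edit distance idea more concrete, here's one possible sketch of a fuzzier lookup. Everything in it is an assumption for illustration: the `NORM_DICT` contents, the `get_entity_norm_fuzzy` name, the `max_distance=2` threshold and the rule of skipping fuzzy matching for strings of five characters or fewer. You'd want to tune all of these on your own data.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# hypothetical mapping: lowercased variant -> standardised name
NORM_DICT = {
    "machine learning": "Machine Learning",
    "ml": "Machine Learning",
    "recurrent neural network": "Recurrent Neural Network",
    "rnn": "Recurrent Neural Network",
}


def get_entity_norm_fuzzy(text, max_distance=2):
    key = text.strip().lower()
    if key in NORM_DICT:
        return NORM_DICT[key]
    # skip fuzzy matching for short strings: abbreviations like "ML" are
    # only one or two edits away from many unrelated abbreviations
    if len(key) > 5:
        closest = min(NORM_DICT, key=lambda known: edit_distance(key, known))
        if edit_distance(key, closest) <= max_distance:
            return NORM_DICT[closest]
    # no confident match, so fall back to the original text
    return text


print(get_entity_norm_fuzzy("recurrant neural network"))
```

The exact dictionary lookup runs first, so the slower distance comparison only happens for strings that aren't already known.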

And here's the pipeline component that takes a spaCy Doc and sets the entity_norm attribute for all entities that are predicted as SKILL:

def entity_linking_component(doc):
    # this function can be added to a spaCy pipeline
    for ent in doc.ents:
        if ent.label_ == 'SKILL':
            # look up the entity norm for the entity's text
            ent._.entity_norm = get_entity_norm(ent.text)
    return doc

Usage could then look something like this:

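import spacy

nlp = spacy.load('your-custom-model')
# spaCy v2-style call; in spaCy v3, register the function with
# @Language.component and pass its string name to add_pipe instead
nlp.add_pipe(entity_linking_component, after='ner')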

# let's assume your model predicts all entities correctly here
doc = nlp(u"I have 3 years of ML experience (TF, Keras, pytorch)")
for ent in doc.ents:
    print(ent.text, ent._.entity_norm, ent.label_)

# ML Machine Learning SKILL
# TF TensorFlow SKILL
# Keras Keras SKILL
# pytorch PyTorch SKILL

The more you can fine-tune your model to predict SKILL accurately, the better the results. You can also use Prodigy to conduct regular evaluations to make sure you're not missing any important skills, to find more examples to add to your dictionary / entity linking logic and to make sure your entities are resolved correctly.


Thanks Ines! This is very helpful! The code samples are a huge time saver.