Misspelled named entity extraction

I am wondering whether Prodigy can handle misspelled entities. For example, when extracting drug names from medical notes, some names may be misspelled. When annotating with the ner.teach recipe, will Prodigy suggest misspelled tokens too?

I have a large list of drug names that I’m interested in extracting from texts. At the moment I’m trying both options: the PhraseMatcher, and ner.teach with ner.batch-train (I’m very new to spaCy and Prodigy, so I’m still learning and may not be aware of better solutions). The PhraseMatcher works just fine: provided my list of drug names, it finds them all in the texts. With ner.teach/batch-train I’m running into the ‘catastrophic forgetting’ problem and am trying to sort it out.

I am happy to keep working with the PhraseMatcher; however, it misses misspelled entities. I thought about adding an edit distance check on top of the PhraseMatcher to account for misspellings, but I’m not sure how to implement that efficiently.

Any suggestions on how to extract misspelled entities, please?

The ner.teach recipe will retrieve all possible analyses of the text from the model, and will then ask you for feedback on the ones with a prediction closest to 0.5. So generally speaking, ner.teach will suggest whatever the model predicts that fits within these constraints. By default, the NER model uses each token's prefix (the first character) and suffix (the last three characters) as part of its features, so depending on the type of spelling errors, it might be able to generalise pretty well here.
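For a quick way to see those features, you can inspect the prefix_ and suffix_ attributes in spaCy (a minimal sketch; the drug name and its misspellings are made up for illustration):

import spacy

nlp = spacy.blank('en')
doc = nlp('paracetamol paracetamoll parcetamol')
for token in doc:
    print(token.text, token.prefix_, token.suffix_)
# paracetamol p mol
# paracetamoll p oll
# parcetamol p mol

Two of the three variants end up with exactly the same prefix and suffix features as the correct spelling, which is part of why the model can often still recognise slightly misspelled tokens.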

Regarding the edit distance on top of the PhraseMatcher: the difficulty is that an approach like this becomes a lot more complex once you also need to deal with multi-word entities. If you're only looking at single tokens, you could come up with a performant solution, but once you need to look at all possible spans and combinations of subsequent tokens, it becomes really tricky. That's also one of the reasons why statistical named entity recognition is often a better fit for problems like this.

However, once you have a model that predicts DRUG entities, you could use the edit distance to normalise the predicted entity names to entries in your dictionary.
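A minimal sketch of what that could look like, using Python's built-in difflib (which uses sequence similarity rather than a true edit distance; DRUG_NAMES stands in for your dictionary):

import difflib

DRUG_NAMES = ['paracetamol', 'ibuprofen', 'amoxicillin']  # stand-in for your dictionary

def normalise_drug(ent_text, cutoff=0.8):
    # return the closest dictionary entry, or the raw text if nothing is close enough
    matches = difflib.get_close_matches(ent_text.lower(), DRUG_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else ent_text

print(normalise_drug('ibuprofin'))  # ibuprofen

Because you'd only run this over the texts of the predicted DRUG spans, you're comparing single entity strings against the dictionary instead of fuzzy-matching whole documents.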

If you're looking to train a model, your current PhraseMatcher solution will definitely come in handy, as it gives you a really easy supply of "free" training data. If you run your existing pipeline over your texts with spaCy, you can extract the entities it predicts and use them as training data for a new model.

You could also use data augmentation to explicitly teach the model more about spelling variations. For example, you could create a little misspellings dictionary and then randomly replace the correct spellings with the misspellings, adjusting the span boundaries if necessary (e.g. if the misspelling is shorter or longer than the original) and adding those examples to your training data. This can make the model less sensitive to spelling variations.
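Here's a rough sketch of such an augmentation step, assuming examples as dicts with a 'text' and character-offset 'spans' (the same format as the extraction example below), and a hypothetical MISSPELLINGS dictionary:

import random

MISSPELLINGS = {'paracetamol': ['paracetemol', 'parcetamol']}  # hypothetical

def augment(example, prob=0.5):
    # example: {'text': ..., 'spans': [{'start': ..., 'end': ..., 'label': ...}]}
    text = example['text']
    new_spans = []
    offset = 0  # cumulative shift caused by earlier replacements
    for span in sorted(example['spans'], key=lambda s: s['start']):
        start, end = span['start'] + offset, span['end'] + offset
        ent_text = text[start:end]
        variants = MISSPELLINGS.get(ent_text.lower())
        if variants and random.random() < prob:
            variant = random.choice(variants)
            text = text[:start] + variant + text[end:]
            offset += len(variant) - (end - start)
            end = start + len(variant)
        new_spans.append({'start': start, 'end': end, 'label': span['label']})
    return {'text': text, 'spans': new_spans}

You'd then add the augmented copies to your training data alongside the originals.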

Which other entities do you need to predict? From what you describe, it sounds like you might be much better off starting from scratch. If you have a pipeline that works reasonably well, you can use that to create training data for you, and only include the labels you care about. For example:

examples = []  # training examples to export later
labels = ('DRUG', 'PERSON', 'ORG')  # labels you want to keep

# assume the nlp object runs your full pipeline: the pretrained NER
# plus your custom PhraseMatcher component
for doc in nlp.pipe(YOUR_TEXTS):
    spans = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
             for ent in doc.ents if ent.label_ in labels]
    examples.append({'text': doc.text, 'spans': spans})
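To get the examples into the JSONL format that Prodigy reads, you could export them with srsly, the serialization library that ships with spaCy (the file name here just matches the command further down):

import srsly

srsly.write_jsonl('examples.jsonl', examples)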

Instead of updating an existing model with your new DRUG entity type, you can then train a new model from scratch that can predict PERSON, ORG and DRUG entities. For PERSON and ORG, the training data will come from the model's existing predictions. For DRUG, it'll come from your phrase matcher.

If you like, you can also add a manual step in between that lets you correct the extracted examples where necessary. If your existing pipeline already gets about 80% right, this is still much better than labeling everything by hand, since you'll only have to fix the remaining 20% yourself.

prodigy ner.manual ner_gold en_core_web_sm examples.jsonl --label DRUG,PERSON,ORG