The ner.teach recipe will retrieve all possible analyses of the text from the model, and will then ask you for feedback on the ones with a prediction closest to 0.5. So generally speaking, ner.teach will suggest whatever the model predicts that fits within these constraints. By default, the NER model uses the prefix (first character) and suffix (last 3 characters) attributes as part of its features, so depending on the type of spelling errors, it might be able to generalise pretty well here.
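To make the "closest to 0.5" idea a bit more concrete, here's a minimal sketch of that kind of uncertainty-based selection. This is not Prodigy's actual implementation, just an illustration: the most_uncertain helper and the candidate scores are made up for the example.

```python
def most_uncertain(candidates, n=3):
    """Return the n candidates whose score is closest to 0.5."""
    return sorted(candidates, key=lambda c: abs(c["score"] - 0.5))[:n]


# Made-up candidate analyses with made-up scores, for illustration only
candidates = [
    {"text": "asprin", "label": "DRUG", "score": 0.48},
    {"text": "Aspen", "label": "DRUG", "score": 0.05},
    {"text": "ibuprofin", "label": "DRUG", "score": 0.61},
    {"text": "paracetamol", "label": "DRUG", "score": 0.97},
]

for cand in most_uncertain(candidates, n=2):
    print(cand["text"], cand["score"])
```

Suggestions the model is already very sure about, in either direction, end up at the back of the queue, which is why your answers on the uncertain ones tend to be the most useful.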
The difficulty here is that an approach like this would become a lot more complex if you also need to deal with multi-word entities. If you're only looking at single tokens, you could come up with a performant solution – but once you need to look at all possible spans and combinations of subsequent tokens, this becomes really tricky. That's also one of the reasons why statistical named entity recognition is often a better fit for problems like this.
However, once you already have a model that predicts DRUG entities, you could use the edit distance to normalize the entity names to common entries in a dictionary.
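Here's a rough sketch of what that normalization step could look like, using a plain Levenshtein distance. The DRUG_NAMES list, the normalize_entity helper and the max_distance threshold are all assumptions for the example, so you'd want to swap in your own dictionary and tune the threshold on your data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def normalize_entity(text, vocabulary, max_distance=2):
    """Map an entity string to its closest vocabulary entry.

    Returns None if no entry is within max_distance edits.
    """
    text = text.lower()
    best = min(vocabulary, key=lambda entry: levenshtein(text, entry.lower()))
    return best if levenshtein(text, best.lower()) <= max_distance else None


# Hypothetical dictionary of canonical drug names
DRUG_NAMES = ["ibuprofen", "paracetamol", "aspirin"]

print(normalize_entity("Ibuprofin", DRUG_NAMES))
print(normalize_entity("asprin", DRUG_NAMES))
print(normalize_entity("London", DRUG_NAMES))
```

In a real pipeline you'd call something like this on ent.text for each predicted DRUG entity. For a large dictionary, comparing against every entry gets slow, so a dedicated fuzzy matching library might be a better fit.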
If you're looking to train a model, your current PhraseMatcher solution will definitely come in handy, though, as it gives you a really easy supply of "free" training data. If you run your existing pipeline over your text using spaCy, you can extract the entities it predicts and use that as training data for a new model.
You could also use data augmentation to explicitly teach the model more about spelling variations. For example, you could create a little misspellings dictionary and then randomly replace the correct spellings with the misspellings, adjust the span boundaries if necessary (i.e. if the misspelling is shorter or longer) and add those examples to your training data. This can make the model less sensitive to spelling variations.
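As a sketch of how that augmentation could work on examples in the text-plus-spans format: the MISSPELLINGS dictionary and the augment helper below are hypothetical, and the code assumes the spans don't overlap and that only the annotated entities get misspelled.

```python
import random


def augment(example, misspellings, rng):
    """Return a copy of the example with entity text swapped for a random
    misspelling and the span offsets shifted to match.

    Assumes the spans in the example do not overlap.
    """
    text = example["text"]
    pieces, new_spans = [], []
    cursor = 0  # how far into the original text we've copied
    shift = 0   # accumulated length difference from earlier replacements
    for span in sorted(example["spans"], key=lambda s: s["start"]):
        original = text[span["start"]:span["end"]]
        variants = misspellings.get(original.lower())
        replacement = rng.choice(variants) if variants else original
        pieces.append(text[cursor:span["start"]])
        pieces.append(replacement)
        new_start = span["start"] + shift
        new_spans.append({
            "start": new_start,
            "end": new_start + len(replacement),
            "label": span["label"],
        })
        shift += len(replacement) - len(original)
        cursor = span["end"]
    pieces.append(text[cursor:])
    return {"text": "".join(pieces), "spans": new_spans}


# Hypothetical misspellings dictionary: correct spelling -> variants
MISSPELLINGS = {
    "ibuprofen": ["ibuprofin", "ibuprophen"],
    "aspirin": ["asprin"],
}

example = {
    "text": "She took aspirin and ibuprofen.",
    "spans": [
        {"start": 9, "end": 16, "label": "DRUG"},
        {"start": 21, "end": 30, "label": "DRUG"},
    ],
}

augmented = augment(example, MISSPELLINGS, random.Random(0))
print(augmented["text"])
for span in augmented["spans"]:
    print(span, augmented["text"][span["start"]:span["end"]])
```

You'd probably want to keep the original examples as well and only add the augmented ones on top, so the model still sees plenty of correct spellings.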
Which other entities do you need to predict? From what you describe, it sounds like you might be much better off starting from scratch. If you have a pipeline that works reasonably well, you can use that to create training data for you, and only include the labels you care about. For example:
examples = []  # export this later
labels = ('DRUG', 'PERSON', 'ORG')  # labels you want to keep
# let's assume the nlp object has your full pipeline of NER
# plus custom PhraseMatcher component
for doc in nlp.pipe(YOUR_TEXTS):
    spans = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
             for ent in doc.ents if ent.label_ in labels]
    examples.append({'text': doc.text, 'spans': spans})
Instead of updating an existing model with your new DRUG entity type, you can then train a new model from scratch that can predict PERSON, ORG and DRUG entities. For PERSON and ORG, the training data will come from the model's existing predictions. For DRUG, it'll come from your phrase matcher.
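To export the collected examples, one option is to write them out as newline-delimited JSON. This is just a sketch using the standard library: the write_jsonl helper and the examples.jsonl file name are assumptions, and the single example here is a stand-in for the list your pipeline produced.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line to the given path."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in for the examples list collected from your pipeline
examples = [
    {"text": "She took aspirin.",
     "spans": [{"start": 9, "end": 16, "label": "DRUG"}]},
]

write_jsonl("examples.jsonl", examples)
```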
If you like, you can also add a manual step in between that lets you correct the extracted examples if necessary. If your existing pipeline already gets about 80% right, this is still much better than labeling everything by hand, since you'll only have to do 20% yourself.
prodigy ner.manual ner_gold en_core_web_sm examples.jsonl --label DRUG,PERSON,ORG