NER Training for Corporate Names

Ah, I think that might be a small misunderstanding: The entity ruler itself doesn't learn anything, just like the statistical entity recognizer doesn't "learn" anything from the entity ruler. But if entities are already set in a previous pipeline step and the statistical entity recognizer encounters them, it will "predict around them" and use them as constraints for its predictions. This means that a pre-trained statistical NER model may produce better results and make fewer mistakes, because some of the wrong predictions it would have made otherwise are now impossible or very unlikely.

For a quick overview of how the entity labels are predicted and what the BILUO scheme (e.g. B-PERSON) means, see my comment here: EntityRuler causes NER entities to go missing · Issue #3775 · explosion/spaCy · GitHub
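To make the scheme concrete, here's a minimal sketch of how character offsets map to BILUO tags. It assumes spaCy v3, where the conversion helper lives in spacy.training (in v2 it was spacy.gold.biluo_tags_from_offsets); the sentence and offsets are just illustrative:

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("He works at ACME Inc.")

# "ACME Inc." covers characters 12-21 and is labelled ORG
tags = offsets_to_biluo_tags(doc, [(12, 21, "ORG")])
print(list(zip([t.text for t in doc], tags)))
# Expected: [('He', 'O'), ('works', 'O'), ('at', 'O'), ('ACME', 'B-ORG'), ('Inc.', 'L-ORG')]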

To give you an example, let's say you have a sentence like: "He works at John Doe's ACME Inc.". Your model may analyse it like this and incorrectly predict "John Doe's ACME Inc." as a company (which isn't even so far-fetched, but it's obviously wrong):

["He", "works", "at", "John", "Doe", "'s", "ACME", "Inc."]  # Tokens
["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "L-ORG"]  # Predicted entity tags

Now imagine you have the entity ruler in the pipeline before the named entity recognizer and "ACME Inc." is covered by a pattern. It'll assign the entity for those tokens (B-ORG, i.e. beginning of an entity, and L-ORG, i.e. last token of an entity):

["He", "works", "at", "John", "Doe", "'s", "ACME", "Inc."]  # Tokens
["?", "?", "?", "?", "?", "?", "B-ORG", "L-ORG"]  # Entity tags added by the EntityRuler

Next, the statistical entity recognizer is applied and encounters this state. Predicting "John Doe's ACME Inc." as a company is now impossible, because we already know that "ACME" is a B-ORG, i.e. the beginning of an ORG entity span, so it can't also be inside another entity span. The entity recognizer will only fill in the gaps and predict the labels for the other tokens – and if you're lucky, the correct analysis where "John Doe" is a person will now be a lot more likely in this context:

["He", "works", "at", "John", "Doe", "'s", "ACME", "Inc."]  # Tokens
["?", "?", "?", "?", "?", "?", "B-ORG", "L-ORG"]  # Entity tags added by the EntityRuler
["O", "O", "O", "B-PERSON", "L-PERSON", "O", "B-ORG", "L-ORG"]  # Final entity tags with predictions

It's the same model with the same weights – but it's able to make better predictions because your rules have constrained those predictions at runtime.
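If you want to try this end to end, here's a hedged sketch assuming spaCy v3 and a pretrained English pipeline like en_core_web_sm (you may need to run python -m spacy download en_core_web_sm first); the pattern is just the example from above:

import spacy

nlp = spacy.load("en_core_web_sm")

# Add the entity ruler *before* the statistical NER so its spans act as
# constraints on the NER's predictions
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "ACME Inc."}])

doc = nlp("He works at John Doe's ACME Inc.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# "ACME Inc." is now guaranteed to come out as ORG; whether "John Doe"
# gets tagged as PERSON depends on the model, as described above.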
