Training NER with Entity Ruler

Hi all,

I can’t seem to find any examples or documentation about training an NER model “around” the Entity Ruler. The examples given here have all other pipeline components disabled while training the NER model, which I understand, but if I wanted to use the Entity Ruler before the NER model to help improve predictions, would I leave the Entity Ruler enabled in the pipeline while training the NER model or is there another method for doing this?


Hi! One thing that might be important to note is that the “accuracy improvement” that can be achieved using the entity ruler is usually an improvement at runtime, not during training. The idea is that you can pre-define entities using your rules, and the pre-trained statistical entity recognizer will then be prevented from making conflicting predictions, and also take the existing spans into account when predicting entity tags for the remaining tokens. And you’ll be able to add entities that the model would have otherwise missed.

During training, you’re updating the weights of the ner component based on a list of examples plus entity annotations. This is done by calling nlp.update with a tokenized Doc and the respective annotations (e.g. entity spans). So whether the entity ruler is present here or not shouldn’t make a difference, because it won’t be used.
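To make that concrete, here's a minimal sketch of what such training examples look like — the texts, labels and character offsets are made up for illustration, but the shape (text plus entity offsets) follows spaCy v2's training examples:

```python
# Training data for the ner component: plain texts plus entity offsets.
# The entity ruler plays no part here -- only these annotations are used.
TRAIN_DATA = [
    ("Apple was founded in Cupertino.",
     {"entities": [(0, 5, "ORG"), (21, 30, "GPE")]}),
    ("Sundar Pichai leads Google.",
     {"entities": [(0, 13, "PERSON"), (20, 26, "ORG")]}),
]

# Each offset pair has to line up with the text exactly:
for text, annots in TRAIN_DATA:
    for start, end, label in annots["entities"]:
        print(text[start:end], label)
```

During training you'd then loop over examples like these and pass them to `nlp.update`, with or without an entity ruler in the pipeline — the result is the same.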

However, you can use the entity ruler to create training data more easily: process each text with the nlp object, extract spans for all doc.ents, export the examples, correct them in Prodigy using a recipe like ner.manual and then use the resulting dataset to train your model. Here’s an example:

examples = []
# Let's assume your nlp object has an entity ruler in the pipeline
for doc in nlp.pipe(LOTS_OF_TEXTS):
    spans = []
    for ent in doc.ents:
        span = {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        spans.append(span)
    example = {"text": doc.text, "spans": spans}
    examples.append(example)

This will create one dict per text, with a list of "spans" describing the character offsets of each entity defined by the entity ruler. You could then export your examples to a JSONL file to load into Prodigy:

prodigy.util.write_jsonl("/path/to/data.jsonl", examples)
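If you'd rather not use Prodigy's helper here, plain `json` from the standard library does the same thing — one JSON dict per line. A sketch, with a made-up example dict mirroring the format built above:

```python
import json

# One JSON dict per line -- the JSONL format Prodigy reads.
examples = [
    {"text": "Apple was founded in Cupertino.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"},
               {"start": 21, "end": 30, "label": "GPE"}]},
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")
```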

You could then run the ner.manual recipe, which will show you the pre-highlighted spans and allow you to correct mistakes. Ideally, there should be very little to do – but even if your rules only cover 50% of the entities, that’s still 50% less work for you!

prodigy ner.manual corrected_entities en_core_web_sm /path/to/data.jsonl --label LABEL_ONE,LABEL_TWO

You could also try using less strict rules here to allow for some ambiguity. This is also a nice way to inspect your data and evaluate your rules. Once you’re done correcting, you can compare your corrected dataset to the original input and check how well your rules performed. This will also give you a good baseline for when you train your model – because you’ll definitely want to beat a purely rule-based approach :slightly_smiling_face:
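That comparison doesn't need anything fancy — exact-match span scoring already gives you a useful baseline. Here's a sketch (the helper name and the example spans are made up, not part of Prodigy's API); both arguments are lists of `(start, end, label)` tuples:

```python
def score_spans(predicted, corrected):
    """Exact-match precision/recall of rule matches vs. corrected spans."""
    pred, gold = set(predicted), set(corrected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Here the rules found two of three gold entities, with no false positives:
p, r = score_spans(
    predicted=[(0, 5, "ORG"), (21, 30, "GPE")],
    corrected=[(0, 5, "ORG"), (21, 30, "GPE"), (35, 41, "PERSON")],
)
# p == 1.0, r == 2/3
```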


Sorry to bring this back.
Is all of this now covered by ner.match, where we load a patterns file to do the same and correct false positives?

Yes, ner.match lets you go through all pattern matches and accept or reject them, which should be very fast :slightly_smiling_face:

The solution I outlined above is for streaming in data with all matches highlighted and allowing manual corrections and adding more entities that are not covered in the patterns.

Hi there!

Do you think it's possible to create a recipe to automate this approach (streaming in data with all matches highlighted and then allowing manual corrections)?

Thanks!

@MatthieuOD You should already be able to do that out-of-the-box by saving out the nlp object with the entity ruler, and then loading that model into ner.make-gold. The ner.make-gold recipe will highlight whatever the model outputs in the doc.ents – so if the model only has an entity ruler, the matches will be used and highlighted.