Hi! One thing that might be important to note is that the “accuracy improvement” you can get from the entity ruler is usually an improvement at runtime, not during training. The idea is that you pre-define entities using your rules, and the pre-trained statistical entity recognizer is then prevented from making conflicting predictions and takes the existing spans into account when predicting entity tags for the remaining tokens. You’ll also be able to add entities that the model would otherwise have missed.
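For example, here’s a minimal sketch of what that setup could look like, assuming the spaCy v2-style `EntityRuler` API (the patterns and model name are just placeholders):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")

# rules for entities the statistical model tends to miss (placeholder patterns)
patterns = [
    {"label": "ORG", "pattern": "Acme Corp"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "acme"}, {"LOWER": "widget"}]},
]
ruler = EntityRuler(nlp, patterns=patterns)

# add the ruler *before* the statistical ner component, so its spans are set
# first and the ner component can't overwrite them with conflicting predictions
nlp.add_pipe(ruler, before="ner")

doc = nlp("Acme Corp just released the Acme Widget.")
print([(ent.text, ent.label_) for ent in doc.ents])
```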
During training, you’re updating the weights of the `ner` component based on a list of examples plus entity annotations. This is done by calling `nlp.update` with a tokenized `Doc` and the respective annotations (e.g. entity spans). So whether the entity ruler is present in the pipeline or not shouldn’t make a difference here, because it won’t be used.
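To illustrate, here’s a minimal sketch of that kind of update loop, assuming the spaCy v2-style `nlp.update` API and made-up training examples:

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# made-up examples: (text, annotations with character-offset entity spans)
TRAIN_DATA = [
    ("Acme Corp is based in Berlin", {"entities": [(0, 9, "ORG"), (22, 28, "GPE")]}),
]

# resume_training keeps the existing pretrained weights (available in v2.1+)
optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # the text is tokenized into a Doc internally, then the ner weights are updated
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)
    print(losses)
```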
However, you can use the entity ruler to create training data more easily: process each text with the `nlp` object, extract the spans for all `doc.ents`, export the examples, correct them in Prodigy using a recipe like `ner.manual`, and then use the resulting dataset to train your model. Here’s an example:
```python
examples = []
# Let's assume your nlp object has an entity ruler in the pipeline
for doc in nlp.pipe(LOTS_OF_TEXTS):
    spans = []
    for ent in doc.ents:
        # ent.label_ is the string label (ent.label is the integer hash)
        span = {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        spans.append(span)
    example = {"text": doc.text, "spans": spans}
    examples.append(example)
```
This will create one dict per text, with a list of `"spans"` describing the character offsets of each entity defined by the entity ruler. You could then export your `examples` to a JSONL file to load into Prodigy:
```python
import prodigy

prodigy.util.write_jsonl("/path/to/data.jsonl", examples)
```
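Each line of that file then holds one example as JSON, roughly like this (text and labels made up for illustration):

```
{"text": "Acme Corp is based in Berlin", "spans": [{"start": 0, "end": 9, "label": "ORG"}, {"start": 22, "end": 28, "label": "GPE"}]}
```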
You could then run the `ner.manual` recipe, which will show you the pre-highlighted spans and allow you to correct mistakes. Ideally, there should be very little to do, but even if your rules only cover 50% of the entities, that’s still 50% less work for you!
```bash
prodigy ner.manual corrected_entities en_core_web_sm /path/to/data.jsonl --label LABEL_ONE,LABEL_TWO
```
You could also try using less strict rules here to allow for some ambiguity. This is also a nice way to inspect your data and evaluate your rules. Once you’re done correcting, you can compare your corrected dataset to the original input and check how well your rules performed. This will also give you a good baseline for when you train your model, because you’ll definitely want to beat a purely rule-based approach.
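For example, here’s a rough sketch of one way to compute that baseline, assuming you’ve exported the corrected annotations with `prodigy db-out corrected_entities > /path/to/corrected.jsonl` (paths and details like answer filtering are simplified for illustration):

```python
import json

def load_jsonl(path):
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f]

def span_set(examples):
    # represent each entity as a (text, start, end, label) tuple
    return {
        (eg["text"], span["start"], span["end"], span["label"])
        for eg in examples
        for span in eg.get("spans", [])
    }

rule_spans = span_set(load_jsonl("/path/to/data.jsonl"))        # what the rules predicted
gold_spans = span_set(load_jsonl("/path/to/corrected.jsonl"))   # what you accepted as correct

true_positives = len(rule_spans & gold_spans)
precision = true_positives / len(rule_spans) if rule_spans else 0.0
recall = true_positives / len(gold_spans) if gold_spans else 0.0
print(f"Rule-based baseline: precision {precision:.2f}, recall {recall:.2f}")
```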