NER or PhraseMatcher?

Yes, that’s correct. The entity recognizer will respect entity spans that were already set by previous pipeline components, and use their boundaries as constraints for its predictions.
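Here’s a minimal sketch of that behaviour, using spaCy v3 syntax and assuming you have `en_core_web_sm` installed. A rule-based `EntityRuler` added before the `"ner"` component sets its spans first, and the statistical model only predicts entities around them:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Rule-based component added *before* the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "STREET", "pattern": "Richard Wagner Straße"}])

doc = nlp("We met at Richard Wagner Straße and talked about Richard Wagner.")
print([(ent.text, ent.label_) for ent in doc.ents])
# The span matched by the ruler keeps its STREET label; the model can still
# predict other entities (likely "Richard Wagner" as PERSON) around it.
```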

If you have good rules, you could also use them to bootstrap training data for your model and improve the entity recognizer, without having to label anything from scratch. This would then allow you to go beyond your rules and be able to label, say, “May-Ayim-Ufer 9”, even if none of those components were part of your gazetteer. Here’s an example of a possible workflow:

  1. Create gazetteers for your categories and write rules to handle ambiguity (e.g. “Richard Wagner” vs. “Richard Wagner Straße”).
  2. Add your rule-based component to your spaCy pipeline, parse lots of text and extract the text plus entities (see the sketch after this list).
  3. Load the data into Prodigy and run ner.manual to see the entities and correct them if necessary. If your rules are good and 90% accurate, you only have to change something in about 10% of the cases. So this should be super quick.
  4. Use the created data as gold-standard training data for your model.
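To make steps 2–3 more concrete, here’s a rough sketch of a PhraseMatcher-based gazetteer component that sets entities and dumps text plus span offsets as JSONL, which you can then load into Prodigy to correct. The gazetteer terms, the component name `street_gazetteer` and the file name are just made up for illustration:

```python
import json
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

STREET_TERMS = ["Richard Wagner Straße", "May-Ayim-Ufer"]  # your gazetteer

nlp = spacy.blank("de")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("STREET", [nlp.make_doc(term) for term in STREET_TERMS])

@Language.component("street_gazetteer")
def street_gazetteer(doc):
    # Turn matches into entity spans, dropping overlaps
    spans = [Span(doc, start, end, label="STREET") for _, start, end in matcher(doc)]
    doc.ents = filter_spans(spans)
    return doc

nlp.add_pipe("street_gazetteer")

# Parse lots of text and write out text + entity offsets in a JSONL format
# that Prodigy's ner.manual can load for correction.
texts = ["Wir treffen uns am May-Ayim-Ufer in Berlin."]
with open("bootstrap_data.jsonl", "w", encoding="utf8") as f:
    for doc in nlp.pipe(texts):
        spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                 for ent in doc.ents]
        f.write(json.dumps({"text": doc.text, "spans": spans}) + "\n")
```

Once you’ve corrected the annotations in Prodigy, you can export them and use them as gold-standard training data for the entity recognizer (step 4).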