Yes, this sounds like a reasonable approach. Depending on how well the model learns from the initial set of annotated examples, you could also try improving it with binary annotations using `ner.teach`. I'd recommend running smaller experiments to find out what works best before you go all-in. Just make sure to always use the previously trained model as the base model for the next annotation step – for example, when you run `ner.make-gold`, load in the model that was saved in the previous `ner.batch-train` step.
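As a rough sketch, the chain could look something like this on the command line – the dataset names and paths here are placeholders, and the exact arguments may differ slightly depending on your Prodigy version:

```
# train on the annotations collected so far and save the model to a directory
prodigy ner.batch-train your_dataset en_core_web_sm --output ./ner-model

# use the saved model as the base model for the next annotation step
prodigy ner.make-gold your_gold_dataset ./ner-model your_texts.jsonl --label PERSON,LOCATION
```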
Depending on the domain you're working with, it might also be worth looking into existing resources that you can use to at least roughly pre-train the model. The categories you're working with are pretty standard (`PERSON`, `LOCATION` etc.), and even if the text type doesn't match perfectly, it'll at least give you a bit more to work with and experiment on.
Whether they're considered stop words or not shouldn't really matter – at least not as far as the model is concerned.
If you have an existing list of names, you could use them to create match patterns, provided via the `--patterns` argument of the `ner.teach` or `ner.match` recipes. This lets you supply specific or abstract token-based examples of the entities you're looking for, and Prodigy will then match and present those phrases in context so you can accept or reject them. For example:
{"label": "PERSON", "pattern": [{"lower": "john"}, {"lower": "doe"}]}
You could also write your own annotation workflow in a custom recipe that takes the list of examples and pre-selects them in your data. The main goal here is to make annotation faster and allow you to run quicker experiments – even if your patterns only cover 50% of the entities, you still only have to label the other 50% manually (instead of everything).
Another thing you could try is to combine the statistical model you're training with a rule-based approach, e.g. using spaCy's `Matcher` or `PhraseMatcher`. For example, you could add a custom pipeline component that makes sure unambiguous city and country names are always added to `doc.ents`, even if they're not predicted as an entity by the statistical model. You don't have to rely solely on the model's predictions – combining the predictions with rules is often much more powerful.
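Here's a minimal sketch of what such a component could look like, using the spaCy v2-style API that matches the recipes above – the terminology list, the component name and the `GPE` label are just placeholders for your own rules:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# unambiguous place names you always want labelled (placeholder list)
CITIES = ["Berlin", "San Francisco", "Tokyo"]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("GPE", None, *[nlp.make_doc(city) for city in CITIES])

def add_unambiguous_places(doc):
    spans = [Span(doc, start, end, label="GPE") for _, start, end in matcher(doc)]
    # keep the model's predictions and only add matches that don't overlap them
    doc.ents = list(doc.ents) + [
        span for span in spans
        if not any(span.start < ent.end and span.end > ent.start for ent in doc.ents)
    ]
    return doc

# run the rules after the statistical NER so they can fill in what the model missed
nlp.add_pipe(add_unambiguous_places, after="ner")

doc = nlp("She moved from Berlin to San Francisco last year.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

(In spaCy v3 you'd register the function with `@Language.component` and add it to the pipeline by name, but the idea is the same.)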
This is really difficult to say, because accuracy is relative and depends on what you're evaluating your system against. Ultimately, in the "real world", what matters is whether your system is useful or not.
If you're building an anonymisation system, your model likely won't have the "final word" – instead, you'll be using it to make the process easier for humans and/or to flag potential problems, right? Otherwise, even a model that's 90% accurate on your runtime data (which could be considered pretty state-of-the-art for NER, I guess) would make a mistake on every 10th prediction, which is potentially fatal. After all, a 90% anonymised text is... not anonymised.