A few questions about a Danish NER system using Prodigy

Hey guys, I’m in a situation where I need to create a model that helps me categorize text content that makes it possible to identify someone. In essence, the goal is to anonymize the documents (people, locations, etc.), and I plan on doing so using NER.

I expect to work with over 10,000 documents, all written in Danish. The texts are very domain-specific – for example, they are all about window sales.

From browsing the documentation and the support forum, and from experimenting with Prodigy, I’ve decided on the following workflow (rough commands are sketched below the list):

  • Create word vectors from the Danish documents (prune to the 20,000 most frequent vectors)
  • Create an empty Danish model with the word vectors
  • Annotate data using ner.manual
  • Train with ner.batch-train
  • Annotate using ner.make-gold
  • Train again with ner.batch-train
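
Concretely, I imagine the first steps would look roughly like this – the dataset names, paths and flags are placeholders I’d still have to double-check against the docs for my Prodigy/spaCy versions:

```bash
# Train Danish word vectors externally (e.g. with fastText or Gensim) and
# export them to a plain-text file, e.g. vectors.txt

# Create an empty Danish spaCy model with the vectors, pruned to 20,000 entries
python -m spacy init-model da ./da_vectors_model --vectors-loc vectors.txt --prune-vectors 20000

# Manually annotate a first batch of documents
prodigy ner.manual ner_manual_set ./da_vectors_model ./documents.jsonl --label PERSON,LOCATION
```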

Am I correct in my assumption that this would be a sensible way of moving forward?

I have lists of all viable first names and locations in Denmark that would need to be anonymized. Is there any way to use these to help the model? The names often overlap with stop words, though, which is a bit frustrating.

This might be a bit of a long shot, but what kind of accuracy could I expect from such a model?

Any other tips would be greatly appreciated.

/Kevin

Yes, this sounds like a reasonable approach. Depending on how well the model learns from the initial set of annotated examples, you could also try improving it with binary annotations using ner.teach. I'd recommend running smaller experiments to find out what works best before you go all-in.

Just make sure to always use the previously trained model as the base model for the next annotation step – for example, when you run ner.make-gold, load in the model that was saved in the previous ner.batch-train step.
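
A rough sketch of what that chaining could look like – dataset names and directories here are just placeholders:

```bash
# Train on the manual annotations and save the resulting model
prodigy ner.batch-train ner_manual_set ./da_vectors_model --output ./model_v1 --label PERSON,LOCATION

# Load that trained model as the base for the next annotation pass
prodigy ner.make-gold ner_gold_set ./model_v1 ./documents.jsonl --label PERSON,LOCATION

# Train again, starting from the previously trained model
prodigy ner.batch-train ner_gold_set ./model_v1 --output ./model_v2 --label PERSON,LOCATION
```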

Depending on the domain you're working with, it might also be worth looking into existing resources that you can use to at least roughly pre-train the model. The categories you're working with are pretty standard (PERSON, LOCATION etc.), and even if the text type doesn't match perfectly, it'll at least give you a bit more to work and experiment with.

Whether they're considered stop words or not shouldn't really matter – at least not as far as the model is concerned.

If you have an existing list of names, you could use it to create match patterns, provided via the --patterns argument of the ner.teach or ner.match recipes. This lets you supply specific or abstract token-based examples of the entities you're looking for, and Prodigy will then match and present those phrases in context so you can accept or reject them. For example:

{"label": "PERSON", "pattern": [{"lower": "john"}, {"lower": "doe"}]}

You could also write your own annotation workflow in a custom recipe that takes the list of examples and pre-selects them in your data. The main goal here is to make annotation faster and allow you to run quicker experiments – even if your patterns only cover 50% of the entities, you still only have to label the other 50% manually (instead of everything).
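
Even without a custom recipe, you could generate the patterns file from your existing name and location lists with a few lines of Python. A minimal sketch, assuming the lists are plain-text files with one entry per line (the file names are made up):

```python
import json

def make_patterns(path, label):
    """Turn a plain-text list (one entry per line) into Prodigy match patterns."""
    patterns = []
    with open(path, encoding="utf8") as f:
        for line in f:
            entry = line.strip()
            if not entry:
                continue
            # one token pattern per word, matched case-insensitively
            pattern = [{"lower": word.lower()} for word in entry.split()]
            patterns.append({"label": label, "pattern": pattern})
    return patterns

patterns = make_patterns("first_names.txt", "PERSON") + make_patterns("locations.txt", "LOCATION")

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```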

Another thing you could try is to combine the statistical model you're training with a rule-based approach, e.g. using spaCy's Matcher or PhraseMatcher. For example, you could add a custom pipeline component that makes sure that unambiguous city and country names are always added to the doc.ents, even if they're not predicted as entities by the statistical model. You don't have to rely solely on the model's predictions – combining the predictions with rules is often much more powerful.
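
A minimal sketch of what such a component could look like with the PhraseMatcher, assuming the spaCy v2.x API – the model path and city list are placeholders:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

class LocationMatcher:
    """Pipeline component that adds unambiguous place names to doc.ents."""
    name = "location_matcher"

    def __init__(self, nlp, locations):
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        self.matcher.add("LOCATION", None, *[nlp.make_doc(text) for text in locations])
        self.label = nlp.vocab.strings.add("LOCATION")

    def __call__(self, doc):
        matches = [Span(doc, start, end, label=self.label)
                   for match_id, start, end in self.matcher(doc)]
        # keep the model's predictions, add the rule-based matches and
        # filter out overlapping spans
        doc.ents = filter_spans(list(doc.ents) + matches)
        return doc

nlp = spacy.load("./model_v2")  # path to your trained model (placeholder)
nlp.add_pipe(LocationMatcher(nlp, ["København", "Aarhus", "Odense"]), after="ner")
```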

This is really difficult to say, because accuracy is relative and depends on what you're evaluating your system against. Ultimately, in the "real world", what matters is whether your system is useful or not.

If you're building an anonymisation system, your model likely won't have the "final word" and instead, you'll be using it to make the process easier for humans and/or to flag potential problems, right? Otherwise, even a model that's 90% accurate on your runtime data (which could be considered pretty state-of-the-art for NER, I guess), would make a mistake on every 10th prediction, which is potentially fatal. After all, a 90% anonymised text is... not anonymised.


Hi Ines, thank you so much for the response. It was very helpful!