Train model to fix a certain, repeating mislabelling

Hi!

My model makes a lot of errors that follow a certain pattern: it wrongly classifies the word „ein“ as ORG when it appears at the end of a sentence, even though it isn’t one.

Are there any recommendations for dealing with such repeating errors?

Can I simply train on a lot of examples where this pattern occurs? And if I did, could that cause problems by making the dataset unbalanced?

Also, is there a way to find out why this particular problem occurs?

Assuming this is caused by conflicting labels, what is the best way to connect to the DB to look through some training examples?

Thanks!

Leo

I’m guessing this is a German model? If so, and you’re starting out with the German spaCy model, it’s not too surprising that it makes some weird errors, because it was trained on fairly unrepresentative data from Wikipedia. You might try running ner.batch-train from only a blank or vectors-only model, to avoid starting off with weights that might not be ideal for your task.
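In case it’s useful, here’s a rough sketch of how you could prepare a blank or vectors-only German model to pass as the model argument to ner.batch-train (assuming spaCy v2.x; the paths are just placeholders):

```python
import spacy

# Option 1: a completely blank German pipeline
nlp_blank = spacy.blank("de")
nlp_blank.to_disk("./de_blank_model")

# Option 2: keep the pretrained word vectors from the medium model, but drop
# the pretrained pipeline components (tagger, parser, ner), so the NER weights
# are trained from scratch while still benefiting from the vectors
nlp_vectors = spacy.load("de_core_news_md")
for name in list(nlp_vectors.pipe_names):
    nlp_vectors.remove_pipe(name)
nlp_vectors.to_disk("./de_vectors_model")
```

You can then point ner.batch-train at ./de_blank_model or ./de_vectors_model instead of the packaged German model.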

Regardless of the initialization, to answer your question: there are a few ways you could deal with this, and they’re all potentially valid. The simplest is to hard-code a rule. You can see an example of this here: https://github.com/explosion/spaCy/blob/master/examples/pipeline/fix_space_entities.py . The code in that example implements a rule that prevents space tokens from being tagged as entities. All you have to do to adapt it to your situation is change the conditional on line 19 to something like if token.text == "ein".
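To make that concrete, here’s roughly what the adapted component could look like. This is just a sketch, assuming spaCy v2.x and the small German model (swap in whatever model you’re actually using), and the example sentence is made up:

```python
import spacy
from spacy.attrs import ENT_IOB


def fix_ein_entities(doc):
    # Preset the entity IOB tag for "ein" to "O" before the NER runs,
    # so the statistical model can't label it as part of an entity
    ent_iobs = doc.to_array([ENT_IOB])
    for i, token in enumerate(doc):
        if token.text == "ein":
            ent_iobs[i] = 2  # 2 is the "O" tag (0 is unset, "I" is 1)
    doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
    return doc


nlp = spacy.load("de_core_news_sm")
nlp.add_pipe(fix_ein_entities, name="fix_ein", before="ner")
doc = nlp("Gestern kauften wir im Supermarkt ein.")
print(doc.ents)  # "ein" can no longer show up as an ORG
```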

If you’d rather have the behaviour integrated into the weights, then you can indeed just add negative examples. I think the problem will probably resolve itself as you add more data, so I wouldn’t personally worry about it for now. The rule-based approach will probably be quicker, and it lets you implement other ad-hoc fixes as well. Either way, having a better prediction model will let you annotate faster, as you’ll be able to use ner.make-gold to take the model’s predictions as a starting point.
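On the database question: you can use Prodigy’s Python API to pull the raw annotations and look for conflicting labels. A quick sketch, assuming your dataset is called ner_ein (substitute the real name):

```python
from prodigy.components.db import connect

db = connect()  # uses the settings from your prodigy.json
examples = db.get_dataset("ner_ein")
for eg in examples:
    for span in eg.get("spans", []):
        # Print every annotation whose span covers exactly the word "ein"
        if eg["text"][span["start"]:span["end"]] == "ein":
            print(eg["answer"], span["label"], eg["text"])
```

That should let you see whether you’ve accepted and rejected conflicting ORG spans over the same word.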