Train model to fix a certain, repeating mislabelling

Hi!

My model makes a lot of errors that follow a certain pattern: it wrongly classifies the word „ein“ as ORG when it appears at the end of a sentence, even though it isn’t one.

Are there any recommendations for dealing with such repeating errors?

Can I simply train on a lot of examples where this pattern occurs? And if I did, could that cause problems by making the dataset unbalanced?

Also, is there a way to find out why this particular problem occurs?

Assuming this is caused by conflicting labels, what is the best way to connect to the DB to look through some training examples?

Thanks!

Leo

I’m guessing this is a German model? If so, and you’re starting out with the German spaCy model, it’s not too surprising that it makes some weird errors, because it was trained on fairly unrepresentative data from Wikipedia. You might try running ner.batch-train from only a blank or vectors-only model, to avoid starting off with weights that might not be ideal for your task.
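In case it’s useful, here’s a rough sketch of how you could prepare a blank or vectors-only German model to pass as the model argument to ner.batch-train (assuming spaCy v2.x; the paths are just placeholders):

```python
import spacy

# Option 1: a completely blank German pipeline
nlp_blank = spacy.blank("de")
nlp_blank.to_disk("./de_blank_model")

# Option 2: keep the pretrained word vectors from the medium model, but drop
# the pretrained pipeline components (tagger, parser, ner), so the NER weights
# are trained from scratch while still benefiting from the vectors
nlp_vectors = spacy.load("de_core_news_md")
for name in list(nlp_vectors.pipe_names):
    nlp_vectors.remove_pipe(name)
nlp_vectors.to_disk("./de_vectors_model")
```

You can then point ner.batch-train at ./de_blank_model or ./de_vectors_model instead of the packaged German model.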

Regardless of the initialization, to answer your question: there are a few ways you could deal with this, and they’re all potentially valid. The simplest is to hard-code a rule. You can see an example of this here: https://github.com/explosion/spaCy/blob/master/examples/pipeline/fix_space_entities.py . The code in that example implements a rule that prevents space tokens from being tagged as entities. All you have to do to adapt it to your situation is change the conditional on line 19 to something like if token.text == "ein".
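To make that concrete, here’s roughly what the adapted component could look like. This is just a sketch, assuming spaCy v2.x and the small German model (swap in whatever model you’re actually using), and the example sentence is made up:

```python
import spacy
from spacy.attrs import ENT_IOB


def fix_ein_entities(doc):
    # Preset the entity IOB tag for "ein" to "O" before the NER runs,
    # so the statistical model can't label it as part of an entity
    ent_iobs = doc.to_array([ENT_IOB])
    for i, token in enumerate(doc):
        if token.text == "ein":
            ent_iobs[i] = 2  # 2 is the "O" tag (0 is unset, "I" is 1)
    doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
    return doc


nlp = spacy.load("de_core_news_sm")
nlp.add_pipe(fix_ein_entities, name="fix_ein", before="ner")
doc = nlp("Gestern kauften wir im Supermarkt ein.")
print(doc.ents)  # "ein" can no longer show up as an ORG
```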

If you’d rather have the behaviour integrated into the weights, then you can indeed just add negative examples. I think the problem will probably resolve itself as you add more data, so I wouldn’t personally worry about it for now. The rule-based approach will probably be quicker, and it lets you implement other ad-hoc fixes as well. Either way, having a better prediction model will let you annotate faster, as you’ll be able to use ner.make-gold to take the model’s predictions as a starting point.
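On the database question: you can use Prodigy’s Python API to pull the raw annotations and look for conflicting labels. A quick sketch, assuming your dataset is called ner_ein (substitute the real name):

```python
from prodigy.components.db import connect

db = connect()  # uses the settings from your prodigy.json
examples = db.get_dataset("ner_ein")
for eg in examples:
    for span in eg.get("spans", []):
        # Print every annotation whose span covers exactly the word "ein"
        if eg["text"][span["start"]:span["end"]] == "ein":
            print(eg["answer"], span["label"], eg["text"])
```

That should let you see whether you’ve accepted and rejected conflicting ORG spans over the same word.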