Punctuation (dot) breaks entity prediction in two

After annotating a dataset containing ~2000 labels of each of two classes, I tried to train the fr_core_news_lg model using ner train, and then ner.correct to see the results. An average label contains about 10 tokens with regular words, numbers, and abbreviations. Many labels contain dot-terminated word abbreviations, e.g. 'Bull.' for 'bulletin' or 'Com.' for 'commerce'. It appears the model predictions consistently stop at the abbreviation's dot (dot included), or sometimes break into two parts, before and after the dot. On the other hand, commas don't break the predictions.
It looks to me like the model fr_core_news_lg, being pretrained on a large text corpus, has modeled the language such that a dot signifies the end of an information span.
Could you advise on the possible reasons why, and on a better way to fix it?
Just to share some ideas, I considered a possible symbolic fix: removing the dots from all abbreviations. Another possible fix is to pretrain the model on target texts containing abbreviations, using a general language-modelling task like approximate word-vector prediction.
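For illustration, the symbolic fix would be something along these lines (a rough sketch; the abbreviation list and example text are made up):

```python
import re

# Rough sketch of the "remove dots from abbreviations" idea.
# The abbreviation list here is only an illustration.
ABBREVIATIONS = ["Bull.", "Com."]
pattern = re.compile(r"\b(" + "|".join(re.escape(a[:-1]) for a in ABBREVIATIONS) + r")\.")

def strip_abbrev_dots(text: str) -> str:
    """Drop the trailing dot from known abbreviations."""
    return pattern.sub(r"\1", text)

print(strip_abbrev_dots("Bull. mensuel de la Com. de Paris"))
# -> 'Bull mensuel de la Com de Paris'
```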

One possibility is that the model is actually predicting a sentence boundary, which the entity recogniser is constrained not to cross. You could resolve this by reordering the pipeline so that the NER comes before the dependency parser, which means no sentence boundaries are set yet when the entities are predicted.
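If you want to check that hypothesis first, and try the simplest variant of the fix (disabling the parser rather than reordering it), a minimal sketch would look something like this (assuming spaCy v3 and the standard fr_core_news_lg component names; the example text is just an illustration):

```python
import spacy

# Load the pretrained French pipeline (assumes fr_core_news_lg is installed).
nlp = spacy.load("fr_core_news_lg")
print(nlp.pipe_names)

# Check whether the parser places a sentence boundary right after the
# abbreviation dot.
doc = nlp("Bull. mensuel de la Com. de Paris")
print([(token.text, token.is_sent_start) for token in doc])

# Simplest workaround: keep the parser from running, so no sentence
# boundaries are set before the NER predicts entities.
nlp_no_parser = spacy.load("fr_core_news_lg", disable=["parser"])
doc = nlp_no_parser("Bull. mensuel de la Com. de Paris")
print([(ent.text, ent.label_) for ent in doc.ents])
```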

You could also just start from a blank NER model, rather than starting from one that's pretrained. It might perform better on your problem.
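A rough sketch of that, assuming spaCy v3 (the label names are placeholders for your two classes):

```python
import spacy

# Start from a blank French pipeline with a fresh, untrained NER component,
# instead of fine-tuning the pretrained fr_core_news_lg weights.
nlp = spacy.blank("fr")
ner = nlp.add_pipe("ner")

# Placeholder label names; use your own two classes here.
ner.add_label("CLASS_A")
ner.add_label("CLASS_B")

print(nlp.pipe_names)  # ['ner']
```

Depending on your Prodigy version, you may also be able to pass a blank base model (e.g. blank:fr) straight to the train recipe instead of building the pipeline yourself.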


Thanks, it's a good quick solution.