After annotating a dataset containing ~2000 labels of each of two classes, I tried to train the `fr_core_news_lg` model using `ner train` and then `ner.correct` to see the results. An average label contains about 10 tokens with regular words, numbers, and abbreviations. Many labels contain dot-terminated word abbreviations, e.g. 'Bull.' for 'bulletin' or 'Com.' for 'commerce'. The model's predictions consistently stop at the abbreviation's dot (dot included), or sometimes break into two parts, before and after the dot. Commas, on the other hand, don't break the predictions.
It looks to me like the `fr_core_news_lg` model, being pretrained on a large text corpus, has modeled the language such that a dot signals the end of an information span. Could you advise on the possible reasons, and on a good way to fix this?
Just to share some ideas, I considered a possible symbolic fix: removing the dots from all abbreviations. Another possible fix would be to pretrain the model on the target texts with abbreviations, using a general language-modeling objective such as approximate word-vector prediction.
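For the first idea, something like this minimal sketch is what I have in mind (the regex pattern and the sample text are only illustrations and would need tuning for the real abbreviations):

```python
import re

# Illustrative only: strip the trailing dot from short capitalized
# abbreviations such as 'Bull.' or 'Com.' before annotation/training.
ABBREV_DOT = re.compile(r"\b([A-Z][a-z]{1,4})\.(?=\s|$)")

def strip_abbrev_dots(text: str) -> str:
    return ABBREV_DOT.sub(r"\1", text)

print(strip_abbrev_dots("Bull. de la Soc. de Com. de Paris"))
# Bull de la Soc de Com de Paris
```

The obvious downside is that it changes character offsets, so any annotations already made on the original texts would have to be remapped.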
One possibility is that the model is actually predicting a sentence boundary at the abbreviation's dot, and the entity recogniser is constrained not to predict entities that cross sentence boundaries. You could resolve this by reordering the pipeline so that the NER runs before the dependency parser; that way, the sentence boundaries aren't set yet when the entities are predicted.
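A minimal sketch of both the check and the reordering, assuming spaCy v2 (where `nlp.add_pipe` accepts a component instance; in v3 you would re-add the component with `nlp.add_pipe("ner", source=..., before="parser")` instead), with an illustrative sample text:

```python
import spacy

nlp = spacy.load("fr_core_news_lg")
print(nlp.pipe_names)  # the parser runs before the ner by default

# Diagnostic: is a sentence boundary predicted right after the dot?
doc = nlp("Bull. de la Soc. de Com. de Paris")
print([(token.text, token.is_sent_start) for token in doc])

# Move the trained NER component in front of the parser, so entities
# are predicted before any sentence boundaries are set
ner = nlp.get_pipe("ner")
nlp.remove_pipe("ner")
nlp.add_pipe(ner, name="ner", before="parser")
print(nlp.pipe_names)
```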
You could also just start from a blank NER model, rather than starting from one that's pretrained. It might perform better on your problem.
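A blank starting point could look like this minimal sketch (again assuming spaCy v2; the label names are placeholders for your two classes):

```python
import spacy

# Blank French pipeline with a fresh, untrained NER component
nlp = spacy.blank("fr")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Placeholder names standing in for your two actual classes
for label in ("LABEL_A", "LABEL_B"):
    ner.add_label(label)

nlp.begin_training()  # initialise weights before training
```

With Prodigy, passing `blank:fr` as the base model to the `train` recipe should give you the same starting point.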
Thanks, that's a good quick solution.