False positives in spaCy NER

sdevmichael · November 3, 2019, 7:27am

I have trained a spaCy NER model with the following training data:
TRAIN_DATA = [
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),
("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]}),

]

When I tested it with the sentence "CPIMG is not a technology", "CPIMG" was detected as LIBOR_WRD. What made spaCy detect this as LIBOR_WRD? Neither the context nor the neighbouring words are the same as in the training data. The only thing in common is that both words are in all capitals. How can I avoid this problem?

honnibal · November 7, 2019, 11:47am

You can avoid the issue by annotating more data.

The problem here is that you haven't really provided any negative examples; you've just repeated the same positive example over and over. The model can make any number of generalisations that are compatible with your data. All of the following theories match the examples you've given it:

Every word at the start of a sentence is a LIBOR_WRD
Every word beginning with an L is a LIBOR_WRD
Every word ending in BOR is a LIBOR_WRD
Every word before Interest is a LIBOR_WRD
Every word two words before a word ending in tes is a LIBOR_WRD

And so on. Some training tricks such as dropout can help the model generalise slightly, but the model has so little evidence to work with that it's unlikely to come up with a policy that's useful for what you want to do.
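To make the point about competing theories concrete, here is a minimal sketch in plain Python (no spaCy involved). It writes each theory as a hand-coded rule and checks it against the single training example and against the test sentence from the question. The names `RULES`, `predict`, `consistent_rules` and `false_positive_rules` are hypothetical, and the "all capitals" rule comes from the question rather than from the list above. This is not how the statistical model represents anything internally; it only illustrates that the data cannot tell these rules apart.

```python
import re

# One training example, in the (text, annotations) format used in the question.
TRAIN_EXAMPLE = ("LIBOR Interest rates", {"entities": [(0, 5, "LIBOR_WRD")]})


def tokens_with_offsets(text):
    """Split on whitespace, keeping each token's character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# Hypothetical candidate rules. Each receives the token list and an index
# and returns True if the token at that index should be tagged LIBOR_WRD.
RULES = {
    "first word of the sentence": lambda toks, i: i == 0,
    "begins with L": lambda toks, i: toks[i][0].startswith("L"),
    "ends in BOR": lambda toks, i: toks[i][0].endswith("BOR"),
    "word before 'Interest'": lambda toks, i: (
        i + 1 < len(toks) and toks[i + 1][0] == "Interest"
    ),
    "two words before a word ending in 'tes'": lambda toks, i: (
        i + 2 < len(toks) and toks[i + 2][0].endswith("tes")
    ),
    # Not in the list above: the pattern the question suspects.
    "all capitals": lambda toks, i: toks[i][0].isupper(),
}


def predict(rule, text):
    """Apply one rule to every token and return (start, end, label) spans."""
    toks = tokens_with_offsets(text)
    return [(s, e, "LIBOR_WRD") for i, (_, s, e) in enumerate(toks) if rule(toks, i)]


def consistent_rules(example):
    """Names of the rules that reproduce the gold annotation exactly."""
    text, annotations = example
    return [n for n, r in RULES.items() if predict(r, text) == annotations["entities"]]


def false_positive_rules(text):
    """Names of the rules that tag anything in a text with no entities."""
    return [n for n, r in RULES.items() if predict(r, text)]


if __name__ == "__main__":
    print(consistent_rules(TRAIN_EXAMPLE))
    print(false_positive_rules("CPIMG is not a technology"))
```

Every rule reproduces the training annotation, so repeating that example 42 times gives no reason to prefer one over another. Two of them ("first word of the sentence" and "all capitals") also tag "CPIMG" in the test sentence, which is the false positive reported in the question.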

If you really just want to recognize the word LIBOR, you can use a patterns dictionary. If you need to infer a more subtle behaviour, you need to give the model enough evidence to work with, by annotating more examples with Prodigy.
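As a rough illustration of the patterns-dictionary idea, the sketch below does exact, case-sensitive whole-word lookup in plain Python and returns spans in the same `(start, end, label)` shape as the training data. It is a stand-in for the concept, not spaCy's rule-based matching API, and `PATTERNS`, `find_entities` and `annotate` are hypothetical names chosen for this example.

```python
import re

# Hypothetical patterns dictionary: surface form -> entity label.
PATTERNS = {"LIBOR": "LIBOR_WRD"}


def find_entities(text, patterns=PATTERNS):
    """Return sorted (start, end, label) spans for exact whole-word matches."""
    spans = []
    for term, label in patterns.items():
        # \b keeps the match from firing inside a longer word.
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def annotate(texts):
    """Build (text, {"entities": [...]}) tuples in the TRAIN_DATA format."""
    return [(text, {"entities": find_entities(text)}) for text in texts]


if __name__ == "__main__":
    for item in annotate(["LIBOR Interest rates", "CPIMG is not a technology"]):
        print(item)
```

Because the lookup is exact, "CPIMG" is never tagged. Sentences without a match come out with an empty entity list, which may also be useful as a starting point for the negative examples the training data is missing, provided the output is reviewed by hand.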
