Problem with new entity type and patterns

creangel · January 4, 2019, 7:22pm

Hi,
I want to train a new entity that recognizes Colombia’s tourist locations, but I have a problem with names that have more than one word (for example: “rio verde de los montes”).

My patterns file looks like this:

    {"label":"COL","pattern":[{"lower":"cerro verdugo"}]}
    {"label":"COL","pattern":[{"lower":"verdinal"}]}
    {"label":"COL","pattern":[{"lower":"rio verdiyaco"}]}
    {"label":"COL","pattern":[{"lower":"rio verde del sinu"}]}
    {"label":"COL","pattern":[{"lower":"rio verde de los montes"}]}

ines · January 4, 2019, 11:21pm

Hi! From looking at the patterns, I can see one potential problem:

    {"label":"COL","pattern":[{"lower":"rio verdiyaco"}]}

Patterns are token-based and each dictionary represents one token. So in the example above, the matcher will be looking for one token whose lowercase text matches "rio verdiyaco", which will never be true, because spaCy will split the string into two tokens: "rio" and "verdiyaco".

Instead, you probably want to do something like this:

    {"label":"COL","pattern":[{"lower":"rio"}, {"lower":"verdiyaco"}]}
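
If you have a long list of names, rewriting every pattern by hand gets tedious. Here is a minimal sketch of how the token-based patterns could be generated instead. It assumes that splitting on whitespace gives the same tokens spaCy would produce, which should hold for plain space-separated words like the ones above but not for names with punctuation or hyphens (for those, run the text through the tokenizer). `phrase_to_pattern` is a hypothetical helper written for this example, not part of spaCy or Prodigy.

```python
import json


def phrase_to_pattern(label, phrase):
    """Build a pattern with one {"lower": ...} dict per whitespace-separated word.

    Assumption: whitespace splitting matches spaCy's tokenization for this phrase.
    """
    tokens = [{"lower": word} for word in phrase.lower().split()]
    return {"label": label, "pattern": tokens}


# The location names from the original patterns file.
phrases = [
    "cerro verdugo",
    "verdinal",
    "rio verdiyaco",
    "rio verde del sinu",
    "rio verde de los montes",
]

# One JSON object per line, the same layout as the patterns file above.
lines = [json.dumps(phrase_to_pattern("COL", phrase)) for phrase in phrases]
print("\n".join(lines))
```

Each printed line can be saved to the patterns file. A single-word name like "verdinal" still ends up as a one-token pattern, so it behaves the same as before.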

Here's another thread that might be helpful for your project, too. My comment there goes into more detail about patterns, with tips for debugging them and using them effectively to get over the cold start problem.

creangel · January 8, 2019, 1:21pm

It works… Thanks for the response.
