Problem with new entity type and patterns

Hi,
I want to train a new entity type that recognizes Colombia’s tourist locations, but I have a problem with names that have more than one word (for example: “rio verde de los montes”).

My patterns file looks like this:

    {"label":"COL","pattern":[{"lower":"cerro verdugo"}]}
    {"label":"COL","pattern":[{"lower":"verdinal"}]}
    {"label":"COL","pattern":[{"lower":"rio verdiyaco"}]}
    {"label":"COL","pattern":[{"lower":"rio verde del sinu"}]}
    {"label":"COL","pattern":[{"lower":"rio verde de los montes"}]}

Hi! From looking at the patterns, I can see one potential problem:

    {"label":"COL","pattern":[{"lower":"rio verdiyaco"}]}

Patterns are token-based and each dictionary represents one token. So in the example above, the matcher will be looking for one token whose lowercase text matches "rio verdiyaco", which will never be true, because spaCy will split the string into two tokens: "rio" and "verdiyaco".

Instead, you probably want to do something like this:

    {"label":"COL","pattern":[{"lower":"rio"},{"lower":"verdiyaco"}]}
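If the patterns file is long, rewriting every multi-word entry by hand gets tedious. Here is a rough sketch of a converter that splits each single-dictionary "lower" pattern into one dictionary per word. The helper names `split_pattern` and `convert_lines` are made up for this example, and splitting on whitespace is only an assumption: it should match how spaCy tokenizes simple names like the ones above, but names containing punctuation (hyphens, apostrophes, periods) may be tokenized differently, so check those against the real tokenizer.

```python
import json


def split_pattern(entry):
    """Return a copy of a pattern entry with one token dict per word.

    Assumption: whitespace splitting approximates the tokenizer for
    plain multi-word names. Token dicts that use anything other than
    a lone "lower" key are left unchanged.
    """
    tokens = []
    for token in entry["pattern"]:
        if set(token) == {"lower"}:
            # "rio verdiyaco" becomes {"lower": "rio"}, {"lower": "verdiyaco"}
            tokens.extend({"lower": word} for word in token["lower"].split())
        else:
            tokens.append(token)
    return {"label": entry["label"], "pattern": tokens}


def convert_lines(lines):
    """Convert JSONL pattern lines, skipping blank lines."""
    return [
        json.dumps(split_pattern(json.loads(line)))
        for line in lines
        if line.strip()
    ]


# Two entries taken from the patterns file in the question.
example = [
    '{"label":"COL","pattern":[{"lower":"verdinal"}]}',
    '{"label":"COL","pattern":[{"lower":"rio verde de los montes"}]}',
]
for converted in convert_lines(example):
    print(converted)
```

Single-word entries such as "verdinal" pass through unchanged, so it should be safe to run this over the whole file and write the result to a new patterns file.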

Here's another thread that might be helpful for your project, too. My comment there goes into more detail about patterns, with tips for debugging them and using them effectively to get over the cold start problem.

It works… Thanks for the response.
