homonym words AND NER

robertto · February 17, 2020, 2:37pm

Hi again,

I have a question.

As you know, I have trained a custom NER model. My annotations use the following entities:

LONG: Longitude in Different Formats
PARA: Numerical Parameters
ASTR: Astronomical Names
DATE: Date in Different Formats
TIME: Time in Different Formats
STAR: Names of Stars
PLAN: Planet's Names
NAME: Names of People and Places
GEOM: Geometric Shapes

I want to add a new entity, latitude:

LATI: Related phrases to latitude

The problem is that the entity "ASTR" already contains the words "south" and "north", which are also part of latitude expressions, like this:

As you can see in this sentence, the model cannot recognize "17 25 south" as a latitude, because it labels "south" as "ASTR". I have many more ASTR examples than LATI examples; maybe that is the reason. What do you recommend?

One way is to exclude "south" and "north" from "ASTR", but that is neither a smart nor an optimal solution.

I don't know how the model handles homonyms. I feel that solution is not good, because even afterwards I would still have a problem recognizing latitude: I have many more longitudes in the corpus, and the model may fail to recognize LATI because it looks so similar to LONG.

On the other hand, I feel this confusion should not happen, since BERT is a contextual model.

Any ideas?

robertto · February 17, 2020, 5:07pm

I would like to know what your solution to this problem would be. Imagine I want to capture these entities:

-longitude
-latitude

A clarifying example:

So a longitude normally starts with 2 or 3 digits and finishes with a zodiac name (Virgo, for example),
and a latitude starts with 2 or 3 digits and finishes with "south" or "north".
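
The rule described above (leading digits, then a zodiac name for longitude or a hemisphere word for latitude) can be sketched as a small function. This is only an illustration under assumptions: the function name `coordinate_kind`, the zodiac list, and the hemisphere word list are hypothetical and not taken from the original setup, and the leading-digit check allows 1 to 3 digits, as the regexes in this thread do.

```python
import re

# Assumption: the twelve zodiac sign names mark a longitude.
ZODIAC = ("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra",
          "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")
# Assumption: these hemisphere words mark a latitude.
HEMISPHERE = ("north", "northern", "south", "southern")

ENDS_WITH_ZODIAC = re.compile(r"\b(?:%s)$" % "|".join(ZODIAC))
ENDS_WITH_HEMISPHERE = re.compile(r"\b(?:%s)$" % "|".join(HEMISPHERE), re.IGNORECASE)
STARTS_WITH_DIGITS = re.compile(r"^\d{1,3}")


def coordinate_kind(phrase):
    """Return "LONG", "LATI", or None for a candidate phrase."""
    phrase = phrase.strip()
    if not STARTS_WITH_DIGITS.match(phrase):
        return None  # neither kind starts without digits
    if ENDS_WITH_ZODIAC.search(phrase):
        return "LONG"
    if ENDS_WITH_HEMISPHERE.search(phrase):
        return "LATI"
    return None


print(coordinate_kind("12° 38’ Pisces"))
print(coordinate_kind("17 25 south"))
```

Because the decision depends only on the final word, "south" inside a phrase that starts with digits is treated as part of a latitude, while "south" elsewhere is left alone.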

I wrote a separate set of regexes for each one:

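import re

# Latitude patterns; the first alternative is a hard-coded edge case.
# All parts are raw strings, and "|" is removed from the character classes,
# where it would match a literal pipe rather than act as alternation.
regex_patterns = [
                  re.compile(r"6° 6¼’ south,19"
                              r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?[\"”“]\s?(southern|south|north)"
                              r"|\d{1,3}\s?['’°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?['’\"”“]\s?(southern|south|north)")
]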

# Longitude patterns; the first line lists hard-coded edge cases.
# All parts are raw strings, and "|" is removed from the character classes,
# where it would match a literal pipe rather than act as alternation.
regex_patterns = [
                  re.compile(r"28° 2½’,4|4° 32⅙’22|4° 2½’—supposing|3° 59½’,so|22° 21’ 31“Aquarius|8° 36’ 50” Gemini2|87° 16' 30\"8|16° 43' Taurus10|0° 3' Leo26|11° 52',2|12° 38’ Pisces12"
                              r"|\d{1,3}\s?['’°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?['’\"”“]\s?\s?[A-Z][a-z.]+(,\d{1,2})?"
                              r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]\s?[A-Z][a-z.]+(,\d{1,2})?"
                              r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]-?(,\d{1,2})?"
                              r"|\d{1,3}\s?['’°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?['’\"”“](,\d{2})?"
                              r"|\d{1,3}\s?[\u00BC-\u00BE\u2150-\u215E]?°\s?[A-Z][a-z.]+(,\d{1,2})?")
]

The first line of each pattern lists hard-coded edge cases!

I then used this pre-annotated data for training, but since there are far fewer latitudes, the result is not correct, as you can see: latitude entities are not recognized as latitude (they are recognized as LONG).
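
For reference, regex-based pre-annotation of this kind can be sketched as below. It is a minimal illustration, not the code actually used: the two patterns are deliberately simplified stand-ins for the ones above, the name `pre_annotate` is hypothetical, and the output layout (a `text` field plus `spans` with `start`, `end`, and `label`) is an assumption about what the annotation tool expects.

```python
import re

# Simplified, hypothetical patterns: degrees, minutes with an optional
# vulgar fraction, then the word that decides the label.
LABELLED_PATTERNS = [
    ("LATI", re.compile(
        r"\d{1,3}\s?°\s?\d{1,2}[\u00BC-\u00BE\u2150-\u215E]?\s?['’]?\s?"
        r"(?:southern|south|north)")),
    ("LONG", re.compile(
        r"\d{1,3}\s?°\s?\d{1,2}[\u00BC-\u00BE\u2150-\u215E]?\s?['’]?\s?"
        r"[A-Z][a-z]+")),
]


def pre_annotate(text):
    """Return the text together with non-overlapping labelled spans."""
    spans = []
    for label, pattern in LABELLED_PATTERNS:
        for match in pattern.finditer(text):
            start, end = match.span()
            # Skip a match that overlaps a span found by an earlier pattern.
            if any(start < s["end"] and end > s["start"] for s in spans):
                continue
            spans.append({"start": start, "end": end, "label": label})
    spans.sort(key=lambda s: s["start"])
    return {"text": text, "spans": spans}


example = pre_annotate("Mars at 12° 38’ Pisces, latitude 6° 6¼’ south.")
for span in example["spans"]:
    print(span["label"], example["text"][span["start"]:span["end"]])
```

The latitude pattern is tried first, so a phrase ending in a hemisphere word is claimed as LATI before the more general longitude pattern can reach it.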

What would be your suggestion?

I had a look at the latitude model and saw that it does not work.

1- Which approach do you suggest?
2- How many examples of an entity do we need to train a model? (For instance, I only have 60 for latitude, versus 1096 for LONG. I think that is not enough, correct?)
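
To put a number on the imbalance, the labels in the annotated data can simply be counted. The sketch below assumes each example is a dict with a `spans` list whose items carry a `label` key; that layout and the name `label_shares` are assumptions for illustration. With the counts quoted above (60 LATI versus 1096 LONG), LATI makes up roughly 5% of those two labels combined.

```python
from collections import Counter


def label_shares(examples):
    """Count entity labels and return (label, count, share) rows, largest first."""
    counts = Counter(
        span["label"] for ex in examples for span in ex.get("spans", [])
    )
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Tiny made-up data set, only to show the call.
toy = [
    {"text": "a", "spans": [{"label": "LONG"}, {"label": "LONG"}]},
    {"text": "b", "spans": [{"label": "LONG"}, {"label": "LATI"}]},
]
for label, count, share in label_shares(toy):
    print(f"{label}: {count} ({share:.1%})")
```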

One possible solution is to merge both entities into a single "coordinate" entity. Do you have a better idea?

robertto · February 18, 2020, 9:27am

I have merged the entities "longitude" and "latitude" into "coordinate", labelled COOR,
as you can see here:

My reason was that the number of latitudes was limited: basically 27 (0.3%).
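
A merge like this can be done by rewriting the labels in the existing annotations. The following is a minimal sketch, assuming the same span layout as above (dicts with a `label` key); `MERGE_MAP` and `merge_labels` are hypothetical names, not part of any library.

```python
# Assumption: LONG and LATI are the two labels to be merged into COOR.
MERGE_MAP = {"LONG": "COOR", "LATI": "COOR"}


def merge_labels(example, mapping=MERGE_MAP):
    """Return a copy of the example with span labels rewritten via mapping."""
    merged = dict(example)
    merged["spans"] = [
        # Labels missing from the mapping are kept unchanged.
        {**span, "label": mapping.get(span["label"], span["label"])}
        for span in example.get("spans", [])
    ]
    return merged


original = {
    "text": "12° 38’ Pisces on 3 May",
    "spans": [
        {"start": 0, "end": 14, "label": "LONG"},
        {"start": 18, "end": 23, "label": "DATE"},
    ],
}
print([s["label"] for s in merge_labels(original)["spans"]])
```

The function builds new span dicts instead of editing them in place, so the original LONG/LATI annotations stay available if the two entities are separated again later.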

So I have this:

Do you have any other ideas? How many samples are needed to train an entity?

I can provide around 70 latitude samples with different constructions, like:

1° 49⅔’ north
1° 49 south
1° 53½’ N.
0° 1' 36" S.

But I am not sure it would work. Do you think it makes sense to have both latitude and longitude?
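
Before pre-annotating with such samples, it may help to check that a pattern covers every construction listed above; the hemisphere alternation in the latitude regex earlier in this thread only lists southern, south, and north, so the abbreviated "N." and "S." forms would be missed. The pattern below is a hypothetical sketch written for the four sample forms only, and `uncovered` is an illustrative helper name.

```python
import re

# Hypothetical latitude pattern: degrees, optional minutes (fraction and
# minute mark both optional), optional seconds, then a hemisphere word
# or its abbreviation.
LATITUDE = re.compile(
    r"\d{1,3}\s?°"
    r"(?:\s?\d{1,2}[\u00BC-\u00BE\u2150-\u215E]?\s?['’]?)?"
    r"(?:\s?\d{1,2}\s?[\"”“])?"
    r"\s?(?:northern|southern|north|south|N\.|S\.)"
)

SAMPLES = ["1° 49⅔’ north", "1° 49 south", "1° 53½’ N.", "0° 1' 36\" S."]


def uncovered(samples, pattern=LATITUDE):
    """Return the samples that the pattern does not match in full."""
    return [s for s in samples if not pattern.fullmatch(s)]


print(uncovered(SAMPLES))
```

An empty result means every sample is matched in full; any string that comes back points to a construction the pattern still needs to handle.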
