Homonym words and NER

Hi again,

I have a question.

As you know, I have trained a custom NER model. My entities include the following.

Annotations refer to:
    • LONG: Longitude in Different Formats
    • PARA: Numerical Parameters
    • ASTR: Astronomical Names
    • DATE: Date in Different Formats
    • TIME: Time in Different Formats
    • STAR: Names of Stars
    • PLAN: Names of Planets
    • NAME: Names of People and Places
    • GEOM: Geometric Shapes

    I want to add a new entity for latitude:

  • LATI: Phrases related to latitude
  • (these are numbers that end with "south" or "north")

    The problem is that the entity "ASTR" already contains the words "south" and "north",
    which are also part of a latitude, like this:

    As you see in this sentence, the model cannot recognize "17 25 south" as a latitude, since it tagged "south" as "ASTR". I have many more ASTR entities than LATI entities, so maybe that is the reason. What do you recommend?
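To make the conflict concrete, here is a minimal sketch, assuming the candidate annotations are stored as (start, end, label) character spans. One common way to resolve such overlaps is to keep the longest span, so that "17 25 south" as LATI wins over "south" as ASTR. The helper name `resolve_overlaps` and the span format are assumptions for illustration, not part of the pipeline described here.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end is exclusive


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Greedily keep the longest span among overlapping candidates."""
    # Longest spans first; ties are broken by the earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, label in ordered:
        # Accept the span only if it overlaps none of the spans kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


text = "the latitude was 17 25 south"
lati_start = text.index("17 25 south")
astr_start = text.index("south")
candidates = [
    (astr_start, astr_start + len("south"), "ASTR"),
    (lati_start, lati_start + len("17 25 south"), "LATI"),
]
resolved = resolve_overlaps(candidates)
```

With this rule, only the LATI span survives for the example sentence. It only affects how the training annotations are built; whether the trained model then separates the two labels still depends on the data.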

    One way is to exclude "south" and "north" from "ASTR", but that is neither a smart nor an optimal solution.

    I have no other idea how to handle homonyms. I feel that this solution is not good, because even afterwards I would still have a problem recognizing latitude: I have many more longitudes in the corpus, and the model may fail to recognize latitude (since it looks very similar to LONG).

    On the other hand, I feel this confusion should not happen at all, since BERT is a contextual model.

    Any ideas?

I want to know what your solution to this problem would be. Imagine I want to capture these entities:

-longitude
-latitude

An example to clarify:

A longitude normally starts with 2 or 3 digits and ends with a zodiac name (Virgo or ...),
and a latitude starts with 2 or 3 digits and ends with "south" or "north".

I wrote two different regexes, one for each entity:

import re

# Latitude: degrees/minutes(/seconds) ending with "southern", "south" or "north".
# Every fragment is a raw string, and "|" is dropped from the character classes
# (inside [...] it matches a literal pipe, it does not mean "or").
latitude_patterns = [
                  re.compile(r"6° 6¼’ south,19"  # edge case
                              r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?[\"”“]\s?(southern|south|north)"
                              r"|\d{1,3}\s?['’°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?['’\"”“]\s?(southern|south|north)")
]

# Longitude: degrees/minutes(/seconds), optionally followed by a zodiac name.
# Named separately so it no longer overwrites the latitude list.
longitude_patterns = [
                  # The first fragment lists edge cases.
                  re.compile(r"28° 2½’,4|4° 32⅙’22|4° 2½’—supposing|3° 59½’,so|22° 21’ 31“Aquarius|8° 36’ 50” Gemini2|87° 16\' 30\"8|16° 43\' Taurus10|0° 3\' Leo26|11° 52\',2|12° 38’ Pisces12"
                              r"|\d{1,3}\s?['’°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?['’\"”“]\s?\s?[A-Z][a-z.]+(,\d{1,2})?"
                              r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]\s?[A-Z][a-z.]+(,\d{1,2})?"
                              r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]-?(,\d{1,2})?"
                              r"|\d{1,3}\s?['’°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?['’\"”“](,\d{2})?"
                              r"|\d{1,3}\s?[\u00BC-\u00BE\u2150-\u215E]?°\s?[A-Z][a-z.]+(,\d{1,2})?")
]
 

The first line of each pattern covers edge cases!
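For illustration, a simplified sketch of how patterns like these could be turned into pre-annotations. The two patterns below are deliberately reduced versions (degrees and minutes only), and `pre_annotate` is a hypothetical helper, so treat this as an assumption about the workflow rather than the actual code. Latitude is tried first so that a trailing "south"/"north" takes priority over the more general longitude pattern.

```python
import re

# Simplified, assumed patterns: degrees, minutes, optional vulgar fraction.
FRACTIONS = r"[\u00BC-\u00BE\u2150-\u215E]"
CORE = rf"\d{{1,3}}\s?°\s?\d{{1,2}}\s?{FRACTIONS}?['’]?"
LATITUDE = re.compile(CORE + r"\s?(?:southern|south|north)\b")
LONGITUDE = re.compile(CORE + r"\s?[A-Z][a-z]+")


def pre_annotate(text):
    """Return sorted (start, end, label) spans; latitude is tried first."""
    spans = []
    for label, pattern in (("LATI", LATITUDE), ("LONG", LONGITUDE)):
        for m in pattern.finditer(text):
            # Skip matches that overlap a span accepted earlier.
            if all(m.end() <= s or m.start() >= e for s, e, _ in spans):
                spans.append((m.start(), m.end(), label))
    return sorted(spans)


sample = "Mars at 12° 38’ Pisces with latitude 1° 49⅔’ north"
annotations = [(sample[s:e], label) for s, e, label in pre_annotate(sample)]
```

Here `annotations` contains one LONG span ("12° 38’ Pisces") and one LATI span ("1° 49⅔’ north").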

I then used this pre-annotated data for training, but since there are far fewer latitudes, the result is not correct, as you see: latitude entities are not recognized as latitude (they are recognized as LONG).

What would be your suggestion?

I had a look at the latitude model and saw that it does not work.

1- Which approach do you suggest?
2- How many entities do we need to train a model? (For instance, for latitude I only have 60, versus 1096 for LONG. I think that is not enough, correct?)
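To put the imbalance in numbers, and as one possible (not guaranteed) mitigation, here is a small sketch that counts entity mentions and oversamples sentences containing the rare label. The toy data, the `(text, spans)` format, and the helpers `count_entities` and `oversample` are assumptions for illustration; only the counts 60 and 1096 come from this post.

```python
import random
from collections import Counter


def count_entities(dataset):
    """Count mentions per label; each item is (text, [(start, end, label), ...])."""
    return Counter(label for _, spans in dataset for _, _, label in spans)


def oversample(dataset, rare_label, factor, seed=0):
    """Repeat each sentence containing `rare_label` so it occurs `factor` times."""
    rng = random.Random(seed)
    rare = [item for item in dataset
            if any(label == rare_label for _, _, label in item[1])]
    augmented = list(dataset) + rare * (factor - 1)
    rng.shuffle(augmented)
    return augmented


# Toy corpus standing in for the real one: 1096 LONG and 60 LATI mentions.
toy = ([("12° 38’ Pisces", [(0, 14, "LONG")])] * 1096
       + [("1° 49 south", [(0, 11, "LATI")])] * 60)

before = count_entities(toy)
after = count_entities(oversample(toy, "LATI", factor=5))
lati_share_before = before["LATI"] / sum(before.values())  # 60 / 1156, about 5%
```

Oversampling only repeats existing examples, so it does not add new constructions and can overfit to the few latitude phrasings available.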

One possible solution is to merge both entities into a single "coordinate" entity. Do you have a better idea?

I have merged the entities "longitude" and "latitude" into "coordinate", denoted by COOR,
as you see here:

My reason was that the number of latitudes was limited: basically 27 (0.3%).
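A minimal sketch of this kind of merge, assuming the annotations are stored as BIO tags such as `B-LATI` and `I-LONG`. The tag format, the `MERGE` mapping, and the helper `merge_labels` are assumptions for illustration.

```python
# Assumed mapping from the original labels to the merged one.
MERGE = {"LONG": "COOR", "LATI": "COOR"}


def merge_labels(tags):
    """Rewrite B-/I- prefixed tags through MERGE; all other tags stay unchanged."""
    merged = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and label in MERGE:
            merged.append(f"{prefix}-{MERGE[label]}")
        else:
            merged.append(tag)  # "O" and unrelated labels such as ASTR
    return merged


tags = ["O", "B-LATI", "I-LATI", "I-LATI", "O", "B-LONG", "I-LONG", "B-ASTR"]
merged_tags = merge_labels(tags)
```

The B-/I- prefixes are preserved, so span boundaries stay the same and only the label changes.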

So I have this:

Do you have any other ideas? How many samples are needed to train an entity?

I can provide around 70 latitude samples with different constructions, like:

1° 49⅔’ north
1° 49 south
1° 53½’ N.
0° 1' 36" S.
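All four constructions end in a direction word or abbreviation, so if the two entities are kept merged, a simple rule could in principle recover the distinction from the recognized span. The sketch below is only an assumption about how that might look: `coordinate_type` is a hypothetical helper, and it labels anything without a north/south suffix as LONG, which is a simplification.

```python
import re

# Assumed rule: a coordinate ending in north/south (or N./S.) is a latitude.
LATITUDE_SUFFIX = re.compile(
    r"(?<![A-Za-z])(?:north(?:ern)?|south(?:ern)?|N\.|S\.)\s*$"
)


def coordinate_type(span_text):
    """Return "LATI" if the span ends with a north/south marker, else "LONG"."""
    return "LATI" if LATITUDE_SUFFIX.search(span_text) else "LONG"


samples = [
    "1° 49⅔’ north",
    "1° 49 south",
    "1° 53½’ N.",
    "0° 1' 36\" S.",
    "12° 38’ Pisces",
]
types = [coordinate_type(s) for s in samples]
```

The four latitude samples come out as LATI and the zodiac example as LONG.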

But I am not sure it would work. Do you think it makes sense to have both latitude and longitude?