Model only recognizes part of the entity in coordinates

Dear All,

I have trained a custom model. It gives a very good classification report, but when I run the model on my text, it sometimes recognizes only part of an entity, as in this example:
"

he factory position was 16° LONG 50’ 30“ they need to find a way to go to the Mars PLAN .
"

It says that only 16 is a coordinate, but the whole sequence of numbers should be a coordinate. Can you let me know why?

Models are statistical and what they predict depends on many factors, including the examples seen during training. Maybe your data didn't contain enough examples of those longer coordinates.

Maybe some of those sequences are also too difficult to learn. For the coordinates that follow a consistent pattern, you might want to try a combination of predictions plus rules to fix the mistakes the model makes. The spaCy docs on models plus rules are a good place to start: https://spacy.io/usage/rule-based-matching#models-rules
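
As a rough illustration of "predictions plus rules", here is a minimal sketch of a custom pipeline component that runs after the model and replaces partial predictions with longer regex matches. It rests on several assumptions: the component name coordinate_rules and the label COORDINATE are made up for the example, the regex is a simplified degrees/minutes/seconds pattern rather than the full one discussed in this thread, and a blank English pipeline with a hand-set entity stands in for the trained model.

```python
import re

import spacy
from spacy.language import Language
from spacy.util import filter_spans

# Simplified degrees/minutes/seconds pattern (an assumption for this sketch)
COORD_RE = re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]")


@Language.component("coordinate_rules")  # hypothetical component name
def coordinate_rules(doc):
    """Add regex-matched coordinates to doc.ents, replacing shorter overlapping entities."""
    rule_spans = []
    for match in COORD_RE.finditer(doc.text):
        # "expand" snaps the character offsets outward to token boundaries
        span = doc.char_span(match.start(), match.end(),
                             label="COORDINATE", alignment_mode="expand")
        if span is not None:
            rule_spans.append(span)
    # filter_spans keeps the longest span wherever spans overlap
    doc.ents = filter_spans(rule_spans + list(doc.ents))
    return doc


# Stand-in for a trained pipeline; with a real one you would call
# nlp.add_pipe("coordinate_rules", after="ner") and then nlp(text).
nlp = spacy.blank("en")
text = "The position was 16° 50’ 30“ according to the log."
doc = nlp.make_doc(text)

# Simulate the partial prediction from the question: only "16" is labeled
start = text.index("16")
doc.ents = [doc.char_span(start, start + 2, label="COORDINATE", alignment_mode="expand")]

doc = coordinate_rules(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the rule spans are listed first and filter_spans prefers longer spans, a full regex match wins over a shorter model prediction that overlaps it, while model predictions elsewhere in the text are left alone.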

Thank you for the prompt response. That was also my guess. I have seen the link that you sent, but I am a bit lost about how to start. I have a spaCy model and a raw text. How can I add, for example:

regex_patterns = [
    # Coordinate formats
    re.compile(
        r"\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]\s?[A-Z][a-z.]+"
        r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\s?[A-Z][a-z.]+"
        r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]"
        r"|\d{1,3}\s?°\s?\d{1,2}\s?['’]"
        r"|\d{1,3}\s?°\s?[A-Z][a-z.]+"
    )
]

this pattern to my model? However, the whole process seems a bit weird to me, since I created my annotations with regex in the first place.

The link to the documentation I posted above is a good place to start. The page also has a section on matching regular expressions: https://spacy.io/usage/rule-based-matching#regex-text
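
One detail that is easy to trip over when matching a regex on Doc.text: Doc.char_span returns None if the match does not line up with token boundaries, so such matches get skipped silently. The sketch below shows this on a made-up input where the coordinate is glued to a preceding word. The input string, the simplified regex, and the COORDINATE label are all assumptions for illustration, and a blank English pipeline stands in for the trained one.

```python
import re

import spacy

nlp = spacy.blank("en")  # stand-in tokenizer; a trained pipeline is used the same way
# Hypothetical input where the coordinate is glued to the preceding word
doc = nlp.make_doc("lat16° 50’ 30“")

match = re.search(r"\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]", doc.text)
start, end = match.span()

# Default alignment_mode="strict": None, because the match starts inside a token
strict_span = doc.char_span(start, end, label="COORDINATE")

# "expand" widens the span to the enclosing token boundaries instead
expanded_span = doc.char_span(start, end, label="COORDINATE", alignment_mode="expand")

print(strict_span)
print(expanded_span.text)
```

Whether expanding is the right choice depends on the data: it avoids dropping matches, but the resulting span can include extra characters such as the "lat" prefix here.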

I have done the following:

import re
import spacy

# Assumes nlp (the trained pipeline) and str1 (the raw text) are already defined
doc = nlp(str1)
"""
regex_patterns = [
                  re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]\s?[A-Z][a-z.]+"   #Coordinate in format 
                              "|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\s?[A-Z][a-z.]+"
                              "|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]"
                              "|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]"
                              "|\d{1,3}\s?°\s?[A-Z][a-z.]+")]
"""
expression = r"\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]\s?[A-Z][a-z.]+"

for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

This finds some of the missing entities, but I do not know how to add them to the model (or, as a weaker alternative, how to run this over text data and save the result).

Many thanks again