Model only recognizes part of the entity in coordinates

robertto · August 29, 2019, 1:21pm

Dear All,

I have trained a custom model. It gives a very good classification report, but when I use the model on my text, it sometimes recognizes only part of an entity, like here:
"

he factory position was 16° LONG 50’ 30“ they need to find a way to go to the Mars PLAN .
"

It says only 16 is a coordinate, but actually the whole sequence of numbers should be a coordinate. Can you let me know why?

ines · August 29, 2019, 1:43pm

Models are statistical and what they predict depends on many factors, including the examples seen during training. Maybe your data didn't contain enough examples of those longer coordinates.

Maybe some of those sequences are also too difficult to learn. For the coordinates that follow a consistent pattern, you might want to try combining the model's predictions with rules to fix the mistakes the model makes. The spaCy docs on models plus rules are a good place to start: https://spacy.io/usage/rule-based-matching#models-rules

robertto · August 29, 2019, 1:57pm

Thank you for the prompt response. That was also my guess. I have seen the link that you sent, but I am a bit lost on how to start. I have a spaCy model and raw text. How can I add, for example:

regex_patterns = [
                  re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]\s?[A-Z][a-z.]+"   # Coordinate in format
                             r"|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\s?[A-Z][a-z.]+"
                             r"|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]"
                             r"|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]"
                             r"|\d{1,3}\s?°\s?[A-Z][a-z.]+")]

this pattern to my model? However, the whole process is a bit weird to me, since I created my annotations with a regex in the first place.
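
A side note on the pattern itself, as a minimal sketch using only Python's `re` module: inside a character class such as `[\'|’]`, the `|` is a literal pipe rather than an alternation, so the class also accepts `|` as a minute mark. The names `loose`, `strict` and `sample` are illustrative and not part of the original post. The sketch also runs the degrees-and-minutes pattern on the sample sentence from the first post, where the word LONG sits between the degrees and the minutes.

```python
import re

# Character class as written in the post: "|" inside [...] is a literal pipe.
loose = re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]")
# Same pattern with the pipe removed from the class.
strict = re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?['’]")

print(bool(loose.search("16° 50|")))   # True: the pipe is accepted as a minute mark
print(bool(strict.search("16° 50|")))  # False
print(bool(strict.search("16° 50’")))  # True

# Sample sentence from the first post: "LONG" follows the degree sign,
# so a pattern that expects digits right after "°" finds nothing here.
sample = "he factory position was 16° LONG 50’ 30“ they need to find a way to go to the Mars PLAN ."
print(strict.search(sample))           # None
```

If texts like the sample are common, a rule for them would probably need to allow an optional word between the degrees and the minutes; how exactly depends on the formats that occur in the data.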

ines · August 29, 2019, 2:30pm

The link to the documentation I posted above is a good place to start. The page also has a section on matching regular expressions: https://spacy.io/usage/rule-based-matching#regex-text

robertto · August 29, 2019, 3:22pm

I have done:

import spacy
import re
doc = nlp(str1)  # nlp is the trained pipeline and str1 the raw text, both assumed to be defined beforehand
"""
regex_patterns = [
                  re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]\s?[A-Z][a-z.]+"   #Coordinate in format 
                              "|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\s?[A-Z][a-z.]+"
                              "|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]"
                              "|\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]"
                              "|\d{1,3}\s?°\s?[A-Z][a-z.]+")]
"""
expression = r"\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\"|”|“]\s?[A-Z][a-z.]+"

for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

This finds some of the missing entities, but I do not know how to add this to the model (or maybe, as a weaker version, how to run it over text data and save the result).
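
One possible way to approach both parts of this question, as a rough sketch rather than an official recipe: turn the regex matches into spans with `doc.char_span`, merge them with whatever the model already predicted using `spacy.util.filter_spans` (which keeps the longest span when candidates overlap), and then dump the result as one JSON record per text. The label `COORDINATE`, the helper names `add_regex_ents` and `to_record`, the simplified pattern, and the use of `spacy.blank("en")` in place of the trained model are all assumptions made for illustration.

```python
import json
import re

import spacy
from spacy.util import filter_spans

# Simplified degrees/minutes/seconds pattern (assumption, not the full pattern from the post).
COORD_RE = re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]")
LABEL = "COORDINATE"  # assumption: replace with the label your model uses


def add_regex_ents(doc):
    """Add regex matches to doc.ents, preferring longer spans on overlap."""
    regex_spans = []
    for match in COORD_RE.finditer(doc.text):
        start, end = match.span()
        span = doc.char_span(start, end, label=LABEL)
        if span is not None:  # None if the match does not align with token boundaries
            regex_spans.append(span)
    # filter_spans keeps the longest of any overlapping spans, so a full regex
    # match replaces a shorter model prediction such as "16".
    doc.ents = filter_spans(list(doc.ents) + regex_spans)
    return doc


def to_record(doc):
    """Convert a doc to a dict with character offsets for each entity."""
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    return {"text": doc.text, "spans": spans}


# spacy.blank("en") stands in for the trained pipeline here.
nlp = spacy.blank("en")
doc = add_regex_ents(nlp("The position was 16° 50’ 30” yesterday."))
print(json.dumps(to_record(doc), ensure_ascii=False))
```

Writing one such record per line to a file gives a JSONL file that can be inspected or corrected later. The function can also be registered as a pipeline component after the entity recognizer; the exact call differs between spaCy versions, so the rule-based matching docs linked above are the place to check.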

Many thanks again
