misaligned token

@ines
I would be very thankful if someone could answer this question:

I am trying to add a span such as

 0° 3' 20" S

(where S is short for South) to my recent model as a latitude entity. It occurs, for instance, in this sentence:

We shall now go through the observations1 again, carefully: On 1590 March 4 at 7h 10m, Mars was found by careful observation and calculation to be at 24° 22' 56" Aries with latitude 0° 3' 20" S. At that time, 8° Aries was setting, so Mars was rather low

However, I ran into the "misaligned token" problem. I remember solving this once for another entity that was followed by a comma or a dot (it was something like 100,00, or 100,000.).

I do not know why I am getting the misaligned-token error this time.

I used this pattern:

import re

regex_patterns = [
    re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?[\'|’]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?[\"|”|“]\s? (N|S|south|north)+(,\d{1,2})?"
               r"|\d{1,3}\s?[\'|’|°]\s?\d{1,2}\s?[\u00BC-\u00BE\u2150-\u215E]?[\'|’|\"|”|“]\s?(N|S|south|north)+(,\d{1,2})?")
]

As you can see, the pattern correctly captures the desired string.

I think the problem has something to do with the space before S. Is that correct? How can I capture the correct string without running into the misaligned-token problem?

Best

Hi, the default English tokenizer treats S. as one token (for tokens like middle initials in names). If you want S. to always be two tokens, you can modify the suffix regex or add exceptions. In this case, I think adding exceptions for N/S/E/W might be the easiest approach:

import spacy

nlp = spacy.load("en_core_web_sm")
# Always split "N.", "S.", "E." and "W." into the letter and the period
for l in ["N", "S", "E", "W"]:
    nlp.tokenizer.add_special_case(l + ".", [{"ORTH": l}, {"ORTH": "."}])

You can save this model to disk and then use the path to this directory instead of en_core_web_sm with prodigy or spacy:

nlp.to_disk("/path/to/mod_en_core_web_sm")
nlp = spacy.load("/path/to/mod_en_core_web_sm")
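
To see why the span gets flagged in the first place, here is a minimal sketch in plain Python. It does not use spaCy's tokenizer: the whitespace split in `token_offsets` is only a rough stand-in that happens to keep S. together as one token, `LATITUDE` is a simplified version of the pattern in the question, and the helper names are made up for illustration.

```python
import re

# Simplified, hypothetical version of the latitude pattern from the question.
LATITUDE = re.compile(r"\d{1,3}\s?°\s?\d{1,2}\s?['’]\s?\d{1,2}\s?[\"”“]\s?(?:N|S)")


def token_offsets(text):
    """Rough stand-in for a tokenizer: one token per whitespace-separated chunk.

    This is NOT how spaCy tokenizes; it only mimics "S." staying a single token.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def is_aligned(offsets, start, end):
    """A span is aligned if it starts at a token start and ends at a token end."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


text = "Mars was at 24° 22' 56\" Aries with latitude 0° 3' 20\" S. At that time, 8° Aries was setting."
match = LATITUDE.search(text)
start, end = match.span()
offsets = token_offsets(text)

print(repr(text[start:end]))                # the regex finds the latitude string
print(is_aligned(offsets, start, end))      # False: the span ends inside the token "S."
print(is_aligned(offsets, start, end + 1))  # True: the span ends together with "S."
```

The regex match itself is fine. The span is rejected only because its end offset falls in the middle of a token, which is why splitting S. into two tokens makes the same offsets valid.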

Thank you for your prompt response. If I want to do this in the context of training the model with Prodigy, where should I use it? I understand that for the "misaligned token" check, for instance, I should probably use it here:

import spacy

def misaligned_token(examples):
    counter = 0
    nlp = spacy.load("en_core_web_sm")
    for example in examples:
        doc = nlp(example["text"])
        for span in example["spans"]:
            # char_span returns None when the offsets do not fall on token boundaries
            char_span = doc.char_span(span["start"], span["end"])
            if char_span is None:
                counter += 1
                print("{}- Misaligned tokens-->".format(counter), example["text"], span)

But how can I use the modified model for training?
Should I also change this command:

python -m prodigy ner.batch-train an_ner_date_01 en_core_web_sm --output model_date_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing

That is, everywhere I used en_core_web_sm, should I change it to

mod_en_core_web_sm

If the answer is yes, how can I then call it from the prodigy command? Something like this:

python -m prodigy ner.batch-train data_merged_v15 mod_en_core_web_sm  --output Model_U28 --n-iter 30 --eval-split 0.2 --dropout 0.2 --no-missing

The main difference is that instead of just the model name (en_core_web_sm), you need the path to the model. Relative paths are possible, but to keep things simpler I'd recommend using the full path everywhere, so something like:

python -m prodigy ner.batch-train data_merged_v15 /home/user/models/mod_en_core_web_sm  --output Model_U28 --n-iter 30 --eval-split 0.2 --dropout 0.2 --no-missing

You do need to be careful to use the new model everywhere with this dataset or you might end up with inconsistent annotation.

The tokenizer settings are saved as part of the model when you train with spacy or prodigy, so you can distribute the new model and it will work without any additional customizations.


@adriane

Many thanks for your response. I will try it and then get back to you. By the way, could you have a look at this:

https://support.prodi.gy/t/annotation-advice-custom-ner/2558/2

Many thanks