@ines
I would be very thankful if someone could answer this question.

I am trying to add

0° 3' 20" S

(where S is an abbreviation for South) as a latitude entity to my recent model, for instance in this sentence:

We shall now go through the observations again, carefully: On 1590 March 4 at 7h 10m, Mars was found by careful observation and calculation to be at 24° 22' 56" Aries with latitude 0° 3' 20" S. At that time, 8° Aries was setting, so Mars was rather low.

But I ran into this "misaligned tokens" problem. I remember I solved it once for another entity that was followed by a comma or dot (it was something like 100,00 or 100,000).
Hi, the default English tokenizer treats "S." as one token (to handle tokens like middle initials in names). If you want "S." to always be two tokens, you can modify the suffix regex or add tokenizer exceptions. In this case, I think adding exceptions for N/S/E/W might be the easiest approach:
import spacy

nlp = spacy.load("en_core_web_sm")
for l in ["N", "S", "E", "W"]:
    nlp.tokenizer.add_special_case(l + ".", [{"ORTH": l}, {"ORTH": "."}])
You can save this model to disk and then use the path to this directory instead of en_core_web_sm with prodigy or spacy:
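A minimal round trip might look like the sketch below. The directory name is a placeholder, and a blank English pipeline stands in for en_core_web_sm so the sketch runs without a model download; the customization and save/load steps are the same either way:

```python
import os
import tempfile

import spacy

# Blank English pipeline stands in for en_core_web_sm here; the tokenizer
# customization and save/load steps are identical.
nlp = spacy.blank("en")
for l in ["N", "S", "E", "W"]:
    nlp.tokenizer.add_special_case(l + ".", [{"ORTH": l}, {"ORTH": "."}])

# Save the customized pipeline, then load it back by path. This path is
# what you would pass instead of "en_core_web_sm".
model_dir = os.path.join(tempfile.mkdtemp(), "custom_en_model")
nlp.to_disk(model_dir)
nlp2 = spacy.load(model_dir)

# "S." now tokenizes as two tokens, so spans ending on "S" can align.
print([t.text for t in nlp2("0° 3' 20\" S.")])
```

The special cases are stored with the tokenizer, so they survive the save/load round trip.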
Thank you for your prompt response. If I want to do this in the context of training the model with Prodigy, where should I use it? For instance, I understand that for the "misaligned token" check I should probably use it here:
import spacy

def misaligned_token(examples):
    counter = 0
    nlp = spacy.load("en_core_web_sm")
    for example in examples:
        doc = nlp(example["text"])
        for span in example["spans"]:
            char_span = doc.char_span(span["start"], span["end"])
            if char_span is None:
                counter += 1
                print("{}- Misaligned tokens-->".format(counter), example["text"], span)
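To see why doc.char_span comes back as None for this sentence, here is a self-contained sketch. It uses a blank English pipeline, which shares the default English tokenizer rules, and a made-up text snippet:

```python
import spacy

# Blank pipeline: same default English tokenizer rules as en_core_web_sm.
nlp = spacy.blank("en")
text = "with latitude 0° 3' 20\" S. At that time"
doc = nlp(text)

start = text.index(" S.") + 1            # character offsets of the bare "S"
print(doc.char_span(0, 4))               # "with" lines up with a token boundary
print(doc.char_span(start, start + 1))   # None: "S" sits inside the token "S."
```

A span is only valid if both its start and end fall on token boundaries; since "S." is a single token by default, a span covering just "S" is misaligned.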
But how can I make the model use this for training?
Should I also change this script:
The main difference is that instead of just the model name (en_core_web_sm), you need the path to the model. Relative paths are possible, but to keep things simpler I'd recommend using the full path everywhere, so something like:
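For example, with a Prodigy recipe such as ner.manual, the model argument becomes the full path (the dataset name, path, source file, and label below are all placeholders):

```shell
prodigy ner.manual my_dataset /full/path/to/custom_en_model data.jsonl --label LATITUDE
```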
You do need to be careful to use the new model everywhere with this dataset, or you might end up with inconsistent annotations.
The tokenizer settings are saved as part of the model when you train with spacy or prodigy, so you can distribute the new model and it will work without any additional customizations.