Hi!
For my NER model I stumbled across the following problem:
Some company names I am looking for typically end with phrases like “e.V.” or other abbreviations from a limited set (<20).
Because these were often split apart by the model's tokenizer, I found that you can do something like this:
```python
import spacy
from spacy.symbols import ORTH, LEMMA, NORM

test_nlp = spacy.blank('de')
special_case = [{ORTH: "e.V.", LEMMA: "e.V.", NORM: "eingetragener Verein"}]
test_nlp.tokenizer.add_special_case(u"e.V.", special_case)
```
which indeed leads to “e.V.” staying together as one token. I have 4 questions regarding this:
- When I tried `test_nlp.tokenizer.add_special_case(u"e. V.", special_case)` (note the whitespace), “e.” and “V.” were handled as separate tokens again. Is there a built-in way to tell the tokenizer that a pre-defined set of phrases should be left untouched, regardless of whether they contain whitespace? I believe a custom tokenizer is overkill, and it would only cause me further problems when it comes to exporting the model to disk in a way that I can use it in Prodigy…
- I found the new retokenizer and tried to understand its use case as explained in https://spacy.io/usage/linguistic-features#retokenization. I would have to look up the positions of the parts like “e.” and “V.” (and all other possibilities) in my doc in order to merge my pairs, right? (I have put a minimal sketch of what I mean below this list.)
- If there is no easy way to allow for this whitespace-abbreviation merging, I would guess the simplest solution would be a regex-based preprocessing step that removes the whitespace between these abbreviations (also sketched below), or is there a better way to do this in spaCy?
- The `NORM` property can be used to tell my model that different notations of these abbreviations are essentially the same, right? My naive approach would be to set `NORM` to a standard expanded form of the abbreviation, but I guess phrases are not allowed there (like I did above)? My second guess would be to join these possibilities with a rule-based entity approach, or am I missing something here?
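To make the retokenizer question more concrete, this is roughly what I had in mind (only a sketch: I am assuming a recent spaCy where `Matcher.add` takes a list of patterns — older versions use `matcher.add(key, None, pattern)` — and that “e.” and “V.” really come out as separate tokens, as described above):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("de")

# One pattern per whitespace-split abbreviation I want to keep together;
# "e." / "V." is the split I observed with the blank German tokenizer.
matcher = Matcher(nlp.vocab)
matcher.add("EV_ABBREV", [[{"ORTH": "e."}, {"ORTH": "V."}]])

doc = nlp("Musterverein e. V. hat seinen Sitz in Berlin.")
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        # Merge "e." + "V." into one token and set its NORM.
        retokenizer.merge(doc[start:end], attrs={"NORM": "eingetragener Verein"})

print([(t.text, t.norm_) for t in doc])
```

If that is what the docs mean by looking up the positions, I guess the Matcher is the intended way to find them?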
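And this is the kind of regex-based preprocessing I meant as a fallback (again only a sketch; the pattern table is made up for illustration and I would extend it to my full set of <20 abbreviations):

```python
import re

# Map a whitespace-tolerant pattern to the compact spelling that I register
# as a tokenizer special case above ("e. V." -> "e.V.").
ABBREV_PATTERNS = {
    r"\be\.\s+V\.": "e.V.",
}

def collapse_abbreviations(text):
    # Rewrite the whitespace-separated spellings before the text ever
    # reaches the tokenizer.
    for pattern, compact in ABBREV_PATTERNS.items():
        text = re.sub(pattern, compact, text)
    return text

print(collapse_abbreviations("Musterverein e. V. hat seinen Sitz in Berlin."))
# Musterverein e.V. hat seinen Sitz in Berlin.
```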
I tried to answer some of these questions with the documentation, but in articles like https://spacy.io/usage/adding-languages#tokenizer-exceptions, further splitting of tokens seems to be a much more common use case than keeping things together.
So, as always, I would greatly appreciate your guidance on this matter.
Thank you in advance
KS