Tokenizer vs. Abbreviation Set


For my NER model I stumbled across the following problem:
Some company names I am looking for typically end with phrases like “e.V.” or other abbreviations from a limited set (<20).

Because these were often split apart by the model's tokenizer, I found that you can do something like this:

import spacy
from spacy.symbols import ORTH, LEMMA, NORM

test_nlp = spacy.blank('de')

# Keep "e.V." as one token (LEMMA here is a spaCy v2 feature; v3 only accepts ORTH and NORM)
special_case = [{ORTH: "e.V.", LEMMA: "e.V.", NORM: "eingetragener Verein"}]
test_nlp.tokenizer.add_special_case(u"e.V.", special_case)

which indeed leads to “e.V.” staying together as one token. I have 4 questions regarding this:

  1. When I tried test_nlp.tokenizer.add_special_case(u"e. V.", special_case) (note the whitespace), ‘e.’ and ‘V.’ were handled as separate tokens again. Is there a built-in way to tell the tokenizer that a pre-defined set of phrases should be left untouched, regardless of whether they contain whitespace? I believe a custom tokenizer is overkill and it will only cause me further problems when it comes to exporting the model to disk in a way that I can use it in Prodigy…

  2. I found the new retokenizer and tried to understand its use case. I would have to look up the positions of the parts like “e.” and “V.” (and all other possibilities) in my doc to merge my pairs, right?

  3. If there is no easy way to allow for this whitespace-abbreviation-merging, I would guess the simplest way would be a regex-based preprocessing step that removes the whitespace inside these abbreviations, or is there a better way to do this in spaCy?

  4. The NORM property can be used to tell my model that different notations of these abbreviations are essentially the same, right? My naive approach would be to set NORM to a standard extended form of the abbreviation, but I guess there are no phrases allowed (like I did above)? My 2nd guess would be to join these possibilities with a rule-based entity approach, or am I missing something here?

I tried to answer some of these questions with the documentation, but in the articles I found, further token splitting seems to be a much more common case than keeping things together.

So, as always, I would greatly appreciate your guidance on this matter.
Thank you in advance


No, splitting on whitespace is pretty much the only built-in assumption in the tokenization algorithm. The tokenizer exceptions are always going to be instructions for how to split strings after the initial text was split on simple spaces. So if you wanted "e.V." to be ["e", ".", "V", "."], you'd use a special case rule for that.

A custom tokenizer does sound like it might be overkill, yes! In general, though, if you only modify the tokenization rules (prefix/suffix/infix rules and exceptions), those will be serialized with the tokenizer when you save out the nlp object. So it's then very easy to use the resulting model with your custom tokenization rules in Prodigy.
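A minimal round-trip sketch of that, assuming a blank German pipeline. The special case for the lowercase variant "e.v." and its norm are only illustrative, and a temporary directory stands in for wherever you would save the model.

```python
import tempfile

import spacy
from spacy.symbols import NORM, ORTH

nlp = spacy.blank("de")

# Illustrative rule: keep "e.v." as one token and give it the norm "e.V."
nlp.tokenizer.add_special_case("e.v.", [{ORTH: "e.v.", NORM: "e.V."}])

# Save the pipeline and load it back; the tokenizer rules travel with it
with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)
    reloaded = spacy.load(model_dir)

text = "Caritas e.v."
before = [(token.text, token.norm_) for token in nlp(text)]
after = [(token.text, token.norm_) for token in reloaded(text)]
print(before)
print(before == after)
```

The saved directory is what you would then point Prodigy at in place of a model name.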

You can use the rule-based matcher for that and have a pattern like [{"LOWER": "e."}, {"LOWER": "v."}] (or something similar, depending on the tokens spaCy produces and whether you want the matcher to be case-insensitive). This will give you the start and end tokens and will let you create a span object for the given match. You can then use the retokenizer to merge those tokens together. This could be a nice custom pipeline component that runs first in the pipeline.
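Here is a hedged sketch of that idea, written against the spaCy v3 Matcher.add signature (a list of patterns). The function name merge_abbreviations, the match label and the shared norm "e.V." are assumptions made for illustration, and the pattern assumes the tokenizer produces "e." and "V." as two tokens.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("de")

matcher = Matcher(nlp.vocab)
# Case-insensitive pattern for the two-token spelling "e. V."
matcher.add("EV_SPLIT", [[{"LOWER": "e."}, {"LOWER": "v."}]])


def merge_abbreviations(doc):
    """Merge "e." + "V." into one token and give all variants one norm."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    for token in doc:
        # Covers "e.V.", "e. V.", "e.v." and similar spellings
        if token.text.replace(" ", "").lower() == "e.v.":
            token.norm_ = "e.V."
    return doc


doc = merge_abbreviations(nlp("Caritas e. V. ist ein eingetragener Verein"))
print([(token.text, token.norm_) for token in doc])
```

The original text is untouched: the merged token still reads "e. V.", only its norm changes. To run this automatically, the function would have to be registered as a pipeline component and added first in the pipeline; how that is done differs between spaCy versions.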

One of the core principles of spaCy's Doc object is that you should always be able to reconstruct the original input text. This means that the token texts are immutable (also see my comment here for more details).

That said, it can sometimes be a good idea to normalise or preprocess your texts before you feed them to spaCy (also see textacy). Just make sure that you apply the same preprocessing for training, evaluation and at runtime.
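A small preprocessing sketch using only the standard library, in case that route fits better. The abbreviation list and the helper names are made up for illustration; the real set would be your own list of fewer than 20 abbreviations.

```python
import re

# Hypothetical canonical spellings; replace with your own abbreviation set
CANONICAL = ["e.V.", "e.K."]


def build_pattern(abbreviation):
    """Match the abbreviation with optional whitespace after each period."""
    parts = [re.escape(part) + r"\." for part in abbreviation.split(".") if part]
    # (?<!\w) stops the pattern from matching in the middle of a longer word
    return re.compile(r"(?<!\w)" + r"\s*".join(parts), re.IGNORECASE)


PATTERNS = [(build_pattern(abbr), abbr) for abbr in CANONICAL]


def normalize_abbreviations(text):
    """Rewrite spelling variants such as "e. V." or "e. v." to "e.V."."""
    for pattern, canonical in PATTERNS:
        text = pattern.sub(canonical, text)
    return text


print(normalize_abbreviations("Caritas e. V. und Beispiel e. v."))
```

Because this changes the text itself, character offsets of existing annotations no longer line up with the raw input, which is one more reason to run the same function at training, evaluation and runtime.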

Once you've merged the two tokens into one and you end up with one token "e. V.", you can set its norm to "eingetragener Verein" – however, I'm not really sure that's a good thing to do in this case. It makes sense for alternate spellings like "e.V." vs. "e. V." vs. "e.v.", which you want the model to treat as "the same thing".

But outside of your custom norm, there'll never be another token with the norm "eingetragener Verein", because those would always become two tokens. And even if you did merge those, "e.V." and "eingetragener Verein" do hold very different semantic clues that you should probably not normalise, because you want your model to pick up on them. For instance: "Caritas e.V. ist ein eingetragener Verein". "e.V." should be much more likely to be a proper noun, part of an entity etc.

Thank you so much for your very detailed answer and the hint to textacy. I will take a look at that and try to normalize these special cases in the preprocessing, because for my use case there is no information gain for the model in distinguishing between “e.V.”, “e. V.” and “e. v.”.

Regarding the NORM property: Yes, I can understand your objection and I will stick to a standardized but still abbreviated form like NORM="e.V." for all the abbreviation variants. Like you mentioned, the written out form might contain different contextual information.

BTW, I don’t know how often you get this feedback instead of the tons of nagging questions, but a big :+1: and :heart: for your and Matt’s outstanding support!
