spaCy sentence tokenizer: custom abbreviations, case-insensitive

Hi! Is it possible to add custom abbreviations to a tokenizer (e.g. en_core_web_md) so that they work regardless of case?

Tried with

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_md')
nlp.tokenizer.add_special_case('Ing.', [{ORTH: 'Ing.'}])  # How to make this case-insensitive so it also works for 'ing.', 'ING.', etc.?

text = 'Ing. Ken is a super hero but ing. Shiro is not.'
doc = nlp(text)
for sent in doc.sents:
    print(sent)

'''
Output:

Ing. Ken is a super hero but ing.
Shiro is not.
'''

No, there isn't a good way to do this with the tokenizer exceptions, which are case-sensitive. Usually we add case variants of exceptions to handle this.
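
For example, a minimal sketch of that workaround, using the 'Ing.' abbreviation from the question (which casing variants you register is up to you):

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_md')
# Register each casing variant as its own special case so the tokenizer keeps
# the trailing period attached for all of them.
for variant in ['Ing.', 'ing.', 'ING.']:
    nlp.tokenizer.add_special_case(variant, [{ORTH: variant}])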

If this is a big problem for a particular pipeline, then the alternative is to have a custom component right after the tokenizer that retokenizes, e.g. based on PhraseMatcher matches of the things you want to merge into one token. (Be sure to use nlp.make_doc to construct the docs when adding the phrase matcher patterns so that it's using the original tokenizer tokenization as input, or you might not get the right matches.)
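
For illustration, a small sketch of the nlp.make_doc point (the 'Ing.' pattern and the 'ABBREV' match key are just placeholder examples):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_md')
matcher = PhraseMatcher(nlp.vocab)
# nlp.make_doc only runs the tokenizer, so the pattern is tokenized exactly
# the way the matcher will see the text right after the tokenizer.
matcher.add("ABBREV", [nlp.make_doc("Ing.")])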

Hi, thanks! Is there any example of using the PhraseMatcher that I could use as a starting point?

Thinking about this again, this is also tricky with the PhraseMatcher with LOWER because the tokenization can differ depending on the casing (see "PhraseMatcher inconsistent matches with attr='LOWER'", explosion/spaCy issue #6994 on GitHub).
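
A quick way to see the problem is to compare the tokenization of differently-cased spellings of the same string; the exact splits depend on the tokenizer's exception list, so results may vary:

import spacy

nlp = spacy.blank("en")
# Tokenizer exceptions are case-sensitive, so differently-cased spellings of
# the same abbreviation may be split into different numbers of tokens, which
# makes LOWER phrase patterns built from one casing miss the others.
for text in ["a.m.", "A.M.", "ing.", "ING."]:
    print(text, [t.text for t in nlp.make_doc(text)])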

Let's see, this should mostly work for English by adding all common casing variants as LOWER patterns:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.language import Language
from spacy.util import filter_spans


@Language.factory("exc_retokenizer")
class ExceptionRetokenizer:
    def __init__(self, nlp, name="exc_retokenizer"):
        self.name = name
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        # Add common casing variants of each exception so the LOWER patterns
        # cover the different tokenizations each casing can produce.
        for exc in ["ing."]:
            pattern_docs = [
                nlp.make_doc(text)
                for text in [exc, exc.upper(), exc.lower(), exc.title()]
            ]
            self.matcher.add("A", pattern_docs)

    def __call__(self, doc):
        # Merge each non-overlapping match back into a single token.
        with doc.retokenize() as retokenizer:
            for match in filter_spans(self.matcher(doc, as_spans=True)):
                retokenizer.merge(match)
        return doc


nlp = spacy.blank("en")
nlp.add_pipe("exc_retokenizer")
print([t.text for t in nlp("ING. InG. Ing. ing.")])

The initialization and serialization get trickier once you're not hard-coding the exceptions in __init__, but the basic idea is to load the patterns from some data format (typically JSON) in initialize and to save them in to_disk/to_bytes. An example is in this section of the docs: https://spacy.io/usage/processing-pipelines#component-data-initialization
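
A rough sketch of what that could look like, assuming the exceptions arrive as a plain list of strings via the [initialize] block of the config (the exceptions argument, the _add_exception helper and the exceptions.json filename are illustrative, not fixed spaCy names):

import srsly
from pathlib import Path
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans


@Language.factory("exc_retokenizer")
class ExceptionRetokenizer:
    def __init__(self, nlp, name="exc_retokenizer"):
        self.name = name
        self.nlp = nlp
        self.exceptions = []
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

    def initialize(self, get_examples=None, nlp=None, exceptions=tuple()):
        # "exceptions" is filled from [initialize.components.exc_retokenizer]
        # in the config, e.g. pointing at a JSON list of abbreviation strings.
        for exc in exceptions:
            self._add_exception(exc)

    def _add_exception(self, exc):
        self.exceptions.append(exc)
        self.matcher.add(
            exc,
            [
                self.nlp.make_doc(text)
                for text in [exc, exc.upper(), exc.lower(), exc.title()]
            ],
        )

    def __call__(self, doc):
        with doc.retokenize() as retokenizer:
            for match in filter_spans(self.matcher(doc, as_spans=True)):
                retokenizer.merge(match)
        return doc

    def to_disk(self, path, exclude=tuple()):
        # Save the exception strings so the component can be restored later.
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        srsly.write_json(path / "exceptions.json", self.exceptions)

    def from_disk(self, path, exclude=tuple()):
        # Rebuild the matcher from the saved exception strings.
        for exc in srsly.read_json(Path(path) / "exceptions.json"):
            self._add_exception(exc)
        return self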
