spaCy sentence tokenizer: custom abbreviations, case-insensitive

Hi! Is it possible to add custom abbreviations to a tokenizer (e.g. en_core_web_md) so that they work regardless of case?

Tried with

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_md')
nlp.tokenizer.add_special_case('Ing.', [{ORTH: 'Ing.'}])  # How to make this case-insensitive so it also works for 'ing.', 'ING.', etc.?

text = 'Ing. Ken is a super hero but ing. Shiro is not.'
doc = nlp(text)
for sent in doc.sents:
    print(sent)

'''
Output:

Ing. Ken is a super hero but ing.
Shiro is not.
'''

No, there isn't a good way to do this with the tokenizer exceptions, which are case-sensitive. Usually we add case variants of exceptions to handle this.
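
For example, a minimal sketch of that workaround, using the 'Ing.' abbreviation from the question (which casing variants you register is up to you):

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_md')
# Register each casing variant as its own special case so the tokenizer keeps
# the trailing period attached for all of them.
for variant in ['Ing.', 'ing.', 'ING.']:
    nlp.tokenizer.add_special_case(variant, [{ORTH: variant}])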

If this is a big problem for a particular pipeline, then the alternative is to have a custom component right after the tokenizer that retokenizes, e.g. based on PhraseMatcher matches of the things you want to merge into one token. (Be sure to use nlp.make_doc to construct the docs when adding the phrase matcher patterns so that it's using the original tokenizer tokenization as input, or you might not get the right matches.)
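
For illustration, a small sketch of the nlp.make_doc point (the 'Ing.' pattern and the 'ABBREV' match key are just placeholder examples):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_md')
matcher = PhraseMatcher(nlp.vocab)
# nlp.make_doc only runs the tokenizer, so the pattern is tokenized exactly
# the way the matcher will see the text right after the tokenizer.
matcher.add("ABBREV", [nlp.make_doc("Ing.")])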

Hi, thanks! Is there any example of using the PhraseMatcher that I could use as a starting point?

Thinking about this again, this is also tricky with the PhraseMatcher with LOWER because the tokenization can differ depending on the casing (see "PhraseMatcher inconsistent matches with attr='LOWER'", explosion/spaCy issue #6994 on GitHub).
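
A quick way to see the problem is to compare the tokenization of differently-cased spellings of the same string; the exact splits depend on the tokenizer's exception list, so results may vary:

import spacy

nlp = spacy.blank("en")
# Tokenizer exceptions are case-sensitive, so differently-cased spellings of
# the same abbreviation may be split into different numbers of tokens, which
# makes LOWER phrase patterns built from one casing miss the others.
for text in ["a.m.", "A.M.", "ing.", "ING."]:
    print(text, [t.text for t in nlp.make_doc(text)])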

Let's see, this should mostly work for English by adding all common casing variants as LOWER patterns:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.language import Language
from spacy.util import filter_spans


@Language.factory("exc_retokenizer")
class ExceptionRetokenizer:
    def __init__(self, nlp, name="exc_retokenizer"):
        self.name = name
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        # Add common casing variants of each exception so the LOWER patterns
        # cover the different tokenizations each casing can produce.
        for exc in ["ing."]:
            pattern_docs = [
                nlp.make_doc(text)
                for text in [exc, exc.upper(), exc.lower(), exc.title()]
            ]
            self.matcher.add("A", pattern_docs)

    def __call__(self, doc):
        # Merge each non-overlapping match back into a single token.
        with doc.retokenize() as retokenizer:
            for match in filter_spans(self.matcher(doc, as_spans=True)):
                retokenizer.merge(match)
        return doc


nlp = spacy.blank("en")
nlp.add_pipe("exc_retokenizer")
print([t.text for t in nlp("ING. InG. Ing. ing.")])

The initialization and serialization get trickier once you're not hard-coding the exceptions in __init__, but the basic idea is to load the patterns from some data format (typically JSON) in initialize and to save them in to_disk/to_bytes. An example is in this section of the docs: https://spacy.io/usage/processing-pipelines#component-data-initialization
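
A rough sketch of what that could look like, assuming the exceptions arrive as a plain list of strings via the [initialize] block of the config (the exceptions argument, the _add_exception helper and the exceptions.json filename are illustrative, not fixed spaCy names):

import srsly
from pathlib import Path
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans


@Language.factory("exc_retokenizer")
class ExceptionRetokenizer:
    def __init__(self, nlp, name="exc_retokenizer"):
        self.name = name
        self.nlp = nlp
        self.exceptions = []
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

    def initialize(self, get_examples=None, nlp=None, exceptions=tuple()):
        # "exceptions" is filled from [initialize.components.exc_retokenizer]
        # in the config, e.g. pointing at a JSON list of abbreviation strings.
        for exc in exceptions:
            self._add_exception(exc)

    def _add_exception(self, exc):
        self.exceptions.append(exc)
        self.matcher.add(
            exc,
            [
                self.nlp.make_doc(text)
                for text in [exc, exc.upper(), exc.lower(), exc.title()]
            ],
        )

    def __call__(self, doc):
        with doc.retokenize() as retokenizer:
            for match in filter_spans(self.matcher(doc, as_spans=True)):
                retokenizer.merge(match)
        return doc

    def to_disk(self, path, exclude=tuple()):
        # Save the exception strings so the component can be restored later.
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        srsly.write_json(path / "exceptions.json", self.exceptions)

    def from_disk(self, path, exclude=tuple()):
        # Rebuild the matcher from the saved exception strings.
        for exc in srsly.read_json(Path(path) / "exceptions.json"):
            self._add_exception(exc)
        return self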
