Custom Tokenizer help


I am creating a custom Spanish tokenizer so that I can run the add_tokens function over an annotations file with completed spans, splitting the tokens the way I want. I have already updated the tokenizer of the blank:es model to isolate punctuation marks as separate tokens. Examples:

  • Secretario/a -> ["Secretario", "/", "a"]
  • Vitoria-Gasteiz -> ["Vitoria", "-", "Gasteiz"]
  • Hola. -> ["Hola", "."]

I have done this by adding all the punctuation symbols as prefixes, infixes and suffixes with a regex:

import spacy

# blank Spanish pipeline whose tokenizer I am customising
nlp = spacy.blank("es")

prefixes = nlp.Defaults.prefixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

infixes = nlp.Defaults.infixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

suffixes = nlp.Defaults.suffixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

NOTE: I'm not sure whether this regex syntax behaves exactly as I want.
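A quick way to check is to run the updated tokenizer over the examples above and print the tokens (the output shown here is what I expect, based on the splits listed earlier):

for text in ["Secretario/a", "Vitoria-Gasteiz", "Hola."]:
    print([t.text for t in nlp(text)])
# expected:
# ['Secretario', '/', 'a']
# ['Vitoria', '-', 'Gasteiz']
# ['Hola', '.']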

But with a text fragment like "Secretario/a." it splits the tokens into:
["Secretario", "/", "a."]
when I want:
["Secretario", "/", "a", "."]
Because of this I run into token mismatch errors later on.

I am pretty sure this is happening because it interprets "a." as an abbreviation or acronym. So, how can I fix that in the tokenizer and obtain the results I want?

Thanks for your question, and thanks for reposting it on the spaCy GitHub discussions. Generally, spaCy-specific questions are better suited for that forum, as Prodigy Support is best for Prodigy-specific questions.

I'll repost the answer from @polm below for anyone interested.

This is a bit tricky because what you want to do here is remove an existing special case.

First, you can figure out what's going on using the explain method:

nlp.tokenizer.explain("Secretario/a.")
# => [('TOKEN', 'Secretario'), ('INFIX', '/'), ('SPECIAL-1', 'a.')]

SPECIAL refers to a tokenizer exception in this case.
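You can also confirm that "a." is registered as a special case by looking it up in the tokenizer's rules dict:

print("a." in nlp.tokenizer.rules)
# True if the "a." exception is present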

You can remove a tokenizer exception by modifying tokenizer.rules, like this:

rules = nlp.tokenizer.rules        # the current special cases as a dict
del rules["a."]                    # remove the "a." exception
nlp.tokenizer.rules = rules        # reassign so the tokenizer picks up the change
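Once the rules are reassigned, you can re-run explain (or simply tokenize the text) to confirm the trailing period is split off; the expected result, based on the question, is:

print([t.text for t in nlp("Secretario/a.")])
# expected: ['Secretario', '/', 'a', '.']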

In your case, it sounds like you might want to do this for every lowercase letter. That should be easy to do in a for loop, but if you want to remove more exceptions then it might be better to make your own language subclass where you control the exceptions directly. See the guide to language subclassing or the Spanish definition for an example of what that looks like.
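Here's a minimal sketch of that loop, assuming nlp is the blank:es pipeline from above; it only removes exceptions of the form "x." for single ASCII lowercase letters and skips any that don't exist:

import string

rules = nlp.tokenizer.rules
for letter in string.ascii_lowercase:
    rules.pop(letter + ".", None)   # drop "a.", "b.", ... if present
nlp.tokenizer.rules = rules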