Custom Tokenizer help


I am creating a custom Spanish tokenizer so that I can run the add_tokens function over an annotations file with completed spans, splitting the tokens the way I want. I have already updated the tokenizer of the blank:es model to isolate punctuation marks as separate tokens. Examples:

  • Secretario/a -> ["Secretario", "/", "a"]
  • Vitoria-Gasteiz -> ["Vitoria", "-", "Gasteiz"]
  • Hola. -> ["Hola", "."]

I have done this by adding all the punctuation symbols as prefixes, infixes and suffixes with a regex:

import spacy

# blank Spanish pipeline whose tokenizer I am customising
nlp = spacy.blank("es")

prefixes = nlp.Defaults.prefixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

infixes = nlp.Defaults.infixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

suffixes = nlp.Defaults.suffixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

NOTE: I'm not sure whether this regex syntax behaves exactly as I want.
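A quick way to check is to run the updated tokenizer over the examples above and print the tokens (the output shown here is what I expect, based on the splits listed earlier):

for text in ["Secretario/a", "Vitoria-Gasteiz", "Hola."]:
    print([t.text for t in nlp(text)])
# expected:
# ['Secretario', '/', 'a']
# ['Vitoria', '-', 'Gasteiz']
# ['Hola', '.']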

But with a text fragment like "Secretario/a." it splits the tokens into:
["Secretario", "/", "a."]
when I want:
["Secretario", "/", "a", "."]
Because of this I run into token mismatch errors later on.

I am pretty sure this is happening because it interprets "a." as an abbreviation or acronym. So, how can I fix that in the tokenizer and obtain the results I want?

Thanks for your question, and thanks for reposting it on the spaCy GitHub discussions. Generally, spaCy-specific questions are better suited for that forum, as Prodigy Support is best for Prodigy-specific questions.

I'll repost the answer from @polm below for anyone interested.

This is a bit tricky because what you want to do here is remove an existing special case.

First, you can figure out what's going on using the explain method:

nlp.tokenizer.explain("Secretario/a.")
# => [('TOKEN', 'Secretario'), ('INFIX', '/'), ('SPECIAL-1', 'a.')]

SPECIAL refers to a tokenizer exception in this case.
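You can also confirm that "a." is registered as a special case by looking it up in the tokenizer's rules dict:

print("a." in nlp.tokenizer.rules)
# True if the "a." exception is present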

You can remove a tokenizer exception by modifying tokenizer.rules, like this:

rules = nlp.tokenizer.rules        # the current special cases as a dict
del rules["a."]                    # remove the "a." exception
nlp.tokenizer.rules = rules        # reassign so the tokenizer picks up the change
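Once the rules are reassigned, you can re-run explain (or simply tokenize the text) to confirm the trailing period is split off; the expected result, based on the question, is:

print([t.text for t in nlp("Secretario/a.")])
# expected: ['Secretario', '/', 'a', '.']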

In your case, it sounds like you might want to do this for every lowercase letter. That should be easy to do in a for loop, but if you want to remove more exceptions then it might be better to make your own language subclass where you control the exceptions directly. See the guide to language subclassing or the Spanish definition for an example of what that looks like.
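Here's a minimal sketch of that loop, assuming nlp is the blank:es pipeline from above; it only removes exceptions of the form "x." for single ASCII lowercase letters and skips any that don't exist:

import string

rules = nlp.tokenizer.rules
for letter in string.ascii_lowercase:
    rules.pop(letter + ".", None)   # drop "a.", "b.", ... if present
nlp.tokenizer.rules = rules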