Hi,
I am creating a custom Spanish tokenizer so that I can run the add_tokens function on an annotations file with completed spans, with the tokens split the way I want. I have already updated the tokenizer of the blank:es model so that punctuation marks end up in separate tokens. Examples:
- Secretario/a -> ["Secretario", "/", "a"]
- Vitoria-Gasteiz -> ["Vitoria", "-", "Gasteiz"]
- Hola. -> ["Hola", "."]
I did this by adding all the available punctuation symbols as prefix, infix and suffix rules via regex:
import spacy

nlp = spacy.blank("es")

prefixes = nlp.Defaults.prefixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

infixes = nlp.Defaults.infixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

suffixes = nlp.Defaults.suffixes + [r'''[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]''']
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
NOTE: I'm not sure whether this regex syntax does exactly what I want.
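For reference, this is roughly how I have been checking the tokenization on the examples above (just printing the token texts):

for text in ["Secretario/a", "Vitoria-Gasteiz", "Hola."]:
    # print the text of each token produced by the customized tokenizer
    print([token.text for token in nlp(text)])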
But with a text fragment like:
Secretario/a.
it splits the tokens into:
["Secretario", "/", "a."]
when what I want is:
["Secretario", "/", "a", "."]
Because of this, I run into token mismatch errors later on.
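A minimal reproduction of the problem:

# the final "." stays attached to "a"
print([token.text for token in nlp("Secretario/a.")])
# -> ["Secretario", "/", "a."] instead of ["Secretario", "/", "a", "."]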
I am pretty sure this happens because "a." is being interpreted as an acronym or abbreviation. So, how can I change this behaviour in the Tokenizer and get the result I want?
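I haven't managed to confirm where this comes from. I think something like spaCy's Tokenizer.explain, together with the tokenizer's special-case rules, could show which rule keeps "a." together, roughly:

# show which rule (PREFIX/SUFFIX/INFIX/SPECIAL/TOKEN) produced each token
for rule, token_text in nlp.tokenizer.explain("Secretario/a."):
    print(rule, repr(token_text))

# check whether "a." is registered as a tokenizer exception / special case
print("a." in nlp.tokenizer.rules)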