Prodigy is losing my tokeniser

I created and saved a blank en model with a custom tokeniser - it's a bit of a hack in an attempt to create a character model.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')
# every character matches as an infix, so each character becomes its own token
infixes = (r".",)
infix_re = compile_infix_regex(infixes)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    suffix_search=None,
    prefix_search=None,
    infix_finditer=infix_re.finditer,
    token_match=None)
nlp.add_pipe('ner')
nlp.initialize()  # begin_training() is deprecated in spaCy v3

This works: I can save and load it, and it maintains the tokenization.

print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])
> [N, A, E, -, 0, 3, /, N, 2, -, 1, ., V, A, V, -, 1, 5, 6, ., A, C, _, 8, -, 6, S, T, A, -, D, U, R]
nlp.to_disk("models/char-blank")
nlp = spacy.load("models/char-blank")
print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])
> [N, A, E, -, 0, 3, /, N, 2, -, 1, ., V, A, V, -, 1, 5, 6, ., A, C, _, 8, -, 6, S, T, A, -, D, U, R]

I've used ner.manual to create a dataset and then trained using the train recipe:

prodigy train models/model --ner dataset

The problem is that when I load the trained model, the tokenizer seems to have changed:

import spacy
nlp = spacy.load("models/model/model-best")
print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])
> [NAE-03, /, N2, -, 1.VAV-156.AC_8, -, 6, STA, -, DUR]

It seems to be using the default spaCy English tokenizer, which fails on my corpus as it's technical IoT data.

I can clearly see that the tokenizer file in models/model/model-best is different from the one in my model.

I also tried:

prodigy train models/model --ner dataset --base-model models/char-blank

But the same issue occurred: the tokenizer was changed.

I've figured out why this is happening: spaCy creates the model for training from the config file, and that config is the same as the default blank en model's. In particular, it contains:

tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

So that's what's returned by spaCy's load_model_from_config - it's a bit confusing at first that it differs from spacy.load, but I guess that's because I didn't set up my config correctly.
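You can see this default in any blank pipeline's config - it only records the registered factory name, not any settings patched in programmatically:

```python
import spacy

nlp = spacy.blank("en")
# the config stores the registered tokenizer factory, not the Tokenizer
# instance I assigned to nlp.tokenizer by hand
print(nlp.config["nlp"]["tokenizer"])
# {'@tokenizers': 'spacy.Tokenizer.v1'}
```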

I created a factory for my tokenizer:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

@spacy.registry.tokenizers("custom_tokenizer")
def create_custom_tokenizer():
    infixes = (r".",)
    infix_re = compile_infix_regex(infixes)

    def create_tokenizer(nlp):
        return Tokenizer(
                    nlp.vocab, 
                    suffix_search=None,
                    prefix_search=None,
                    infix_finditer=infix_re.finditer,
                    token_match=None)

    return create_tokenizer

And modified the config file:

tokenizer = {"@tokenizers": "custom_tokenizer"}
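With the function registered, the pipeline can be built from that config alone, so the character tokenization survives serialization. A self-contained sketch (the factory is repeated here so the snippet runs on its own):

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

@spacy.registry.tokenizers("custom_tokenizer")
def create_custom_tokenizer():
    infix_re = compile_infix_regex((r".",))  # every character is an infix

    def create_tokenizer(nlp):
        return Tokenizer(
            nlp.vocab,
            suffix_search=None,
            prefix_search=None,
            infix_finditer=infix_re.finditer,
            token_match=None)

    return create_tokenizer

# the tokenizer is now fully described by the config
nlp = spacy.blank(
    "en",
    config={"nlp": {"tokenizer": {"@tokenizers": "custom_tokenizer"}}})
print([t.text for t in nlp("AC_8-6")])
# ['A', 'C', '_', '8', '-', '6']
```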

My outstanding question: spacy train has a --code parameter that lets you pass in a file with registered functions, but I don't see an equivalent for the prodigy train recipe. I can `from prodigy.recipes.train import train` and run it from a script, but when I run prodigy train from the console it cannot find my registry function.

I double-checked that --base-model preserves the tokenizer config/settings, and everything seemed fine with Prodigy v1.11.7. Your customized char model will still have "spacy.Tokenizer.v1" in its config because it's the same underlying Tokenizer, just with different settings.

You're right that there doesn't appear to be a simple equivalent of --code for prodigy train. I think the easiest option with prodigy train is to have the tokenizer already configured correctly in the base model.

I don't think you should need a registered custom tokenizer for this use case. What might be confusing is that if you modify the tokenizer in a loaded pipeline like this:

nlp = spacy.blank("lg")
nlp.tokenizer = CustomTokenizer(args)

... this isn't actually synced in the underlying config in nlp.config. Instead, you'd want to specify the correct tokenizer config with spacy.blank:

nlp = spacy.blank("lg", config={"nlp": {"tokenizer": {"@tokenizers": "my.CustomTokenizer.v1"}}})

Then you can still modify the settings further programmatically if you'd like, but the config has all the information to save and reload this particular tokenizer correctly (well, as long as your custom tokenizer implements the serialization correctly):

nlp.tokenizer.custom_rules = {}

For what you're describing, a registered custom tokenizer isn't really needed and it introduces a lot of overhead. It should be fine to customize the settings for Tokenizer and save that in a base model.
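A minimal sketch of that suggestion: keep the default Tokenizer (so the config still points at "spacy.Tokenizer.v1") and only overwrite its settings in place before saving the base model. The attribute names below are the standard Tokenizer properties:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
# every character matches as an infix, so each character becomes a token
infix_re = compile_infix_regex((r".",))
nlp.tokenizer.prefix_search = None
nlp.tokenizer.suffix_search = None
nlp.tokenizer.token_match = None
nlp.tokenizer.url_match = None
nlp.tokenizer.rules = {}  # drop the default English exception rules
nlp.tokenizer.infix_finditer = infix_re.finditer

print([t.text for t in nlp("AC_8-6")])
# ['A', 'C', '_', '8', '-', '6']
```

The Tokenizer serializes these settings itself, so after nlp.to_disk() the reloaded base model keeps the character-level splitting without any registered function.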

If you move to using spaCy directly with prodigy data-to-spacy, there are some other options with spacy assemble or spacy train: customize the tokenizer settings during init (https://spacy.io/usage/training#custom-tokenizer) or copy settings from a base model during init (https://spacy.io/api/top-level#copy_from_base_model).
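For example, sketched with illustrative paths and file names (corpus/ and functions.py are placeholders for your own output directory and code file):

```shell
# export the annotations plus a training config, then train with spaCy
# directly, where --code can import the module that registers the tokenizer
prodigy data-to-spacy ./corpus --ner dataset
python -m spacy train ./corpus/config.cfg --code functions.py \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```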