Hi! There are different ways to do this and it kinda depends on what your tokenizer does (i.e. whether it only modifies the rule sets or whether it's a fully custom `Tokenizer` etc.). The most elegant solution will probably always be to include a reference to a custom function in your config, and provide its code from a file. In Prodigy's recipes, you can use the `-F` option to provide a path to a code file to import. In spaCy, you can do this via the `--code` argument. If you know your tokenizer isn't going to change, you could also run `spacy package` to package your base model and install it in the environment – your custom code will then be included and you don't have to worry about providing it.
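To sketch what that looks like on the command line (the file names `functions.py` and `config.cfg`, the dataset name and the paths are all placeholders, and the exact recipe/arguments depend on your setup):

```shell
# Prodigy: -F imports extra code alongside the recipe
prodigy train ./output --ner my_dataset -F functions.py

# spaCy: --code imports the file before the config is resolved
python -m spacy train config.cfg --output ./output --code functions.py

# Or package the pipeline with the custom code included
python -m spacy package ./model ./packages --code functions.py
```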
If you have a fully custom `Tokenizer`, you can add your own `[nlp.tokenizer]` block: https://spacy.io/usage/linguistic-features#custom-tokenizer-training
If you just want to modify certain rules before training a new pipeline, you can add a callback that runs before initialization: https://spacy.io/usage/training#custom-tokenizer
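For the rule-modification route, a sketch of such a callback, assuming you want to add a special-case rule before training (the callback name and the "lemme" rule are just examples):

```python
import spacy
from spacy.symbols import ORTH


@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Add a special-case rule so "lemme" is split into two tokens
        nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])

    return customize_tokenizer
```

And in the config, hook it into initialization:

```
[initialize.before_init]
@callbacks = "customize_tokenizer"
```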
That's strange! This should be removed before the `nlp` object is saved out. Which version of Prodigy are you using, and which config/base model did you start with?