How to define a custom tokenizer when using Prodigy?

Hello, we've been using Prodigy for a while now and have had to replace the default tokenizer (a lot).
We had a tolerable (hacky but working) solution for spaCy 2, but the new config system throws a wrench in it.

Is there an elegant way to add code to the Prodigy/spaCy training that defines the tokenizer, so I can reference it in the configs? Or is there a different way?

(Also, when using the Prodigy auto-config, the spaCy command spacy package fails due to prodigy.ConsoleLogger.v1 still being referenced.)

Hi! There are different ways to do this and it kinda depends on what your tokenizer does (i.e. whether it only modifies the rule sets or whether it's a fully custom Tokenizer etc.). The most elegant solution will probably always be to include a reference to a custom function in your config, and provide its code from a file. In Prodigy's recipes, you can use the -F option to provide a path to a code file to import. In spaCy, you can do this via the --code argument. If you know your tokenizer isn't going to change, you could also run spacy package to package your base model and install it in the environment – your custom code will then be included and you don't have to worry about providing it.

If you have a fully custom Tokenizer, you can add your own [nlp.tokenizer] block:
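As a minimal sketch of what that could look like: the file below registers a tokenizer function under a name that the config can then reference. The class `WhitespaceTokenizer`, the registry name `whitespace_tokenizer.v1` and the file name `functions.py` are all made up for illustration, and the no-op serialization methods assume your tokenizer has no state of its own to save.

```python
# functions.py (hypothetical file name), provided via -F in Prodigy
# or --code in spaCy.
#
# Matching block in the training config:
#
#   [nlp.tokenizer]
#   @tokenizers = "whitespace_tokenizer.v1"

import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Illustrative fully custom tokenizer that splits on whitespace only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Note: str.split() collapses runs of whitespace, so doc.text can
        # differ from the input if it contains repeated spaces or newlines.
        words = text.split()
        # Every token except the last one is followed by a single space.
        spaces = [True] * (len(words) - 1) + [False] if words else []
        return Doc(self.vocab, words=words, spaces=spaces)

    # No-op serialization hooks, called when a pipeline is saved or loaded.
    # Assumption: this tokenizer has nothing of its own to store.
    def to_disk(self, path, **kwargs):
        pass

    def from_disk(self, path, **kwargs):
        return self

    def to_bytes(self, **kwargs):
        return b""

    def from_bytes(self, data, **kwargs):
        return self


@spacy.registry.tokenizers("whitespace_tokenizer.v1")
def create_whitespace_tokenizer():
    # The registered function returns a callable that receives the nlp
    # object and returns the tokenizer to use.
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer
```

To try it outside of training, you can assign it directly with `nlp.tokenizer = create_whitespace_tokenizer()(nlp)` on a blank pipeline.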

If you just want to modify certain rules before training a new pipeline, you can add a callback that runs before initialization:
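A rough sketch of such a callback, assuming you only want to add a special-case rule to the existing tokenizer. The registry name `customize_tokenizer.v1` and the "gimme" rule are placeholders; swap in whatever rule changes you actually need.

```python
# functions.py (hypothetical file name), provided via -F in Prodigy
# or --code in spaCy.
#
# Matching block in the training config:
#
#   [initialize.before_init]
#   @callbacks = "customize_tokenizer.v1"

import spacy
from spacy.symbols import ORTH


@spacy.registry.callbacks("customize_tokenizer.v1")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Placeholder rule: always split "gimme" into "gim" + "me".
        nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

    return customize_tokenizer
```

Because the callback modifies the tokenizer that already lives on the nlp object, the changed rules are saved with the trained pipeline like any other tokenizer settings.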

That's strange :thinking: The prodigy.ConsoleLogger.v1 reference should be removed before the nlp object is saved out. Which version of Prodigy are you using and which config/base model did you start with?

We've since updated to the newest version and can't seem to replicate the problem either.

Using the -F option makes a lot of sense; we've been using it for custom recipes, but somehow didn't make the connection to also use it for custom registries, thanks!

Glad it worked! :blush:

This feature is kinda new in v1.11. It already worked before because all -F really did was import the file, but v1.11 makes this official and also supports multiple files. So you can have one file for your custom recipe and one for your tokenizer and do -F,