How to define a custom Tokenizer when using prodigy?

gku · September 13, 2021, 11:50am

Hello, we've been using prodigy for a while now and had to replace the default tokenizer (a lot).
We had a tolerable (hacky but working) solution for spacy 2, but the new config system throws a wrench in it.

Is there an elegant way to add code to the prodigy-spacy training that defines the tokenizer, so I can reference it in the configs? Or is there a different way?

(Also, when using the prodigy auto-config, the spacy package command fails due to prodigy.ConsoleLogger.v1 still being referenced.)

ines · September 16, 2021, 1:10am

Hi! There are different ways to do this and it kinda depends on what your tokenizer does (i.e. whether it only modifies the rule sets or whether it's a fully custom Tokenizer etc.). The most elegant solution will probably always be to include a reference to a custom function in your config, and provide its code from a file. In Prodigy's recipes, you can use the -F option to provide a path to a code file to import. In spaCy, you can do this via the --code argument. If you know your tokenizer isn't going to change, you could also run spacy package to package your base model and install it in the environment – your custom code will then be included and you don't have to worry about providing it.

If you have a fully custom Tokenizer, you can add your own [nlp.tokenizer] block: Linguistic Features · spaCy Usage Documentation
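A minimal sketch of what such a code file might look like, assuming a whitespace-only tokenizer purely for illustration. The class name WhitespaceTokenizer and the registry name whitespace_tokenizer are placeholders, not anything from this thread, and the no-op serialization methods assume the tokenizer carries no state of its own:

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Illustrative tokenizer that splits on whitespace only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        # Mark every token except the last as followed by a single space.
        # Runs of whitespace are collapsed, so doc.text may differ from text.
        spaces = [True] * len(words)
        if spaces:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

    # No-op serialization (assumption: nothing to store besides the shared
    # vocab). The nlp object calls these methods when saving and loading.
    def to_bytes(self, **kwargs):
        return b""

    def from_bytes(self, data, **kwargs):
        return self

    def to_disk(self, path, **kwargs):
        return None

    def from_disk(self, path, **kwargs):
        return self


@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    # The registered function returns a callable that receives the nlp
    # object and returns the tokenizer instance.
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer


# Same effect as this block in a config file:
# [nlp.tokenizer]
# @tokenizers = "whitespace_tokenizer"
nlp = spacy.blank(
    "en",
    config={"nlp": {"tokenizer": {"@tokenizers": "whitespace_tokenizer"}}},
)
doc = nlp("What's happened to me?")
print([token.text for token in doc])
```

Saved as a file (say, a hypothetical tokenizer.py), this could be passed with -F tokenizer.py to a Prodigy recipe or with --code tokenizer.py to spacy train, so the registered name resolves when the config is loaded.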

If you just want to modify certain rules before training a new pipeline, you can add a callback that runs before initialization: Training Pipelines & Models · spaCy Usage Documentation
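As a rough sketch of the callback approach, assuming the only change needed is one extra special-case rule: the callback name customize_tokenizer and the "gimme" rule are illustrative placeholders, and a real callback would apply whatever rule changes your data needs.

```python
import spacy


@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Example rule change: always split "gimme" into two tokens.
        nlp.tokenizer.add_special_case(
            "gimme", [{"ORTH": "gim"}, {"ORTH": "me"}]
        )

    return customize_tokenizer


# In the training config, the callback would be referenced like this:
# [initialize.before_init]
# @callbacks = "customize_tokenizer"

# Quick check outside of training: apply the callback to a blank pipeline.
nlp = spacy.blank("en")
make_customize_tokenizer()(nlp)
tokens = [token.text for token in nlp("gimme that")]
print(tokens)
```

Because the callback modifies the default tokenizer's rules rather than replacing the tokenizer, the changes should be saved along with the trained pipeline, so the code file is mainly needed at training time.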

That's strange. The prodigy.ConsoleLogger.v1 reference should be removed before the nlp object is saved out. Which version of Prodigy are you using and which config/base model did you start with?

gku · September 16, 2021, 10:25am

We've since updated to the newest version and can't seem to replicate the problem either.

Using the -F option makes a lot of sense; we've been using it for custom recipes, but somehow didn't make the connection to also use it for custom registries, thanks!

ines · September 20, 2021, 5:50am

Glad it worked!

This feature is kinda new in v1.11. It already worked before because all -F really did was import the file, but v1.11 makes this official and also supports multiple files. So you can have one file for your custom recipe and one for your tokenizer and do -F recipe.py,tokenizer.py.
