`prodigy train` doesn't seem to use the tokenizer from base-model

I've created a base model that modifies the default spaCy tokenizer. When I use the Prodigy NER/span manual recipes on this base model, the tokenization works as expected. After doing some annotations, I want to train a new model and use the spans.correct recipe. When I do this, Prodigy uses the default tokenizer instead of my custom one, which makes sense, since the newly trained pipeline doesn't include my custom rules. I saw in the documentation that a base-model can be specified:

--base-model, -m | str | Optional spaCy pipeline to update or use for tokenization and sentence segmentation.

However, even when I specify my base model, Prodigy doesn't seem to respect the custom tokenizer rules. Am I misunderstanding something here about what base-model is supposed to do, and is there some other way to use my custom tokenizer in my trained model?
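
For context, the base model is built along these lines (a sketch; the special-case rule is just an example of the kind of customization involved, and the path is a placeholder):

# Sketch of building a base model with a customized tokenizer.
# The special case is just an illustrative rule; the path is a placeholder.
import spacy

nlp = spacy.blank("en")
# e.g. tell the tokenizer to split "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{"ORTH": "gim"}, {"ORTH": "me"}])
nlp.to_disk("./base_model")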

My current hacky solution is to run a script each time after I've trained the model, which opens both models (base and trained), sets trained.tokenizer to base.tokenizer, then saves the trained model back out.
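
The script is essentially this (model paths are placeholders):

# patch_tokenizer.py -- copy the custom tokenizer from the base model
# into the freshly trained one (paths are placeholders)
import spacy

base = spacy.load("./base_model")        # pipeline with the custom tokenizer
trained = spacy.load("./trained_model")  # output of prodigy train

trained.tokenizer = base.tokenizer       # replace the default tokenizer
trained.to_disk("./trained_model")       # write the patched pipeline back out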

I just came across this in the spaCy docs, which mostly answers my question:

If you’ve loaded a trained pipeline, writing to the nlp.Defaults or English.Defaults directly won’t work, since the regular expressions are read from the pipeline data and will be compiled when you load it. If you modify nlp.Defaults, you’ll only see the effect if you call spacy.blank. If you want to modify the tokenizer loaded from a trained pipeline, you should modify nlp.tokenizer directly. If you’re training your own pipeline, you can register callbacks to modify the nlp object before training.
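
In other words, patching the tokenizer of an already-loaded pipeline looks roughly like this (a sketch; the extra infix rule is just an example of a customization):

# Sketch: modify nlp.tokenizer directly on a loaded pipeline.
# The extra infix rule (splitting on "/") is just an example.
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# extend the default infix patterns so "/" splits tokens
infixes = nlp.Defaults.infixes + [r"/"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("a/b")])  # ['a', '/', 'b']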

It looks like prodigy train can take an optional spaCy config file, so this seems like a proper solution to my problem. I'm still confused about why base-model isn't used for tokenization, though, given the documentation I quoted in my original post. :thinking:

Hi @doppio ,

You're on the right track with the config file. For the tokenizer, the best way to go is to pass a config file that references the base model's tokenizer via the prodigy train ... --config [CONFIG] parameter. Specifically, check out the copy_from_base_model callback: you can hook it into the initialize.before_init section of your config, something like this:

# Inside your .cfg file
...
[initialize.before_init]
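# copy the tokenizer (and vocab) from the base pipeline before initialization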
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "your_base_model"
vocab = "your_base_model"
...
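
Then pass that config when training, e.g. (dataset name and paths are placeholders; use whichever component flag matches your annotations):

prodigy train ./output --ner your_dataset --config your_config.cfg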

As for including the tokenizer when passing --base-model, that's something we definitely want to automate in the future!
