I have the following problem (I assume it appeared after a Prodigy update).
We use a custom tokenizer together with the data-to-spacy utility. To be able to use our custom tokenizer, we pass the --base-model parameter an empty model (no components added to the pipe) that contains only our custom tokenizer.
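Roughly, the base model was built like this (a simplified sketch; "en" and the default Tokenizer are stand-ins for our actual setup):

```python
import spacy
from spacy.tokenizer import Tokenizer

# Simplified sketch: a blank pipeline with no components added,
# carrying only the custom tokenizer.
nlp = spacy.blank("en")
nlp.tokenizer = Tokenizer(nlp.vocab)  # our real custom tokenizer goes here

# Saved to disk so it can be passed to data-to-spacy via --base-model.
nlp.to_disk("prodigy_base_model")
```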
Can you provide your Prodigy version? Ideally, please share the output of prodigy stats.
I think you may be running into this issue:
I think this was transformers-related, but I'll need to check with colleagues.
Also, just curious: can you run spacy debug config? This helps us rule out a simple error in your config. Even better if you could share the config itself, but if not, the debug config output alone would help.
I think as a first step we should confirm which components your prodigy_base_model pipeline contains. You say you "expect" it to have tok2vec, but we can confirm that quickly by loading it with spaCy (adjust the path to wherever your base model lives):
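```python
import spacy

# Load the model directory that was passed to --base-model.
nlp = spacy.load("prodigy_base_model")

# List the components in the pipeline.
print(nlp.pipe_names)

# Double-check which tokenizer is actually in use.
print(type(nlp.tokenizer))
```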
It should contain a tok2vec embedding layer (or a transformer embedding layer if you were using transformers, which is not the case here).
If it does contain a tok2vec component, we need to look for a problem inside data-to-spacy. If it doesn't, we should look at how prodigy_base_model was created; in that case it would be good if you could share the steps and the config used to build the model.
Thanks!
Sorry for the delay in response. It's true that, according to the docs, the --base-model is only used for tokenization and sentence segmentation, but the data-to-spacy recipe expects the components for which the training data is being generated to be present. In your case that would be tok2vec and textcat_multilabel.
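So the base model would need those components added before being passed to data-to-spacy. A minimal sketch using spaCy's built-in factories (paths are illustrative):

```python
import spacy

# Extend the custom-tokenizer base model with the components that
# data-to-spacy generates training data for.
nlp = spacy.load("prodigy_base_model")
nlp.add_pipe("tok2vec")
nlp.add_pipe("textcat_multilabel")

# Save under a new name and pass this directory via --base-model.
nlp.to_disk("prodigy_base_model_with_components")
```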
Additionally, sourcing the custom tokenizer is currently not automated; you'd have to provide the instructions for sourcing it in the config file yourself.
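For instance, a registered tokenizer factory that the config can then reference might look roughly like this (the registry name and the tokenizer construction are placeholders):

```python
import spacy
from spacy.tokenizer import Tokenizer

@spacy.registry.tokenizers("custom_tokenizer.v1")  # placeholder name
def create_custom_tokenizer():
    def create_tokenizer(nlp):
        # Build and return the actual custom tokenizer here.
        return Tokenizer(nlp.vocab)
    return create_tokenizer

# The config then points at the registered function:
#
# [nlp.tokenizer]
# @tokenizers = "custom_tokenizer.v1"
```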
Here you can find a dedicated post with examples: Train recipe uses different Tokenizer than in ner.manual - #2 by magdaaniol
It's written in the context of the train recipe, but the handling of the --base-model parameter is the same.
Finally, out of curiosity: why do you need a custom tokenizer here?