Sorry for the delay in response. It's true that according to the docs the `--base-model` is only used for tokenization and sentence segmentation, but the `data-to-spacy` recipe expects the components for which the training data is being generated to be present. In your case that would be `tok2vec` and `textcat-multilabel`.
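If your current base model doesn't ship those components, one possible workaround is to add them to the pipeline and save it to disk before running `data-to-spacy`. A minimal sketch, where the model path and output path are placeholders for your own setup:

```python
# Minimal sketch: ensure the base model contains the components data-to-spacy
# expects. Paths and model names are placeholders, not from this thread.
import spacy

nlp = spacy.load("./my_custom_base_model")   # your pipeline with the custom tokenizer

# Note: the spaCy component name uses an underscore (textcat_multilabel),
# even though the Prodigy CLI flag is written as --textcat-multilabel.
for name in ("tok2vec", "textcat_multilabel"):
    if name not in nlp.pipe_names:
        nlp.add_pipe(name)

nlp.to_disk("./base_model_with_components")  # pass this path to --base-model
```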
Additionally, sourcing the custom tokenizer is currently not automated; you'd have to add the instruction to source it in the config file yourself.
Here you can find a dedicated post with examples: Train recipe uses different Tokenizer than in ner.manual - #2 by magdaaniol
It's written in the context of the `train` recipe, but the handling of the `--base-model` parameter is the same.
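In case it helps, here's a rough sketch of what registering a custom tokenizer and sourcing it from the config can look like. The file name `functions.py` and the registry name `custom_tokenizer` are placeholders, and the plain `Tokenizer` just stands in for your own implementation (the linked post has the full example):

```python
# functions.py -- registers the custom tokenizer so the config can reference it.
# In the training config, source it via:
#   [nlp.tokenizer]
#   @tokenizers = "custom_tokenizer"
import spacy
from spacy.tokenizer import Tokenizer

@spacy.registry.tokenizers("custom_tokenizer")
def create_custom_tokenizer():
    def create_tokenizer(nlp):
        # Replace this with the construction of your actual tokenizer
        return Tokenizer(nlp.vocab)
    return create_tokenizer
```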
Finally, out of curiosity, why do you need a custom tokenizer here?