data-to-spacy --base-model usage

Hi @TatyanaKavalenkaTR,

Sorry for the delay in response. It's true that according to the the docs the --base-model is only used for tokenization and sentence segmentation but the data-to-spacy recipe expects the components for which the training data is being generated to be present. In your case that would be tok2vec and textcat-multilabel .
Additionally the sourcing of the custom tokenizer is currently not automated, you'd have to provide the instruction to source it in the config file.
Here you can find a dedicted post with examples: Train recipe uses different Tokenizer than in ner.manual - #2 by magdaaniol
It's in the context of train recipe but the handling of the --base-modelparameter is the same.
Finally, out of curiosity why do you need a custom tokenizer here?