Sorry for the delay in response. It's true that according to the docs the `--base-model` is only used for tokenization and sentence segmentation, but the `data-to-spacy` recipe expects the components for which the training data is being generated to be present. In your case that would be `tok2vec` and `textcat-multilabel`.
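If your current base model doesn't ship those components, one possible workaround is to add them to the pipeline and save it to disk before running `data-to-spacy`. A minimal sketch, where the model path and output path are placeholders for your own setup:

```python
# Minimal sketch: ensure the base model contains the components data-to-spacy
# expects. Paths and model names are placeholders, not from this thread.
import spacy

nlp = spacy.load("./my_custom_base_model")   # your pipeline with the custom tokenizer

# Note: the spaCy component name uses an underscore (textcat_multilabel),
# even though the Prodigy CLI flag is written as --textcat-multilabel.
for name in ("tok2vec", "textcat_multilabel"):
    if name not in nlp.pipe_names:
        nlp.add_pipe(name)

nlp.to_disk("./base_model_with_components")  # pass this path to --base-model
```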
Additionally, sourcing the custom tokenizer is currently not automated; you'd have to add the instruction to source it in the config file yourself.
Here you can find a dedicated post with examples: Train recipe uses different Tokenizer than in ner.manual - #2 by magdaaniol
It's written in the context of the `train` recipe, but the handling of the `--base-model` parameter is the same.
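In case it helps, here's a rough sketch of what registering a custom tokenizer and sourcing it from the config can look like. The file name `functions.py` and the registry name `custom_tokenizer` are placeholders, and the plain `Tokenizer` just stands in for your own implementation (the linked post has the full example):

```python
# functions.py -- registers the custom tokenizer so the config can reference it.
# In the training config, source it via:
#   [nlp.tokenizer]
#   @tokenizers = "custom_tokenizer"
import spacy
from spacy.tokenizer import Tokenizer

@spacy.registry.tokenizers("custom_tokenizer")
def create_custom_tokenizer():
    def create_tokenizer(nlp):
        # Replace this with the construction of your actual tokenizer
        return Tokenizer(nlp.vocab)
    return create_tokenizer
```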
Finally, out of curiosity, why do you need a custom tokenizer here?