data-to-spacy --base-model usage

I have the following problem (I assume it appeared after a Prodigy update).
We use a custom tokenizer together with the data-to-spacy utility. To be able to use our custom tokenizer, we pass an empty model (no components added to the pipeline) containing only our custom tokenizer via the --base-model parameter.

I'm not sure why the base model is expected to contain a tok2vec component.
Any advice on how to fix this?

Hi @TatyanaKavalenkaTR,

Thanks for your question.

Can you provide your Prodigy version? Ideally, please share the output of prodigy stats.

I think you may be running into this issue:

I think this was transformers related, but I'll need to check with colleagues.

Also, just curious: can you run spacy debug config? This helps us rule out an error in your config. Even better if you could share the config too, but if not, please at least run debug config.

Prodigy version is 1.13.1
Prodigy stats:

The mentioned issue looks similar to mine. But I'd like to note that my base model contains neither a transformer nor a tok2vec component.

This is the output of spacy debug config:

Hey @TatyanaKavalenkaTR ,

I think as a first step we should confirm which components your prodigy_base_model pipeline contains. You say you "expect" it to have tok2vec, but we can confirm this quickly by loading it with spaCy:

import spacy
nlp = spacy.load("prodigy_base_model")
print(nlp.pipe_names)

which should print the names of components.

It should contain a tok2vec embedding layer (or a transformer embedding layer if you use transformers, but that is not the case here).
If it does contain a tok2vec component, we need to look for the problem inside data-to-spacy. If it doesn't, we should look at how prodigy_base_model was created. In that case, it would be good if you could share the steps and the config of the model.
Thanks!

No, I use an empty model (no pipeline components added) with only my custom tokenizer.
So pipe_names is an empty list.
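
For reference, a base model of this kind can be built roughly as follows. This is only a minimal sketch: the whitespace-only Tokenizer stands in for the real custom tokenizer, and the language code "en" and the temporary output directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokenizer import Tokenizer

# Blank pipeline: no trained components, only a tokenizer.
nlp = spacy.blank("en")

# Stand-in for the real custom tokenizer (assumption): a Tokenizer created
# with only the vocab has no prefix/suffix/infix rules and splits on whitespace.
nlp.tokenizer = Tokenizer(nlp.vocab)

tokens = [t.text for t in nlp("state-of-the-art models, e.g. spaCy")]
print(tokens)

# No components were added, so the list of pipe names is empty.
print(nlp.pipe_names)

# Save the pipeline so the directory can be passed as --base-model.
# The directory location is hypothetical.
out_dir = Path(tempfile.mkdtemp()) / "prodigy_base_model"
nlp.to_disk(out_dir)

# Loading it back gives the same empty component list.
reloaded = spacy.load(out_dir)
print(reloaded.pipe_names)
```

Loading the saved directory and checking pipe_names is the same check suggested earlier in the thread.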

My goal is to have the custom tokenizer used when preparing the spaCy files exported from the Prodigy dataset, as mentioned in the option's description.

Config file attached:
config.html (1.9 KB)

Hi @TatyanaKavalenkaTR,

Sorry for the delayed response. It's true that, according to the docs, --base-model is only used for tokenization and sentence segmentation, but the data-to-spacy recipe expects the components for which the training data is being generated to be present in it. In your case that would be tok2vec and textcat_multilabel.
Additionally, sourcing the custom tokenizer is currently not automated: you'd have to add the instruction to source it to the config file.
Here you can find a dedicated post with examples: Train recipe uses different Tokenizer than in ner.manual - #2 by magdaaniol
It's in the context of the train recipe, but the handling of the --base-model parameter is the same.
Finally, out of curiosity, why do you need a custom tokenizer here?
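
On sourcing the tokenizer from the config: one mechanism spaCy provides is the spacy.copy_from_base_model.v1 callback in the [initialize.before_init] block, which copies the tokenizer and vocab from an existing pipeline before training starts. The sketch below only parses such a fragment to show its shape. Using prodigy_base_model as the source path is an assumption based on the name used earlier in this thread, and whether this matches the linked post exactly should be checked there.

```python
from thinc.api import Config

# Config fragment (assumed paths): copy tokenizer and vocab from the
# base model directory before the pipeline is initialized.
CONFIG_FRAGMENT = """
[initialize]

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "prodigy_base_model"
vocab = "prodigy_base_model"
"""

# Parse the fragment the same way spaCy parses config.cfg files.
config = Config().from_str(CONFIG_FRAGMENT)
before_init = config["initialize"]["before_init"]

print(before_init["@callbacks"])
print(before_init["tokenizer"])
print(before_init["vocab"])
```

In a real setup, the [initialize.before_init] block would go into the config.cfg used for training rather than into a Python string.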

For training I use the config from this spaCy project (there is no tok2vec component in it): https://github.com/explosion/projects/blob/v3/pipelines/textcat_multilabel_demo/configs/config.cfg

I assume that, since I need to train and use only one component, there is no need for a separate tok2vec.