data-to-spacy --base-model usage

TatyanaKavalenkaTR · September 7, 2023, 11:05am

I have the following problem (I assume it appeared after a Prodigy update).
We use a custom tokenizer together with the data-to-spacy utility. To be able to use our custom tokenizer, we pass the --base-model parameter an empty model (no components added to the pipeline) that contains only our custom tokenizer.

I'm not sure why the base model is expected to contain a tok2vec component.
Any advice on how to fix this?

ryanwesslen · September 7, 2023, 12:11pm

Hi @TatyanaKavalenkaTR,

Thanks for your question.

Can you provide your Prodigy version? Ideally, include the output of prodigy stats.

I think you may be running into this issue:

I think this was transformers related, but I'll need to check with colleagues.

Also, just curious: can you run spacy debug config? This helps us rule out an error in your config. Even better if you could share the config too, but if not, please at least run debug config.

TatyanaKavalenkaTR · September 7, 2023, 4:23pm

Prodigy version is 1.13.1
Prodigy stats:

The mentioned issue looks similar to mine. But I'd like to note that my base model contains neither transformers nor tok2vec.

This is the output of spacy debug config:

magdaaniol · September 8, 2023, 9:25am

Hey @TatyanaKavalenkaTR,

I think as a first step we should confirm which components your prodigy_base_model pipeline contains. You say you "expect" it to have tok2vec, but we can confirm that quickly by loading it with spaCy:

import spacy

# Load the base model and list the components in its pipeline
nlp = spacy.load("prodigy_base_model")
print(nlp.pipe_names)

which should print the names of the components.

It should contain a tok2vec embedding layer (or a transformer embedding layer if you use transformers, but that is not the case here).
If it does contain a tok2vec component, we need to look for a problem inside data-to-spacy. If it doesn't, we should look at how the prodigy_base_model was created. In that case, it would be good if you could share the steps and the config of the model.
Thanks!

TatyanaKavalenkaTR · September 8, 2023, 10:20am

No, I use an empty model (no pipeline components added) with only my custom tokenizer.
So pipe_names is an empty list.

My initial goal is to use the custom tokenizer when preparing the spaCy files exported from a Prodigy dataset, as mentioned in the option description.

Config file attached:
config.html (1.9 KB)

magdaaniol · September 13, 2023, 6:18am

Hi @TatyanaKavalenkaTR,

Sorry for the delay in response. It's true that, according to the docs, --base-model is only used for tokenization and sentence segmentation, but the data-to-spacy recipe expects the components for which the training data is being generated to be present. In your case that would be tok2vec and textcat-multilabel.
Additionally, the sourcing of the custom tokenizer is currently not automated: you'd have to provide the instruction to source it in the config file.
Here you can find a dedicated post with examples: Train recipe uses different Tokenizer than in ner.manual - #2 by magdaaniol
It's in the context of the train recipe, but the handling of the --base-model parameter is the same.
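For reference, here is a minimal sketch of what such an instruction could look like, assuming it relies on spaCy's spacy.copy_from_base_model.v1 callback (the linked post may differ in detail). The fragment is held in a Python string, and the standard-library configparser is used only to sanity-check that it parses; spaCy reads configs with its own loader. The name prodigy_base_model is the base model mentioned earlier in this thread, so adjust it to your setup.

```python
import configparser

# Name or path of the saved base model (taken from this thread; adjust as needed).
BASE_MODEL = "prodigy_base_model"

# Config fragment asking spaCy to copy the tokenizer and vocab from the base
# model before the pipeline is initialized.
FRAGMENT = f"""
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "{BASE_MODEL}"
vocab = "{BASE_MODEL}"
"""

# Sanity check with the standard library only: confirm the section and keys parse.
# Interpolation is disabled so the values are read exactly as written.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(FRAGMENT)
section = parser["initialize.before_init"]

for key in ("@callbacks", "tokenizer", "vocab"):
    print(f"{key} = {section[key]}")
```

The text in FRAGMENT is what would go into the training config; the Python wrapper is only there to verify that the section is well-formed before you paste it in.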
Finally, out of curiosity, why do you need a custom tokenizer here?

TatyanaKavalenkaTR · September 13, 2023, 7:43pm

For training I use the config from this spaCy project (there is no tok2vec component in it): https://github.com/explosion/projects/blob/v3/pipelines/textcat_multilabel_demo/configs/config.cfg

I assumed that, since I need to train and use only one component, there is no need for a separate tok2vec.
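A quick way to double-check which components a config declares, without loading spaCy, is to read the [nlp] pipeline setting directly. The sketch below uses only the Python standard library (configparser and json, not spaCy's own config loader), and CONFIG_TEXT is an illustrative stand-in, not the contents of the linked file.

```python
import configparser
import json

# Illustrative config text (an assumption for this example, not the linked file).
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
"""


def declared_pipeline(config_text):
    """Return the list of component names from the [nlp] pipeline setting."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Values in spaCy configs are JSON-formatted, so the list decodes directly.
    return json.loads(parser["nlp"]["pipeline"])


pipeline = declared_pipeline(CONFIG_TEXT)
print("declared components:", pipeline)
print("has separate tok2vec:", "tok2vec" in pipeline)
```

To check a real file, read its text first (for example with pathlib.Path("config.cfg").read_text()) and pass it to declared_pipeline.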
