Base model without tok2vec throws error

Hello!

I have had an issue since version v1.11.12. As stated in the docs, a bug around --base-model usage was fixed. When I try to use a base model for NER on a simple dataset (I'm using fr_dep_news_trf), Prodigy returns the following error:

KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer']"

Which is actually expected, since the fr_dep_news_trf model does not have a tok2vec component! It seems that the prodigy train recipe assumes any model used for training has a tok2vec component, even though spacy-transformers appears to allow a different convention: using a transformer component directly instead of a tok2vec.

Here is the command I use, for reference:
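To illustrate the mismatch, here is a minimal sketch that checks which embedding component a pipeline exposes. The component names are copied from the error message above; the helper `find_embedding_component` and the tuple `EMBEDDING_COMPONENTS` are hypothetical names used only for this illustration and are not part of spaCy or Prodigy.

```python
# Component names copied from the error message above.
PIPE_NAMES = ["transformer", "morphologizer", "parser", "attribute_ruler", "lemmatizer"]

# Assumption: the pipeline uses the default component names. A custom
# pipeline could register its embedding layer under a different name.
EMBEDDING_COMPONENTS = ("tok2vec", "transformer")


def find_embedding_component(pipe_names):
    """Return the first known embedding component in pipe_names, or None."""
    for name in EMBEDDING_COMPONENTS:
        if name in pipe_names:
            return name
    return None


print(find_embedding_component(PIPE_NAMES))  # prints "transformer"
```

With spaCy and the model installed, the same list should be available as `spacy.load("fr_dep_news_trf").pipe_names`, so the check can be run against the real pipeline rather than the hard-coded names.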

prodigy train ./outputs/fr_model/ --ner my_dataset --base-model fr_dep_news_trf --gpu-id 0

my_dataset contains only 'ORG' annotations for a simple NER model, since that is what I need to identify.

Am I wrong? Is there a way to work around this issue, or should I write my own training recipe to suit my needs?

Kind regards,

We're aware of that error. It's an issue with transformers that's unrelated to the bug fixed in v1.11.12, and it originates further upstream, in the spaCy codebase. The team is aware of it and is currently working on a fix.

Another thread on this issue can be found here:

That thread also has a temporary workaround that involves writing a custom config.cfg file. It's not a proper solution, but it might serve as a remedy for now.

Thank you for the quick response! I looked for previous threads but didn't find this one. I will look into the temporary workaround, and I look forward to the team fixing this. I guess I can delete this topic then?

It can't hurt to leave it open now, since a link to the other thread exists.

Now that I think of it, it might even be better to leave the topic open because, as you say, you weren't able to find the thread yourself. If we keep this open, maybe Google/Discourse will have an easier time indexing more appropriate keywords that eventually lead to the right thread.