I'm using `train textcat` to train my classifiers. This works fine with a standard spaCy base model, but when I try it with `en_trf_xlnetbasecased_lg`, I get an error:
File "/mnt/c/repo/prodigy/recipes/dacs.py", line 407, in train
result = train(**args)
File "/usr/local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 154, in train
nlp.update(docs, annots, drop=dropout, losses=losses)
File "/usr/local/lib/python3.7/site-packages/spacy_transformers/language.py", line 81, in update
tok2vec = self.get_pipe(PIPES.tok2vec)
File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 281, in get_pipe
raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
KeyError: "[E001] No component 'trf_tok2vec' found in pipeline. Available names: ['textcat']"
This is currently expected: the transformers text classifier is a different implementation with its own component and component dependencies (token-to-vector encoding, tokenization alignment and so on). The underlying problem is that the train recipe disables all components except the one you're training, which normally makes sense, because that's the only one you want to update. But that doesn't work for this component, because it depends on other components being present in the pipeline.
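For illustration, here's a minimal sketch of what goes wrong, assuming the spacy-transformers v0.x pipe names and that the XLNet model is installed:

```python
import spacy

# Assumes spacy-transformers v0.x and an installed en_trf_xlnetbasecased_lg.
nlp = spacy.load("en_trf_xlnetbasecased_lg")
print(nlp.pipe_names)
# e.g. ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']

# The train recipe effectively does this before calling nlp.update():
# disable everything that isn't the component being trained. With a
# transformer model, that also removes trf_tok2vec, which update()
# needs, hence the E001 KeyError above.
disabled = nlp.disable_pipes("trf_wordpiecer", "trf_tok2vec")
print(nlp.pipe_names)  # trf_tok2vec is gone, so nlp.update() would fail
disabled.restore()
```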
You can probably work around it by editing the recipe and the call to `nlp.disable_pipes` so the transformer components stay enabled (see the sketch below). However, you're probably still better off using the standalone training script we provide in the spacy-transformers repo. To get good results with the transformer models, you typically want to tune the hyperparameters, and you probably also want to run the training on GPU. Both of these are much easier with a standalone script.
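As a rough sketch of that workaround (pipe names again assume spacy-transformers v0.x, and the exact recipe internals may differ between Prodigy versions), the idea is to exclude the transformer components from the `disable_pipes` call. Here, `nlp`, `docs`, `annots`, `dropout` and `losses` are the variables already in scope in the recipe's training loop, as seen in the traceback:

```python
# Inside the recipe's training loop: keep the transformer components
# that the textcat depends on enabled while everything else is disabled.
keep = ("textcat", "trf_wordpiecer", "trf_tok2vec")
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in keep]
with nlp.disable_pipes(*other_pipes):
    nlp.update(docs, annots, drop=dropout, losses=losses)
```

If you go the standalone route instead, the examples directory in the spacy-transformers repo is the place to look for the text classification training script.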