Is there a way in Prodigy to train only a Tok2Vec layer on a dataset, and then use that model as the Tok2Vec for NER/spancat components trained on subsets of the data, so that each component doesn't have to train its own Tok2Vec?
Not really, no. Instead you'll want to use the `data-to-spacy` utility and then proceed with spaCy.
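As a rough sketch (the dataset names and paths here are placeholders, not from your setup), the workflow would look something like:

```shell
# Export all Prodigy annotations into one spaCy corpus plus a config.
# "ner_dataset" and "spans_dataset" are placeholder dataset names.
prodigy data-to-spacy ./corpus --ner ner_dataset --spancat spans_dataset

# Then train with spaCy directly; the generated config connects both
# components to a single shared tok2vec via listeners.
python -m spacy train ./corpus/config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

That way the NER and spancat components share one Tok2Vec during training, which is the closest supported setup to what you're describing.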
The big question here is what objective function you're going to use to train the Tok2Vec layer. Usually you attach a task, and the tok2vec layer learns representations useful for that task. But if you're training it directly, what's the objective?
spaCy does support a `pretrain` command that lets you use our "language modelling with approximate outputs" objective. This is a nice compromise for smaller models, so you might find it helpful. If you're going to be pretraining a transformer, we don't have direct support for that currently, as it tends to be a larger job.
Thanks for the quick reply. I was thinking about word/token vectors that ideally cover the entire vocabulary, whereas the subset I use to train the NER/span components does not.
I'm not 100% sure I understand, but I think `spacy pretrain` might indeed be what you're looking for.
It lets you initialize the representations from raw text, of which you hopefully have more than text annotated with NER/span labels.
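Concretely, a sketch of how the pieces fit together (the output directory, raw-text path, and checkpoint filename below are illustrative; `spacy pretrain` saves numbered `model*.bin` checkpoints, so use whichever one it actually wrote):

```ini
# Run pretraining on raw text first, e.g.:
#   python -m spacy pretrain config.cfg ./pretrain_out --paths.raw_text raw.jsonl
# Then, in the training config, point init_tok2vec at a saved checkpoint
# so the tok2vec weights are loaded before NER/spancat training starts.

[paths]
init_tok2vec = "pretrain_out/model-last.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```

Note this only initializes the Tok2Vec weights; they'll still be updated during NER/spancat training unless you freeze the component.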