The `en_core_web_lg` and `en_vectors_web_lg` packages include word vectors, which are used as features in the model. You can easily create your own using any available pretrained vectors, e.g. from FastText: https://v2.spacy.io/usage/vectors-similarity#converting

This isn't really "state-of-the-art", but it's a nice efficiency trade-off, because you can easily train these models on your local machine using only a CPU.
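For example, here's a minimal sketch using spaCy v2's `init-model` command to package pretrained FastText vectors (the vectors file name is just an example download from fasttext.cc):

```bash
# Build a blank English model containing the pretrained vectors.
# cc.en.300.vec.gz is an example FastText vectors file from fasttext.cc.
python -m spacy init-model en ./vectors_model --vectors-loc cc.en.300.vec.gz
```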
In spaCy v3, you can initialize your pipelines with transformer weights (including any embeddings available via the Hugging Face `transformers` library). This gives you results right up at the current state of the art for these tasks, so if you've been seeing good results training with vectors only, you'll likely get a boost in accuracy from initializing with transformer embeddings.
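If you want to try this, a sketch of generating a transformer-based training config with spaCy v3 (assuming `spacy-transformers` is installed; the `-G`/`--gpu` flag of `spacy init config` selects a transformer-based setup):

```bash
# Install spaCy v3 with transformer support (in a fresh environment).
pip install "spacy[transformers]"
# Generate a config for an English ner + textcat pipeline; -G picks a
# transformer-based (GPU-preferred) config instead of a CPU-optimized one.
python -m spacy init config config.cfg --lang en --pipeline ner,textcat -G
```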
spaCy also lets you share a single transformer across multiple components (e.g. `ner` and `textcat`), which makes your pipelines more efficient. You can try it out by exporting your annotations with `data-to-spacy` and converting them to the new v3 format with `spacy convert`. You can then generate a training config for your specific requirements (language, components etc.) and train your pipeline with `spacy train`.
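Put together, the workflow might look roughly like this; it's a sketch with hypothetical dataset and path names, assuming the stable Prodigy's JSON export:

```bash
# 1) Export annotations from Prodigy (runs in the spaCy v2 environment;
#    "my_ner_dataset" is a hypothetical dataset name):
prodigy data-to-spacy ./annotations.json --lang en --ner my_ner_dataset
# 2) In the spaCy v3 environment, convert the v2-style JSON to the binary
#    .spacy format (written to ./corpus/annotations.spacy):
python -m spacy convert ./annotations.json ./corpus
# 3) Train with the generated config, pointing it at the converted data
#    (a separate dev export/conversion is assumed for --paths.dev):
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/annotations.spacy --paths.dev ./corpus/dev.spacy
```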
Make sure to use a separate virtual environment, since the latest stable Prodigy requires spaCy v2. We have a pre-release out that updates Prodigy to spaCy v3 – you can read more about it here: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans, improved feeds & more