Similar models to en_core_web_lg/en_vectors_web_lg

I have been using en_core_web_lg/en_vectors_web_lg for the Prodigy train recipe for both ner and textcat. I find them very impressive.

I was wondering whether you have documentation of any other similar third-party models that can be used in its place. Are there any specific state-of-the-art models (from organizations like Stanford, AllenNLP, Flair, or SparkNLP) for ner and textcat that plug in easily with the train recipe?

The en_core_web_lg and en_vectors_web_lg packages include word vectors, which are used as features in the model. You can easily create your own using any available pretrained vectors, e.g. from FastText: https://v2.spacy.io/usage/vectors-similarity#converting. This isn't really "state-of-the-art", but it's a nice efficiency trade-off, because you can easily train these models on your local machine, even without a GPU.
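To give a feel for what those pretrained vectors contain, here's a minimal, spaCy-independent sketch of reading FastText's plain-text `.vec` format (a `count dim` header line, then one `word v1 v2 ...` line per word) and comparing two words. The function names and file path are hypothetical, not part of any library API:

```python
import math

def load_vec(path):
    """Load word vectors from FastText's plain-text .vec format:
    a 'count dim' header line, then one 'word v1 v2 ...' line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the 'count dim' header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice you wouldn't do this by hand: spaCy's CLI (`init-model` in v2, as described in the link above) packages such a `.vec` file into a loadable model for you.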

In spaCy v3, you can initialize your pipelines with transformer weights (including any embeddings available via the Hugging Face transformers library). This gives you results right up at the current state of the art for these tasks, so if you've been seeing good results training with vectors only, you'll likely get a boost in accuracy from initializing with transformer embeddings.
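In the v3 training config, that initialization is expressed as a `transformer` component whose `name` can be any model on the Hugging Face hub. A minimal fragment (the model name here is just an example):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"
```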

spaCy also lets you share a single transformer across multiple components (e.g. ner and textcat), which makes your pipelines more efficient. You can try it out by exporting your annotations with data-to-spacy and converting them to the new v3 format with spacy convert. You can then generate a training config for your specific requirements (language, components etc.) and train your pipeline with spacy train.

Make sure to use a separate virtual environment, since the latest stable Prodigy requires spaCy v2. We have a pre-release out that updates Prodigy to spaCy v3 – you can read more about it here: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans, improved feeds & more


Thanks so much for the response, @ines. This is indeed very helpful.

@ines,

I followed the steps to train a pipeline with spacy train (v3) and unfortunately, I get this error:
ValueError: [E913] Corpus path can't be None. Maybe you forgot to define it in your config.cfg or override it on the CLI?

As you suggested, I did the following:
prodigy data-to-spacy ./dataset_rhetorics_1.json --lang en --textcat dataset_rhetorics_1

and then in spacy v3 environment, ran the following:
python -m spacy convert ./dataset_rhetorics_1.json . --converter json --lang en

Finally, I ran this line:
python -m spacy train config.cfg --output ./output --paths.train ./dataset_rhetorics_1.spacy --verbose

This throws the ValueError: [E913].

Not sure what I am doing wrong.

I think you forgot to override --paths.dev, i.e. the development data to evaluate the model on. If you don't have a dedicated evaluation set yet, you can also just take the JSON data you exported and split it.
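If it helps, here's a minimal sketch of such a split, assuming the exported file is a JSON list of documents (as in spaCy v2's training format). The function name and paths are hypothetical:

```python
import json
import random

def split_corpus(path, train_path, dev_path, dev_fraction=0.2, seed=0):
    """Split a JSON training corpus (a list of documents) into
    separate train and dev files, shuffling reproducibly first."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    random.Random(seed).shuffle(docs)
    n_dev = max(1, int(len(docs) * dev_fraction))  # keep at least one dev doc
    with open(dev_path, "w", encoding="utf-8") as f:
        json.dump(docs[:n_dev], f)
    with open(train_path, "w", encoding="utf-8") as f:
        json.dump(docs[n_dev:], f)
    return len(docs) - n_dev, n_dev
```

You'd then run spacy convert on each file and pass the resulting .spacy files via --paths.train and --paths.dev.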


Many thanks, @ines, for pointing out my oversight. This should resolve the issue.