Do word vectors have an effect on NER accuracy?

Do word vectors have any effect on the accuracy of NER?

If they don’t, how do I remove the vectors from an existing model that has them?

If they do, does it make sense to replace the vectors with ones created only from my input data? How?

I see noticeable differences in NER accuracy between the en_core_web_sm, en_core_web_md and en_core_web_lg models (en_core_web_md outperforms the others). If the vectors are not the cause, I’d like to remove them to reduce model size and loading time. If they do have an effect, I’d like to try vectors that are relevant to my data.


Yes, the vectors are used as features if present, so training vectors on your own data should be helpful. The best way to achieve this is with the spacy init-model command, which accepts a word vectors file in word2vec’s or FastText’s plain-text format. You might try the FastText vectors from here:
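For reference, a minimal sketch of the plain-text word2vec format that init-model accepts (the words, values, and paths here are toy placeholders, not real vectors): the first line gives the vocabulary size and vector dimensionality, followed by one word and its space-separated values per line.

```python
import os
import tempfile

# Toy 3-dimensional vectors; real vectors would come from training
# word2vec, FastText, etc. on your own corpus.
vectors = {
    "apple": [0.1, 0.2, 0.3],
    "banana": [0.4, 0.5, 0.6],
}

path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
with open(path, "w", encoding="utf8") as f:
    # Header line: "<number of words> <vector dimensionality>"
    f.write(f"{len(vectors)} 3\n")
    for word, values in vectors.items():
        f.write(word + " " + " ".join(str(v) for v in values) + "\n")

# The resulting file can then be passed to init-model, e.g.:
#   python -m spacy init-model en ./my_model --vectors-loc vectors.txt
```

FastText’s `.vec` files already use this layout, so they can be passed to `--vectors-loc` directly.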

The en_core_web_md model has the same initial vectors as en_core_web_lg, but only keeps the rows for the 20k most frequent words in the vocab. All other words are mapped to their nearest neighbour within those frequent words. This works pretty well: the top 20k most frequent words should cover more than 95% of the tokens in the text, and many other tokens still get some representation. You can activate this setting with the --prune-vectors flag on spacy init-model.