Do word vectors have any effect on the accuracy of NER?
If they don’t, how do I remove vectors from an existing model that has them?
If they do, does it make sense to replace the vectors with ones created only from my input data? How?
I see noticeable differences in NER accuracy between models, including en_core_web_lg (en_core_web_md outperforms the others). If the vectors are not the cause, I’d like to remove them to reduce model size and loading time. If they do have an effect, I’d like to try vectors that are relevant to my data.
Yes, the vectors are used as features if present, so training vectors on your own data should be helpful. The best way to achieve this is with the spacy init-model command, which accepts a word vectors file in word2vec or FastText’s plain text format. You might try the FastText vectors from here: https://fasttext.cc/docs/en/english-vectors.html
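For reference, here is a minimal sketch of what a vectors file in the word2vec plain text format looks like: a header line with the word count and dimensionality, then one line per word with space-separated values. The helper name write_word2vec_text, the toy vectors and the file name my_vectors.txt are made up for illustration. The command in the final comment assumes the spaCy v2.x CLI, so check it against your installed version.

```python
from pathlib import Path


def write_word2vec_text(vectors, path):
    """Write a {word: [floats]} mapping in word2vec plain text format."""
    dims = {len(vec) for vec in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must have the same dimensionality")
    dim = dims.pop()
    # Header line: "<number of words> <vector dimensionality>"
    lines = [f"{len(vectors)} {dim}"]
    for word, vec in vectors.items():
        # One line per word; the word itself must not contain spaces
        lines.append(word + " " + " ".join(f"{x:.6f}" for x in vec))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical toy vectors; real ones would come from training on your own corpus
toy_vectors = {
    "invoice": [0.1, 0.2, 0.3],
    "refund": [0.0, -0.5, 0.25],
}
write_word2vec_text(toy_vectors, "my_vectors.txt")

# Assumed spaCy v2.x invocation (verify against your version's CLI help):
#   python -m spacy init-model en ./my_model --vectors-loc my_vectors.txt
```

Vectors trained with word2vec or FastText can usually be exported in this text format directly, so a helper like this is mainly useful for checking what the file should contain.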
The en_core_web_md model has the same initial vectors as en_core_web_lg, but only keeps the rows for the 20k most frequent words in the vocab. All other words are mapped to their nearest neighbour within those frequent words. This works pretty well: the top 20k most frequent words should cover more than 95% of the tokens in the text, and many other tokens still get some representation. You can activate this setting with the --prune-vectors flag on