Custom Word Vectors

If I make my own custom word vectors using spacy init-model, do I have to retrain the other parts of the pipeline, i.e. tokenizer, tagger, parser, ner? Or is there a way for me to add word vectors to an existing spaCy model? What is the procedure to go from word vectors (say, from fastText) to something with functionality equivalent to the en_core_web_md model? Is the training data for en_core_web_md available, just in case I need to redo the training for the whole pipeline of an existing model?

Hi! If word vectors are present during training, they'll be used as features. So if a model was trained with vectors (e.g. en_core_web_md) and you're updating the vectors, you do have to retrain the statistical components (tagger, parser, ner). Otherwise, you'll end up with much worse (or even completely useless) results. The tokenizer is rule-based, so it doesn't need retraining.

The en_core_web_sm model was trained without vectors, so you can add your own vectors to it if you just want vectors for similarity queries etc.
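To make that concrete, here's a rough sketch of adding fastText vectors to en_core_web_sm. It assumes the vectors are in fastText's text (.vec) format; the file name cc.en.300.vec, the output directory and both function names are just placeholders for illustration. The only spaCy calls used are spacy.load, nlp.vocab.set_vector and nlp.to_disk. Treat it as a starting point rather than the recommended workflow.

```python
def read_vec(lines):
    """Parse fastText's text (.vec) format into (width, {word: [floats]}).

    The first line is a header "<n_words> <width>"; every following line is
    "<word> <v1> <v2> ...". Rows with the wrong number of values are skipped.
    """
    lines = iter(lines)
    _n_words, width = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != width:
            continue  # malformed row, ignore it
        vectors[word] = [float(v) for v in values]
    return width, vectors


def add_vectors_to_small_model(vec_path="cc.en.300.vec",
                               out_dir="./en_sm_with_vectors"):
    """Attach vectors to en_core_web_sm and save the result (paths are hypothetical).

    The imports are local so read_vec stays usable without spaCy installed.
    Note that this reads the whole file into memory, which is a lot for the
    full fastText tables; you may want to keep only the most frequent rows.
    """
    import numpy
    import spacy

    nlp = spacy.load("en_core_web_sm")  # trained without vectors
    with open(vec_path, encoding="utf8") as handle:
        _width, vectors = read_vec(handle)
    for word, values in vectors.items():
        nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))
    nlp.to_disk(out_dir)
    return nlp
```

After that, spacy.load on the saved directory should give you a pipeline whose tagger, parser and ner behave as before, with the new vectors available for similarity.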

The English models are trained on the OntoNotes 5 data (OntoNotes Release 5.0, distributed by the Linguistic Data Consortium). It's available for research, but if you want to use it commercially, you have to purchase an LDC membership (which costs something like $25k, so not sure if that's a viable option).