I am trying to train an NER model from scratch with custom labels. I have word vectors pretrained with Gensim on a large corpus of r/wallstreetbets data. I need help determining the workflow from here to a preliminary model.
Right now I have the vectors as well as a labeled dataset containing 270 examples in my database. I'd like to train a model using my word vectors that will then be used with the ner.correct recipe.
Hi! If you have your vectors exported in word2vec text format from gensim (save_word2vec_format), you can initialize a base model with spacy init vectors:
python -m spacy init vectors en /path/to/vectors.vec /path/to/spacy_vectors
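Before running that command, it can help to confirm the exported file really is in word2vec text format: a header line with the vocabulary size and the vector width, then one line per token followed by its values. Here is a minimal sketch of such a check using only the standard library. The function name check_word2vec_text is made up for illustration, and it assumes tokens contain no spaces.

```python
from pathlib import Path


def check_word2vec_text(path):
    """Return (n_rows, dim) if the file body matches its header, else raise ValueError.

    Hypothetical helper for illustration; assumes tokens contain no spaces.
    """
    with Path(path).open(encoding="utf8") as f:
        header = f.readline().split()
        if len(header) != 2:
            raise ValueError("first line should be '<vocab_size> <dimensions>'")
        n_rows, dim = int(header[0]), int(header[1])
        seen = 0
        for line_no, line in enumerate(f, start=2):
            line = line.rstrip()
            if not line:
                continue  # ignore blank lines, e.g. a trailing newline
            parts = line.split(" ")
            # Each row is one token followed by `dim` numeric values.
            if len(parts) != dim + 1:
                raise ValueError(
                    f"line {line_no}: expected {dim} values, got {len(parts) - 1}"
                )
            for value in parts[1:]:
                float(value)  # raises ValueError if a value is not numeric
            seen += 1
    if seen != n_rows:
        raise ValueError(f"header promises {n_rows} rows, file has {seen}")
    return n_rows, dim
```

Calling check_word2vec_text("/path/to/vectors.vec") returns the row count and vector width if everything lines up, and raises a ValueError pointing at the first problem otherwise.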
Then use /path/to/spacy_vectors as the base model when training with prodigy:
Ah, wait, I was wrong about the prodigy side of things. Using --base-model on its own doesn't actually enable the vectors in the new ner component when training in prodigy by default. Let me have a look...
Edited to add:
One option is to generate a config with vectors using spacy init config -o accuracy and then set the vectors location in a prodigy train override:
spacy init config -l en -p ner -o accuracy /path/to/config.cfg
prodigy train --ner dataset --config /path/to/config.cfg --initialize.vectors /path/to/spacy_vectors
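If you want to double-check that the generated config is actually set up for static vectors, here is a rough sketch that reads it with Python's standard configparser. Some assumptions: as far as I know the relevant settings are an include_static_vectors key on the embedding layer and a vectors key under [initialize]; SAMPLE_CONFIG below is a hand-written fragment for illustration, not real init config output; and vectors_settings is a made-up helper name. spaCy's own config loader is the authoritative way to read these files, so treat this as a quick sanity check only.

```python
import configparser

# Hand-written fragment for illustration only, not real `spacy init config` output.
SAMPLE_CONFIG = """
[paths]
vectors = null

[components.tok2vec.model.embed]
include_static_vectors = true

[initialize]
vectors = ${paths.vectors}
"""


def vectors_settings(config_text):
    """Return ([initialize] vectors value, {section: include_static_vectors value}).

    Hypothetical helper: values are returned as raw strings, variables like
    ${paths.vectors} are not resolved.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    static = {
        section: parser.get(section, "include_static_vectors")
        for section in parser.sections()
        if parser.has_option(section, "include_static_vectors")
    }
    init_vectors = parser.get("initialize", "vectors", fallback=None)
    return init_vectors, static


init_vectors, static = vectors_settings(SAMPLE_CONFIG)
print("initialize.vectors:", init_vectors)
print("include_static_vectors:", static)
```

For a real file you could pass in the text of /path/to/config.cfg instead of SAMPLE_CONFIG. The --initialize.vectors override in the prodigy train command then supplies the vectors location at training time.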