Loading fastText vectors into spaCy/Prodigy

Hi all,
I am trying to add fastText word vectors to spaCy so that I can save a model to use in a NER recipe. The embeddings are domain-specific (legislative text) and were trained on a very large corpus, so I thought they would help NER more than generic pretrained embeddings.
I browsed the forum and the documentation and came up with the following steps:

  1. use my text corpus to train a fastText model and get a fastText .bin file
  2. extract the word vectors from the .bin file and save them in a txt format (each line contains a word followed by its vector, with space-separated values)
  3. add the word vectors to spaCy: python -m spacy init-model en /ft_vectors --vectors-loc EURLEX_ft_vectors.txt
  4. (not so sure about this) pretrain the model on the raw corpus: python -m spacy pretrain EURLEX_raw.jsonl ./ft_vectors ./pretrained_model --use-vectors. Does this make sense?
  5. train a NER model on annotated data: python -m prodigy train ner EURLEX_training_set ./ft_vectors --init-tok2vec ./pretrained_model/model499.bin --output ./model1
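
Step 2 could be sketched roughly as follows. This is a minimal sketch, assuming the official fasttext Python package is installed; the function names write_vectors_txt and export_fasttext_bin, the six-decimal formatting, and the optional header flag (for loaders that expect a leading "count dimensions" line) are illustrative choices, not part of the workflow above.

```python
from typing import Callable, Iterable, Sequence


def write_vectors_txt(
    words: Iterable[str],
    get_vector: Callable[[str], Sequence[float]],
    out_path: str,
    header: bool = False,
) -> int:
    """Write one "word v1 v2 ... vn" line per word; return the number of words written."""
    # Drop empty tokens and tokens containing whitespace: they would break
    # the space-separated layout.
    kept = [w for w in words if w and not any(ch.isspace() for ch in w)]
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        if header and kept:
            # Optional first line: number of vectors, then vector width.
            dim = len(get_vector(kept[0]))
            out.write(f"{len(kept)} {dim}\n")
        for word in kept:
            values = " ".join(f"{x:.6f}" for x in get_vector(word))
            out.write(f"{word} {values}\n")
    return len(kept)


def export_fasttext_bin(bin_path: str, out_path: str, header: bool = False) -> int:
    """Dump every vocabulary word of a fastText .bin model to a text file."""
    # Imported lazily so write_vectors_txt can be used without fastText installed.
    import fasttext

    model = fasttext.load_model(bin_path)
    return write_vectors_txt(
        model.get_words(), model.get_word_vector, out_path, header=header
    )
```

A call might look like export_fasttext_bin("EURLEX_ft.bin", "EURLEX_ft_vectors.txt"), where EURLEX_ft.bin is a hypothetical name for the model produced in step 1.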

Right now I am stuck at step 3: I get the following error when running spacy init-model.

(My setup: spaCy 2.3.5, Prodigy 1.10, Python 3.8, Windows 10)

Traceback (most recent call last):
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 113, in init_model
    add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 207, in add_vectors
    vectors_data, vector_keys = read_vectors(vectors_loc, truncate_vectors)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in read_vectors
    shape = tuple(int(size) for size in next(f).split())
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in <genexpr>
    shape = tuple(int(size) for size in next(f).split())
ValueError: invalid literal for int() with base 10: 'the'

It appears that spacy expects something different in the first line. Should I load the vectors in another format?
Any help on this is greatly appreciated!
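
To see where the message comes from, the failing expression from the last traceback frame can be run on its own. This is only an illustration: parse_first_line is a made-up name, and both sample lines (including the vector values) are invented for the demo.

```python
def parse_first_line(first_line: str) -> tuple:
    # Same expression as in the traceback: every whitespace-separated token
    # on the first line is converted to an integer.
    return tuple(int(size) for size in first_line.split())


# A first line made only of integers parses fine (values are hypothetical).
print(parse_first_line("1000 100"))  # (1000, 100)

# A first line that starts with a word, as in a plain "word v1 v2 ..." file,
# fails on the very first token.
try:
    parse_first_line("the 0.12 -0.05 0.33")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: 'the'
```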

Hi, a word2vec vectors file is expected to have a header line in the format:

#vectors #dim

So typically something like this (here, 500,000 vectors with 300 dimensions each):

500000 300
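
If the vectors file already exists without that line, one way to add it is a small script like the one below. It is a sketch under assumptions: add_word2vec_header is a hypothetical helper name, the input is assumed to be UTF-8 with one "word v1 v2 ... vn" entry per line, and words are assumed to contain no spaces.

```python
def add_word2vec_header(src_path: str, dst_path: str) -> tuple:
    """Copy src to dst, prepending a "<n_vectors> <n_dims>" header line."""
    n_vectors, n_dims = 0, None

    # First pass: count the vectors and check that every row has the same width.
    with open(src_path, encoding="utf-8") as src:
        for line_no, line in enumerate(src, start=1):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            width = len(parts) - 1  # the first token is the word itself
            if n_dims is None:
                n_dims = width
            elif width != n_dims:
                raise ValueError(
                    f"line {line_no}: expected {n_dims} values, found {width}"
                )
            n_vectors += 1

    if n_dims is None:
        raise ValueError("no vectors found in " + src_path)

    # Second pass: write the header, then copy the non-blank lines unchanged.
    with open(src_path, encoding="utf-8") as src, open(
        dst_path, "w", encoding="utf-8", newline="\n"
    ) as dst:
        dst.write(f"{n_vectors} {n_dims}\n")
        for line in src:
            if line.strip():
                dst.write(line.rstrip("\r\n") + "\n")

    return n_vectors, n_dims
```

The output file can then be passed to --vectors-loc in place of the original. Writing to a new file rather than editing in place keeps the original export untouched if the width check fails.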

Thank you Adriane! That seemed to be the problem, now everything works perfectly.

Can I ask one more question? (Both NER and spaCy are new to me, so I apologize if this is trivial.) Do you think the workflow I sketched above (steps 1-5) makes sense, particularly step 4, the pretraining?

Thanks again!

The pretraining step is optional and can help you boost accuracy, especially if you have a lot of raw data and your domain is pretty specific. It's not a must, though. Start by training a model without pretrained tok2vec weights and see what results you get. You can always experiment with pretraining later :slightly_smiling_face: