Loading fastText vectors into spaCy/Prodigy

Hi all,
I am trying to add fastText word vectors to spaCy so that I can save a model to use in a NER recipe. The embeddings are domain-specific (legislative text) and trained on a very large corpus, so I thought they would do a better job of helping NER than generic pretrained embeddings.
I browsed the forum and the documentation and I came up with the following steps:

  1. use my text corpus to train a fastText model and get a .bin fastText file
  2. extract the word vectors from the .bin and save them in a .txt format (each line contains a word followed by its vector, with space-separated values)
  3. add the word vectors to spaCy: python -m spacy init-model en ./ft_vectors --vectors-loc EURLEX_ft_vectors.txt
  4. (not so sure about this) pretrain the model on the raw corpus: python -m spacy pretrain EURLEX_raw.jsonl ./ft_vectors ./pretrained_model --use-vectors. Does this make sense?
  5. train an NER model on the annotated data: python -m prodigy train ner EURLEX_training_set ./ft_vectors --init-tok2vec ./pretrained_model/model499.bin --output ./model1
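
For reference, step 1 can be scripted with the official fasttext Python package; a minimal sketch (filenames are examples):

import fasttext

# Step 1 (sketch): train unsupervised fastText embeddings on the raw
# corpus. EURLEX_raw.txt is an example path, one document per line.
model = fasttext.train_unsupervised("EURLEX_raw.txt", model="skipgram", dim=300)
model.save_model("EURLEX_ft.bin")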

As of now I am stuck at step 3: I get the following error when running spacy init-model.

(My setup:
  • spaCy 2.3.5
  • Prodigy 1.10
  • Python 3.8
  • Windows 10)

Traceback (most recent call last):
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 113, in init_model
    add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 207, in add_vectors
    vectors_data, vector_keys = read_vectors(vectors_loc, truncate_vectors)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in read_vectors
    shape = tuple(int(size) for size in next(f).split())
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in <genexpr>
    shape = tuple(int(size) for size in next(f).split())
ValueError: invalid literal for int() with base 10: 'the'

It appears that spaCy expects something different in the first line. Should I save the vectors in a different format?
Any help on this is greatly appreciated!

Hi, a word2vec vectors file is expected to have a header line in the format:

#vectors #dim

So it often looks something like this (here, 500,000 300-dimensional vectors):

500000 300
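
If you're exporting the vectors from the fastText .bin yourself, here's a minimal sketch that writes this header, assuming the official fasttext Python package (filenames are examples):

import fasttext

# Sketch: export vectors from a fastText .bin in the word2vec-style
# .txt format that `spacy init-model` expects (header line first).
model = fasttext.load_model("EURLEX_ft.bin")
words = model.get_words()

with open("EURLEX_ft_vectors.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")  # "#vectors #dim" header
    for word in words:
        vector = model.get_word_vector(word)
        f.write(word + " " + " ".join(f"{v:.4f}" for v in vector) + "\n")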

Thank you Adriane! That was indeed the problem; now everything works perfectly.
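
For anyone else following along, a quick sanity check after init-model (a sketch; ./ft_vectors is my output path):

import spacy

# Load the model directory produced by init-model and confirm the
# static vectors made it in.
nlp = spacy.load("./ft_vectors")
print(nlp.vocab.vectors.shape)  # e.g. (500000, 300)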

Can I ask you one more question? (Both NER and spaCy are new to me; I apologize if this is trivial.) Do you think the workflow I sketched above (steps 1-5) makes sense, particularly step 4, the pretraining?

Thanks again!

The pretraining step is optional and can help boost accuracy, especially if you have a lot of raw data and your domain is quite specific. It's not a must, though. Start by training a model without pretrained tok2vec weights and see what results you get; you can always experiment with pretraining later 🙂


Hey! I'm trying to do something similar, but --use-vectors doesn't work for me.

Is there something I need to put in the config file to explicitly point to the static vectors for the spacy pretrain command?

I'm using

  • prodigy 1.11.7
  • spaCy 3.2.1
  • python 3.9.8

thanks!

I think I managed to get it going.

I set pretraining.objective as in the documentation:

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"

and then pointed to the raw data using --paths.raw_text and to the vectors using --paths.vectors.
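
Putting it together, the full command is along these lines (the config filename and paths are placeholders):

python -m spacy pretrain config.cfg ./pretrained_model --paths.raw_text raw_text.jsonl --paths.vectors ./my_vectors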

Would this be correct?

Thanks,

Darren

That sounds correct. Double-check that the vectors setting in the [initialize] block is ${paths.vectors} and not a hard-coded model name or path.

(Depending on the options, spacy init config or the quickstart may have put a hard-coded value in [initialize.vectors]. We'll try to make it more consistent for --paths.vectors in the future, but in the meanwhile double-check to be sure it's loading the vectors you want.)
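
For example, with the CLI override the block should resolve to something like this, rather than naming a packaged model (sketch):

[initialize]
vectors = ${paths.vectors}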


Hi Adriane,

Yes, the vectors were set in the config with ${paths.vectors}. I suppose I was just being over-cautious!

Thanks very much 🙂

Darren

Could I just check the correct input data format for spacy pretrain?

I had a JSONL file with each line formatted as:
{"text": "Here is an example document that could be much longer than this, containing more than one sentence."}
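
(A quick way to sanity-check such a file, sketched with srsly, the reader spaCy itself uses; the filename is a placeholder:)

import srsly

# Sketch: every record in the pretraining corpus should be a dict
# with a "text" key holding a string.
for record in srsly.read_jsonl("raw_text.jsonl"):
    assert isinstance(record.get("text"), str)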

I ask only because I've now tried training a new NER model using the pretrained tok2vec weights, and performance hasn't improved compared to before.

Thanks,

Darren

Yes, that looks correct. If you're not seeing any improvement at all, double-check that you've set init_tok2vec correctly in the config and that it's actually loading the weights. If it has improved, just not by much, that could also indicate that the pretrained tok2vec weights aren't that useful, or that you need to pretrain with more raw data.
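
The setting lives in the [initialize] block, along these lines (the path is an example; point it at the epoch file produced by spacy pretrain):

[initialize]
init_tok2vec = "pretrained_model/model499.bin"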
