Loading fastText vectors into spaCy/Prodigy

Hi all,
I am trying to add fastText word vectors to spaCy so that I can save a model to use in a NER recipe. The embeddings are domain-specific (legislative text) and trained on a very large corpus, so I thought they would do a better job of helping NER than generic pretrained embeddings.
I browsed the forum and the documentation and I came up with the following steps:

  1. use my text corpus to train a fastText model and get a .bin fastText file
  2. extract the word vectors from the .bin and save them in a .txt format (each line contains a word followed by its vector, with space-separated values)
  3. add the word vectors to spaCy: python -m spacy init-model en ./ft_vectors --vectors-loc EURLEX_ft_vectors.txt
  4. (not so sure about this) pretrain the model on the raw corpus: python -m spacy pretrain EURLEX_raw.jsonl ./ft_vectors ./pretrained_model --use-vectors. Does this make sense?
  5. train an NER model on the annotated data: python -m prodigy train ner EURLEX_training_set ./ft_vectors --init-tok2vec ./pretrained_model/model499.bin --output ./model1
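
For reference, step 1 can be scripted with the official fasttext Python package; a minimal sketch (filenames are examples):

import fasttext

# Step 1 (sketch): train unsupervised fastText embeddings on the raw
# corpus. EURLEX_raw.txt is an example path, one document per line.
model = fasttext.train_unsupervised("EURLEX_raw.txt", model="skipgram", dim=300)
model.save_model("EURLEX_ft.bin")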

As of now I am stuck at step 3: I get the following error when running spacy init-model.

(My setup:
  • spaCy 2.3.5
  • Prodigy 1.10
  • Python 3.8
  • Windows 10)

Traceback (most recent call last):
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 113, in init_model
    add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 207, in add_vectors
    vectors_data, vector_keys = read_vectors(vectors_loc, truncate_vectors)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in read_vectors
    shape = tuple(int(size) for size in next(f).split())
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in <genexpr>
    shape = tuple(int(size) for size in next(f).split())
ValueError: invalid literal for int() with base 10: 'the'

It appears that spaCy expects something different in the first line. Should I save the vectors in a different format?
Any help on this is greatly appreciated!

Hi, a word2vec vectors file is expected to have a header line in the format:

#vectors #dim

So it often looks something like this (here, 500,000 300-dimensional vectors):

500000 300
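
If you're exporting the vectors from the fastText .bin yourself, here's a minimal sketch that writes this header, assuming the official fasttext Python package (filenames are examples):

import fasttext

# Sketch: export vectors from a fastText .bin in the word2vec-style
# .txt format that `spacy init-model` expects (header line first).
model = fasttext.load_model("EURLEX_ft.bin")
words = model.get_words()

with open("EURLEX_ft_vectors.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")  # "#vectors #dim" header
    for word in words:
        vector = model.get_word_vector(word)
        f.write(word + " " + " ".join(f"{v:.4f}" for v in vector) + "\n")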

Thank you Adriane! That was indeed the problem; now everything works perfectly.
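
For anyone else following along, a quick sanity check after init-model (a sketch; ./ft_vectors is my output path):

import spacy

# Load the model directory produced by init-model and confirm the
# static vectors made it in.
nlp = spacy.load("./ft_vectors")
print(nlp.vocab.vectors.shape)  # e.g. (500000, 300)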

Can I ask you one more question? (Both NER and spaCy are new to me; I apologize if this is trivial.) Do you think the workflow I sketched above (steps 1-5) makes sense, particularly step 4, the pretraining?

Thanks again!

The pretraining step is optional and can help boost accuracy, especially if you have a lot of raw data and your domain is quite specific. It's not a must, though. Start by training a model without pretrained tok2vec weights and see what results you get; you can always experiment with pretraining later 🙂


Hey! I'm trying to do something similar, but --use-vectors doesn't work for me.

Is there something I need to put in the config file to explicitly point to the static vectors for the spacy pretrain command?

I'm using

  • prodigy 1.11.7
  • spaCy 3.2.1
  • python 3.9.8

thanks!

I think I managed to get it going.

I set pretraining.objective as in the documentation:

[pretraining.objective]
@architectures = "spacy.PretrainVectors.v1"
maxout_pieces = 3
hidden_size = 300
loss = "cosine"

and then pointed to the raw data using --paths.raw_text and to the vectors using --paths.vectors.
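
Putting it together, the full command is along these lines (the config filename and paths are placeholders):

python -m spacy pretrain config.cfg ./pretrained_model --paths.raw_text raw_text.jsonl --paths.vectors ./my_vectors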

Would this be correct?

Thanks,

Darren

That sounds correct. Double-check that the vectors setting in the [initialize] block is ${paths.vectors} and not a hard-coded model name or path.

(Depending on the options, spacy init config or the quickstart may have put a hard-coded value in [initialize.vectors]. We'll try to make it more consistent for --paths.vectors in the future, but in the meanwhile double-check to be sure it's loading the vectors you want.)
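
For example, with the CLI override the block should resolve to something like this, rather than naming a packaged model (sketch):

[initialize]
vectors = ${paths.vectors}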


Hi Adriane,

Yes, the vectors were set in the config with ${paths.vectors}. I suppose I was just being over-cautious!

Thanks very much 🙂

Darren

Could I just check the correct input data format for spacy pretrain?

I had a JSONL file with each line formatted as:
{"text": "Here is an example document that could be much longer than this, containing more than one sentence."}
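
(A quick way to sanity-check such a file, sketched with srsly, the reader spaCy itself uses; the filename is a placeholder:)

import srsly

# Sketch: every record in the pretraining corpus should be a dict
# with a "text" key holding a string.
for record in srsly.read_jsonl("raw_text.jsonl"):
    assert isinstance(record.get("text"), str)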

I ask only because I've now tried training a new NER model using the pretrained tok2vec weights, and performance hasn't improved compared to before.

Thanks,

Darren

Yes, that looks correct. If you're not seeing any improvement at all, double-check that you've set init_tok2vec correctly in the config and that it's actually loading the weights. If it has improved, just not by much, that could also indicate that the pretrained tok2vec weights aren't that useful, or that you need to pretrain with more raw data.
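
The setting lives in the [initialize] block, along these lines (the path is an example; point it at the epoch file produced by spacy pretrain):

[initialize]
init_tok2vec = "pretrained_model/model499.bin"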
