Is there a way to easily load the vector models that Fasttext provides for 294 languages?
Yes! All you have to do is add the vectors to spaCy, save out the model and load it with Prodigy. We actually have an end-to-end example script for adding fastText vectors to spaCy – see this page for the code and more details.
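Roughly, the loading step looks like this (a condensed sketch, not the original script verbatim; it assumes spaCy v2's vectors API, and the file paths are placeholders):

```python
from pathlib import Path

import numpy
import spacy


def parse_vec_header(header_line):
    """Parse the first line of a .vec file: b'<nr_row> <nr_dim>'."""
    nr_row, nr_dim = header_line.split()  # fails if there's no 2-column header
    return int(nr_row), int(nr_dim)       # cast – split() returns byte strings


def load_vectors(nlp, vectors_loc):
    """Read a fastText .vec file and add each vector to the spaCy vocab."""
    with Path(vectors_loc).open('rb') as f:
        nr_row, nr_dim = parse_vec_header(f.readline())
        nlp.vocab.reset_vectors(width=nr_dim)
        for line in f:
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', nr_dim)  # nr_dim must be an int here
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)


if __name__ == '__main__':
    nlp = spacy.blank('nl')              # blank Dutch pipeline
    load_vectors(nlp, './wiki.nl.vec')   # the .vec file, not the .bin
    nlp.to_disk('/path/to/model')
```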
After adding the vectors, you can simply call nlp.to_disk('/path/to/model'), which saves the full model including the vectors to a directory. You can then load it into Prodigy using that path – for example:
prodigy terms.teach fruits_dataset /path/to/model --seeds apple,pear,banana
I happily admit that this is slightly out of my comfort zone, but this is what I get:
(myenv) REs-MacBook-Pro:~ redevries$ python vectors_fast_text.py ./wiki.nl.bin nl
Traceback (most recent call last):
  File "vectors_fast_text.py", line 43, in <module>
    plac.call(main)
  File "/Users/redevries/.virtualenvs/myenv/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/redevries/.virtualenvs/myenv/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "vectors_fast_text.py", line 28, in main
    nr_row, nr_dim = header.split()
ValueError: too many values to unpack (expected 2)
Interesting – you were definitely doing everything correctly. My theory is that the files distributed by fastText aren’t 100% consistent. We’ve tested the script on a few languages – but not all of the ~300 options.
The part the script fails on here is where it reads the first line of the file and assumes it’s a header consisting of the number of rows and the number of dimensions. Maybe the Dutch file doesn’t actually have a header row, or the header has an additional column, or something like that.
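That theory is easy to sanity-check in isolation – unpacking the first line into exactly two values only works if the file really starts with a two-column header (the example strings here are made up):

```python
# A proper header line unpacks cleanly into two values:
nr_row, nr_dim = "2000000 300".split()
print(nr_row, nr_dim)  # → 2000000 300

# But if the first line is already a vector row (or the header has an
# extra column), the same unpacking raises exactly the error above:
try:
    nr_row, nr_dim = "koning 0.0352 -0.1134 0.2005".split()
except ValueError as err:
    print(err)  # → too many values to unpack (expected 2)
```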
I’ll download the Dutch vectors and inspect them! Will report back with my findings.
Edit: Just re-read your example – could you try again using the .vec file instead of the .bin? Maybe the solution is actually much simpler.
Thanks for taking a look, Ines!
I had tried the .vec file, but that produced what seemed like an even more significant error!
(myenv) REs-MacBook-Pro:~ redevries$ python vectors_fast_text.py ./wiki.nl.vec nl
Traceback (most recent call last):
  File "vectors_fast_text.py", line 43, in <module>
    plac.call(main)
  File "/Users/redevries/.virtualenvs/myenv/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/redevries/.virtualenvs/myenv/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "vectors_fast_text.py", line 32, in main
    pieces = line.rsplit(' ', nr_dim)
TypeError: 'bytes' object cannot be interpreted as an integer
Just downloaded the Dutch vectors and tested it – turns out that using the .vec file is correct, but in this case, the number of dimensions the script extracted was a byte string, not an integer. So changing the following line solved it for me:
pieces = line.rsplit(' ', nr_dim)       # before
pieces = line.rsplit(' ', int(nr_dim))  # after – make sure nr_dim is an int!
Will also adjust this in the spaCy example.
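For anyone curious, the failure is easy to reproduce in isolation – rsplit's maxsplit argument must be an int, and splitting a header line that was read in binary mode leaves the pieces as bytes (the example line below is made up):

```python
line = "koning 0.1 0.2 0.3"
nr_dim = b"300"  # what you get from splitting a header read in binary mode

try:
    line.rsplit(' ', nr_dim)  # maxsplit must be an int, not bytes
except TypeError as err:
    print(err)  # → 'bytes' object cannot be interpreted as an integer

pieces = line.rsplit(' ', int(nr_dim))  # the fix: cast before splitting
print(pieces[0])  # → koning
```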
I am using my own fastText embeddings (100d) for entity detection. Your script works fine until I try to add pipes with nlp.add_pipe(nlp.create_pipe(pipe_name)) and save to disk – in which case I get the error:
*** TypeError: Required argument 'length' (pos 1) not found
If I try to save nlp without the pipes, I can – but it’s not the model I want. I have tried bringing over the tagger, parser and ner from en_core_web_lg and altering the meta.json file to include the appropriate pipes (the vectors attribute was already changed):
"pipeline":["sbd","tagger","parser","ner"],
"vectors":{
"width":100,
"keys":166939,
"vectors":199336
},
But I am getting an error from the NER because of the difference in embedding sizes (300 vs. 100).
Any help is appreciated!
Also, do we lose out on all the nice spaCy token attributes when we import our own embeddings?
Looking around some more, I found this thread: loading gensim word vecs.
I ended up loading en_core_web_sm instead of the blank English model, loading the vectors using the script provided above, and saving with nlp.to_disk(model_path).
Once I trained a model using this base, it was necessary to go into the ner/tagger/parser folders and change each cfg file so that pretrained_dims is set to 0.
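For reference, that last edit can also be done programmatically – a small sketch assuming spaCy v2's on-disk layout, where each pipe directory contains a JSON file named cfg (zero_pretrained_dims and model_dir are hypothetical names, not from this thread):

```python
import json
from pathlib import Path


def zero_pretrained_dims(model_dir):
    """Set pretrained_dims to 0 in each pipe's cfg file (spaCy v2 layout)."""
    for pipe in ('tagger', 'parser', 'ner'):
        cfg_path = Path(model_dir) / pipe / 'cfg'
        if not cfg_path.exists():
            continue  # this pipe isn't present in the saved model
        cfg = json.loads(cfg_path.read_text())
        cfg['pretrained_dims'] = 0
        cfg_path.write_text(json.dumps(cfg, indent=2))
```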