Matthew, thank you for such a quick response; it was indeed helpful. Please see below:
> check that the similarities in your vocab match what you expect
Without performing an exhaustive check, those that I did check did not match. I went for really stark examples in the desired context and they failed completely.
> have a look at the words’ vectors, to see that they have the correct values
They do not, as far as a straight comparison between the data in `vectors.txt` and what I get from `nlp.vocab.get_vector` for the same word can tell. This is standard GloVe output, which was copied to a separate directory as
Is there anything I can do about this? If not, I would rather not fiddle around with the binary file too much. If all else fails, could I simply load the text version and import the vectors manually, one by one, into the dictionary?
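In case manual loading turns out to be the way to go, here is a minimal sketch of what I have in mind. It assumes the standard text-format GloVe layout (`word v1 v2 ... vN`, space-separated, one word per line); `vectors.txt` is the file mentioned above, and the spaCy part uses `Vocab.set_vector`, which both registers the word and stores its row:

```python
import numpy


def read_glove_text(path):
    """Yield (word, vector) pairs from a text-format GloVe file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            pieces = line.rstrip().split(" ")
            # First token is the word, the rest are the vector components.
            yield pieces[0], numpy.asarray(pieces[1:], dtype="float32")


# Hypothetical usage with spaCy (not verified against my setup):
# import spacy
# nlp = spacy.load("en_core_web_md")
# for word, vector in read_glove_text("vectors.txt"):
#     nlp.vocab.set_vector(word, vector)
```

This sidesteps the binary loader entirely, at the cost of being slower for large vocabularies.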
> check the size of the vocabulary with
Indeed, I have more words than vectors, by a factor of roughly 62. This might become a problem later on, but I would first fix the “data doesn’t load properly” issue and then try to improve performance, possibly by trimming down the existing dataset.
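For reference, this is how I arrived at the ~62 figure (the helper is mine; the commented spaCy calls use the v2 `Vocab`/`Vectors` attributes, where `len(nlp.vocab)` counts lexemes and `nlp.vocab.vectors.shape` gives rows and dimensions):

```python
def coverage_ratio(n_words, n_vectors):
    """Words per stored vector row; 1.0 would mean full coverage."""
    return n_words / n_vectors if n_vectors else float("inf")


# Hypothetical spaCy usage (not run here):
# import spacy
# nlp = spacy.load("en_core_web_md")
# n_vectors, n_dims = nlp.vocab.vectors.shape
# print(coverage_ratio(len(nlp.vocab), n_vectors))
```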
Regarding this, the documentation mentions that:
> If your instance of `Language` already contains vectors, they will be overwritten.
Just to confirm: this doesn’t seem to be a complete overwrite, but more like an “update”. If `from_glove` returns words that are already in the current vocabulary, these are updated; new ones are created; everything else remains untouched. Is that right? If so, I would try to reduce the terms that are already there from the dataset I am trying to extend (`en_core_web_md`). Any better alternatives?
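One alternative I am considering for the trimming step: spaCy v2 exposes `nlp.vocab.prune_vectors(n)`, which keeps the `n` most frequent vector rows and remaps every other word to its most similar surviving vector. As I understand it, the remapping works roughly like this numpy sketch (my own illustration of the idea, assuming rows are sorted by frequency, not spaCy's actual implementation):

```python
import numpy


def prune_rows(vectors, keep):
    """Keep the first `keep` rows (assumed frequency-sorted) and remap
    each dropped row to its most similar kept row by cosine similarity."""
    kept = vectors[:keep]
    norms = numpy.linalg.norm(kept, axis=1, keepdims=True)
    unit_kept = kept / numpy.where(norms == 0, 1, norms)
    remap = {}
    for i in range(keep, len(vectors)):
        v = vectors[i]
        n = numpy.linalg.norm(v)
        sims = unit_kept @ (v / (n if n else 1))
        j = int(sims.argmax())
        remap[i] = (j, float(sims[j]))  # (surviving row, similarity)
    return kept, remap
```

With the real API this would just be something like `remap = nlp.vocab.prune_vectors(20000)`, which should also help with the words-to-vectors imbalance above.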