Custom Model Vocab Issues

I created a spaCy model by converting from a gensim word2vec model, but ran into this error: AttributeError: 'FunctionLayer' object has no attribute 'vectors'. I followed the workaround given in the GitHub issue
https://github.com/explosion/spaCy/issues/1727. I was able to load the spaCy model; however, when I call len(nlp.vocab), the length of the vocab and the total number of entries are very different.

I see 4 million words in strings.json, but len(nlp.vocab) is only 1 million. Is this a bug, or is my understanding incorrect?

spaCy hashes all strings and deals only with the 64-bit hashes. This keeps the data local, allows export to numpy arrays, and is generally good for performance.
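For example (a minimal sketch; the blank English pipeline and the word 'coffee' are just placeholders for the demo):

>>> import spacy
>>> nlp = spacy.blank('en')
>>> coffee_hash = nlp.vocab.strings.add('coffee')  # store the string, get back the 64-bit hash
>>> coffee_hash
3197928453018144401
>>> nlp.vocab.strings[coffee_hash]  # the mapping is reversible while the string is stored
'coffee'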

The Vocab needs to create a LexemeC struct for every entry, which stores a number of lexical features, e.g. the prefix, suffix, norm, probability, cluster ID, etc.
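From Python you can see some of these features on the Lexeme objects the Vocab hands back. A minimal sketch, again with a placeholder word:

>>> import spacy
>>> nlp = spacy.blank('en')
>>> lex = nlp.vocab['magnificent']  # creates the LexemeC entry on first access
>>> lex.prefix_, lex.suffix_, lex.is_alpha
('m', 'ent', True)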

So, it's fairly normal to have many more string-store entries than vocabulary entries. Example:

>>> from spacy.vocab import Vocab
>>> vocab = Vocab()
>>> print(len(vocab), len(vocab.strings))
0 1
>>> word = vocab['magnificent']
>>> print(len(vocab), len(vocab.strings))
1 2
>>> word.prefix_ = 'mag'
>>> print(len(vocab), len(vocab.strings))
1 3

When we set the prefix feature, spaCy hashes the new string, so we end up with another entry in the string store, but not another entry in the vocab.

spaCy's vectors table also supports having vectors for strings that aren't necessarily in the vocabulary. In fact, you don't even need a distinct row for every term you map to a vector: you can map two keys to the same row to save space. For instance, you can have both the lower-case and upper-case forms of a word map to the same vector, as in the sketch below.
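Here's a minimal sketch of that key-to-row sharing, using the standalone Vectors and StringStore classes; the words and the tiny 4-dimensional table are made up for the demo:

>>> import numpy
>>> from spacy.strings import StringStore
>>> from spacy.vectors import Vectors
>>> strings = StringStore(['hello', 'Hello'])
>>> vectors = Vectors(shape=(1, 4))  # one row, to be shared by both casings
>>> row = vectors.add('hello', vector=numpy.asarray([0.1, 0.2, 0.3, 0.4], dtype='f'))
>>> vectors.add('Hello', row=row)  # second key, same row: no extra storage
0
>>> numpy.array_equal(vectors[strings['hello']], vectors[strings['Hello']])
True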