Custom Model Vocab Issues

I created a spaCy model by converting from a gensim word2vec model, but ran into this error: AttributeError: 'FunctionLayer' object has no attribute 'vectors'. I followed the workaround given in the GitHub issue
https://github.com/explosion/spaCy/issues/1727. I was able to load the spaCy model; however, when I call len(nlp.vocab), the length of the vocab and the total number of entries are very different.

I see 4 million words in strings.json, but len(nlp.vocab) is only 1 million. Is this a bug, or is my understanding incorrect?

spaCy hashes all strings and deals only with the 64-bit hashes. This keeps the data local, allows export to numpy arrays, and is generally good for performance.
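For example (a minimal sketch; the blank English pipeline and the word 'coffee' are just placeholders for the demo):

>>> import spacy
>>> nlp = spacy.blank('en')
>>> coffee_hash = nlp.vocab.strings.add('coffee')  # store the string, get back the 64-bit hash
>>> coffee_hash
3197928453018144401
>>> nlp.vocab.strings[coffee_hash]  # the mapping is reversible while the string is stored
'coffee'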

The Vocab needs to create a LexemeC struct for every entry, which stores a number of lexical features, e.g. the prefix, suffix, norm, probability, cluster ID, etc.
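From Python you can see some of these features on the Lexeme objects the Vocab hands back. A minimal sketch, again with a placeholder word:

>>> import spacy
>>> nlp = spacy.blank('en')
>>> lex = nlp.vocab['magnificent']  # creates the LexemeC entry on first access
>>> lex.prefix_, lex.suffix_, lex.is_alpha
('m', 'ent', True)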

So, it's fairly normal to have many more string-store entries than vocabulary entries. Example:

>>> from spacy.vocab import Vocab
>>> vocab = Vocab()
>>> print(len(vocab), len(vocab.strings))
0 1
>>> word = vocab['magnificent']
>>> print(len(vocab), len(vocab.strings))
1 2
>>> word.prefix_ = 'mag'
>>> print(len(vocab), len(vocab.strings))
1 3

When we set the prefix feature, spaCy hashes the new string, so we end up with another entry in the string store, but not another entry in the vocab.

spaCy's vectors table also supports having vectors for strings that aren't necessarily in the vocabulary. In fact, you don't even need a distinct row for every term you map to a vector: you can map two keys to the same row to save space. For instance, you can have both the lower-case and upper-case forms of a word map to the same vector, as in the sketch below.
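Here's a minimal sketch of that key-to-row sharing, using the standalone Vectors and StringStore classes; the words and the tiny 4-dimensional table are made up for the demo:

>>> import numpy
>>> from spacy.strings import StringStore
>>> from spacy.vectors import Vectors
>>> strings = StringStore(['hello', 'Hello'])
>>> vectors = Vectors(shape=(1, 4))  # one row, to be shared by both casings
>>> row = vectors.add('hello', vector=numpy.asarray([0.1, 0.2, 0.3, 0.4], dtype='f'))
>>> vectors.add('Hello', row=row)  # second key, same row: no extra storage
0
>>> numpy.array_equal(vectors[strings['hello']], vectors[strings['Hello']])
True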