StringStore exception

@honnibal, @ines

Guys, I wanted to convert a spaCy model to a gensim model, and I'm encountering issues while iterating over the keys. Is this a bug?

Any idea what could be the issue?

```python
import spacy

nlp = spacy.load("en_core_web_lg")

# print(nlp.vocab.vectors.data)
# print(nlp.vocab.strings)
row, dim = nlp.vocab.vectors.shape

fn = open("/Users/philips/Development/BigData/RS/annotation/gensim_model.txt", "w")

# Header line: number of rows and vector dimension
fn.write(str(row))
fn.write(" ")
fn.write(str(dim))
fn.write('\n')

row = int(row)

# nr_row, dim
# word vectors

words = nlp.vocab.strings

# print(nlp.vocab.strings[4183861688597294412])

# count = 0
# for key, vector in nlp.vocab.vectors.items():
#     word = nlp.vocab.strings[key]
#     stringval = ""
#     for item in vector:
#         stringval = str(item) + " " + stringval
#     stringval = stringval.strip()
#     fn.write(word + " " + stringval)
#     fn.write('\n')
#     row = row - 1
#     count = count + 1
#     print(row)
#
# fn.close()
```
Error:

```
Traceback (most recent call last):
  File "strings.pyx", line 118, in spacy.strings.StringStore.__getitem__
KeyError: 4183861688597294412
```

@ines @honnibal

I was planning to create a word2vec visualization in TensorBoard and wanted to export the matrix in TensorBoard format. While doing this, I hit this exception when going through the normal en_core_web_lg. Any thoughts on what could be going on?

Where does the error actually occur? Is it in the bits you’ve commented out?

The error indicates there are vector entries for words that aren't in the strings table. You can prevent the key error by adding an `if key in nlp.vocab.strings` check in your loop.
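A minimal sketch of that guard, with `export_vectors` as a hypothetical helper name and plain dicts standing in for `nlp.vocab.vectors` and `nlp.vocab.strings`:

```python
def export_vectors(vectors, strings):
    """Return "word v1 v2 ..." lines, skipping keys missing from the strings table."""
    lines = []
    for key, vector in vectors.items():
        if key not in strings:  # the guard that avoids the KeyError
            continue
        word = strings[key]
        lines.append(word + " " + " ".join(str(v) for v in vector))
    return lines
```

With spaCy this would be called along the lines of `export_vectors(dict(nlp.vocab.vectors.items()), nlp.vocab.strings)`, since both the vectors table and the `StringStore` support key lookup and `in`.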

Yes.

> The error indicates there are vector entries for words that aren't in the strings table. You can prevent the key error by adding an `if key in nlp.vocab.strings` check in your loop.

This is directly from en_core_web_lg, so I'd guess we should have all the entries? Am I missing something, or is this outcome expected?

In general the vectors aren’t limited to only the strings in the stringstore. I think I might have used different frequency thresholds for the two, which could be improved. You might want to count how many strings are missing.
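One way to count them, sketched with a hypothetical `count_missing` helper (plain dicts again standing in for the spaCy objects):

```python
def count_missing(vectors, strings):
    # Count vector keys that have no entry in the strings table.
    return sum(1 for key in vectors if key not in strings)
```

Against the loaded model this would be something like `count_missing(dict(nlp.vocab.vectors.items()), nlp.vocab.strings)`, which tells you how many rows of the vectors table can't be mapped back to a word.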