StringStore exception

madhujahagirdar · March 7, 2018, 10:49pm

Guys, I wanted to convert a spaCy model to a gensim model, and I am encountering issues while iterating over the keys. Is it a bug?

Any idea what could be the issue?

import spacy


nlp = spacy.load("en_core_web_lg")

#print(nlp.vocab.vectors.data)
#print(nlp.vocab.strings)
row, dim = nlp.vocab.vectors.shape

fn = open("/Users/philips/Development/BigData/RS/annotation/gensim_model.txt","w")

fn.write(str(row))
fn.write(" ")
fn.write(str(dim))
fn.write('\n')

row = int(row)

#nr_row , dim
#word vectors

words = nlp.vocab.strings

#print(nlp.vocab.strings[4183861688597294412])

# count = 0
# for key, vector in nlp.vocab.vectors.items():
#     word = nlp.vocab.strings[key]
#     stringval = ""
#     for item in vector:
#         stringval =  str(item) + " " + stringval
#     stringval.strip();
#     fn.write(word +" "+stringval);
#     fn.write('\n');
#     row = row - 1;
#     count = count + 1;
#     print(row)
#
# fn.close()

Error:
Traceback (most recent call last):

  File "strings.pyx", line 118, in spacy.strings.StringStore.__getitem__
KeyError: 4183861688597294412

madhujahagirdar · March 18, 2018, 4:00pm

@ines @honnibal

I was planning to create a word2vec visualization in TensorBoard and wanted to export the matrix in TensorBoard format. While doing this, I hit this exception when going through the stock en_core_web_lg. Any thoughts on what could be going on?

honnibal · March 19, 2018, 2:03pm

Where does the error actually occur? Is it in the bits you’ve commented out?

The error indicates there are vector entries for words that aren’t in the strings table. You can prevent the KeyError by adding an if key in nlp.vocab.strings check in your loop.
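
A minimal sketch of that check, for illustration only. The helper name write_word2vec_text is hypothetical, and the plain dict and list in the demo are stand-ins for nlp.vocab.strings and nlp.vocab.vectors.items(), so the snippet runs without a model. It skips every key that has no string and writes the header with the number of rows actually kept, since that is the count a word2vec text file should declare.

```python
import io


def write_word2vec_text(vector_items, strings, out):
    """Write (key, vector) pairs as word2vec text, skipping keys with no string.

    vector_items: iterable of (key, vector) pairs
    strings: any object supporting `key in strings` and `strings[key]`
    out: a writable text file object
    Returns the number of rows written.
    """
    # Keep only the entries whose key can be resolved to a string.
    kept = [(strings[key], vector) for key, vector in vector_items if key in strings]
    dim = len(kept[0][1]) if kept else 0

    # Header: number of rows actually written, then the vector width.
    out.write(f"{len(kept)} {dim}\n")
    for word, vector in kept:
        out.write(word + " " + " ".join(str(x) for x in vector) + "\n")
    return len(kept)


# Stand-in data (assumption): key 3 has a vector but no string entry.
demo_strings = {1: "apple", 2: "pear"}
demo_vectors = [(1, [0.1, 0.2]), (2, [0.3, 0.4]), (3, [0.5, 0.6])]

buffer = io.StringIO()
written = write_word2vec_text(demo_vectors, demo_strings, buffer)
print(written)
print(buffer.getvalue())
```

With the real model, the call would presumably look like write_word2vec_text(nlp.vocab.vectors.items(), nlp.vocab.strings, fn), using the names from the script above.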

madhujahagirdar · March 19, 2018, 2:04pm

Yes

madhujahagirdar · March 19, 2018, 2:08pm

This is directly from en_core_web_lg, so I guess we should have all the entries? Am I missing something, or is this outcome expected?

honnibal · March 19, 2018, 3:04pm

In general the vectors aren’t limited to only the strings in the stringstore. I think I might have used different frequency thresholds for the two, which could be improved. You might want to count how many strings are missing.
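
One way to do that count, as a rough sketch. The function name count_missing_strings is hypothetical, and the demo uses a plain list and dict as stand-ins for nlp.vocab.vectors.keys() and nlp.vocab.strings so it runs without a model.

```python
def count_missing_strings(vector_keys, strings, max_examples=10):
    """Count vector keys that have no entry in the strings table.

    vector_keys: iterable of keys
    strings: any object supporting `key in strings`
    Returns (number_missing, first_few_missing_keys).
    """
    missing = [key for key in vector_keys if key not in strings]
    return len(missing), missing[:max_examples]


# Stand-in data (assumption): keys 2 and 4 have vectors but no strings.
demo_keys = [1, 2, 3, 4]
demo_strings = {1: "apple", 3: "pear"}

total_missing, examples = count_missing_strings(demo_keys, demo_strings)
print(total_missing, examples)
```

Against the real model this would presumably be called as count_missing_strings(nlp.vocab.vectors.keys(), nlp.vocab.strings). Comparing the result with the row count from nlp.vocab.vectors.shape shows how much of the table would be dropped by the membership check.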
