Web UI for pre-trained Chinese vectors

Hello,

I have been trying to perform text classification in Chinese to label my data. I tried using my own word vectors trained with Gensim as well as pre-trained vectors from fastText, but in both cases, when I use terms.teach with my seed terms, the web browser only shows me English words and not Chinese. I don't understand this, since my word vectors contain Chinese words and I am passing Chinese words as my seed terms to bootstrap. For both word vector files, I did the following to make them compatible with Prodigy:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm", vectors=False)

with open("wiki.zh.text.simplified_jieba_seg_cbow_w8_mc3.txt", "rb") as file_:
    header = file_.readline()
    nr_row, nr_dim = header.split()
    # resize the vectors table to the width given in the file header
    nlp.vocab.reset_vectors(width=int(nr_dim))
    for line in file_:
        # line = line.decode("utf8").strip()
        line = line.rstrip().decode("utf8")
        pieces = line.rsplit(" ", int(nr_dim))
        word = pieces[0]
        vector = np.asarray([float(v) for v in pieces[1:]], dtype="f")
        # add the vector to the vocab under the given word
        nlp.vocab.set_vector(word, vector)

nlp.to_disk("my_vectors")

This runs without errors, but when I try to bootstrap from the seed terms, the UI only shows English words. How can I resolve this?

Thanks

One thing that might be problematic here:

nlp = spacy.load("en_core_web_sm", vectors=False)

You’re starting off with the English model, but adding Chinese vectors. To be safe, you probably want to use a blank Chinese model instead:

nlp = spacy.blank('zh')

Also, can you double-check your command and make sure you’re definitely loading in the correct my_vectors model? And what happens if you try it manually?

Prodigy’s terms.teach doesn’t do any magic here – it just iterates over the model’s vocabulary, looks at the similarities and builds up the target vector. So what do you see if you look at the model’s vocab in spaCy (or by inspecting the files in your model directory)?
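
For example, a quick way to inspect the converted model (assuming it was saved to a directory called my_vectors, as in the script above) would be something like:

import spacy

nlp = spacy.load("my_vectors")
# print a few vocabulary entries to see whether they are Chinese or English
for lex in list(nlp.vocab)[:20]:
    print(lex.text, lex.has_vector)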

Thanks for your response. My word vectors are intact after making them compatible with Prodigy and loading them via spaCy, so my word vector model seems fine. One other thing: my word vectors consist mostly of Chinese words plus a few English words. Do you think that's fine?
I’ll try loading the blank zh model and get back to you.
Thanks

Hey,
I tried using the 'zh' model, but I get an error when I run the following command:

prodigy terms.teach chinese_terms fastext_vectors_1 --seed '好,惊人,大'

The error I'm getting is this:

/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192, got 176
return f(*args, **kwds)
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192, got 176
return f(*args, **kwds)
Initialising with 3 seed terms: 好, 大, 惊人
/home/prodigy/pgy-env/bin/prodigy: line 1: 20829 Segmentation fault (core dumped) python -m prodigy “$@”

The weird part is that I'm getting the same error with my old word vectors based on the 'sm' model, and they were working fine until yesterday. Can you help me resolve this error?

Thanks

That definitely indicates some problem. Could you run pip list and check which versions of spaCy, Thinc and Prodigy you're running?

That should be fine. Can you double-check that those words were also added to the model's vocabulary? See this post for more details:
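
For instance, a minimal check along these lines (assuming the vectors model is the fastext_vectors_1 directory from your command) should show whether each seed term has a vector:

import spacy

nlp = spacy.load("fastext_vectors_1")
# look up each seed term and check whether a vector is attached to it
for word in ["好", "惊人", "大"]:
    lex = nlp.vocab[word]
    print(word, lex.has_vector, lex.vector_norm)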

So here are the versions:

prodigy (1.5.1)
spacy (2.0.12)
thinc (6.10.3)

Do I need to update any of these?
Yes, and the Chinese words do give me word vectors via spaCy after I save the modified model to disk, so my word vector arrays are intact.

Thanks

Another thing I discovered: my word vectors file originally has 1167947 vectors, but after using the code in my first message to make it compatible with Prodigy, len(nlp.vocab) returns just 37. So out of 1167947 word vectors, only 37 words ended up in my vocabulary, which makes me question my code, but I can't see the mistake. Can you help me figure it out?
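
For reference, this is roughly how I'm checking it (my_vectors is the directory saved by the script above), also printing the size of the vectors table for comparison:

import spacy

nlp = spacy.load("my_vectors")
print(len(nlp.vocab))           # lexemes currently in the vocabulary
print(len(nlp.vocab.vectors))   # entries in the vectors table
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)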