Prodigy doesn't "converge" fast to initial word seeds

I am working on a use case that is very similar to the insults classifier. However, I had to augment the en_core_web_md model to include terminology that was missing from its vocabulary.

To do this, I ran GloVe on an extract of approximately 140k documents containing the terminology I am after.

I followed the information in spaCy’s documentation to create a new “Vectors only” package, which installs and loads successfully.

Now, when I move over to Prodigy, I supply a set of seed terms, some of which I know exist in my corpus. I have confirmed this both by examining vectors.txt (the word is there) and from “within the module”, by importing it and then calling get_vector on the language’s vocabulary.

However, Prodigy doesn’t even start from somewhere near those terms. I have inserted a large number of "don’t care"s so far but it still doesn’t get near the sort of topic I need to be tagging.

Do you have any ideas on how to improve this?

  1. Could it be that I have omitted something during compilation and packaging of the new vocabulary?
  2. Should I completely substitute the previous vocabulary?
  3. I noticed that the Language class encodes a large amount of language-specific code. In my case, the only thing that I modify is the vocabulary of an existing model. My assumption is that the rest of the information is not wiped out (?). Is that correct?

Have you verified that the vectors you learned look good otherwise? Training word vectors can be a bit of a lottery sometimes, because there’s no really convincing way to run an automated evaluation.

You should have a look at the nearest neighbours of your seed terms, and then check that the similarities in your vocab match what you expect:

seed1 = nlp.vocab[u'apple']
seed2 = nlp.vocab[u'orange']
print(seed1.similarity(seed2))

If the similarities in spaCy don’t match up with what you expect, have a look at the words’ vectors, to see that they have the correct values. Next, also check the size of the vocabulary with len(nlp.vocab). If you have many more words than you have vectors, I’d say that’s the problem.
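To separate "the vectors are bad" from "the vectors were loaded badly", it can help to run the same nearest-neighbour check directly on the GloVe text output, outside spaCy. The following is only a minimal sketch using the standard library; the function names are made up for illustration, and it assumes the usual GloVe text layout of one word per line followed by space-separated values.

```python
import math


def read_glove_text(lines):
    """Parse GloVe-style text lines ("word v1 v2 ...") into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(vectors, word, n=10):
    """Return the n words most similar to `word`, as (word, similarity) pairs."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [(other, score) for score, other in scored[:n]]


# Hypothetical usage against the GloVe text output:
# with open("vectors.txt", encoding="utf8") as f:
#     vectors = read_glove_text(f)
# print(nearest_neighbours(vectors, "apple"))
```

If the neighbours look sensible here but not inside spaCy, the problem is more likely in the loading step than in the training.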

Matthew, thank you for such a quick response. It was indeed helpful; please see below:

check that the similarities in your vocab match what you expect:

Without performing an exhaustive check, those that I did check did not match. I went for really stark examples in the desired context and they failed completely.

have a look at the words’ vectors, to see that they have the correct values.

They do not, at least as far as a straight comparison between the data in vectors.txt and what I get from nlp.vocab.get_vector for the same word can tell. This is standard GloVe output, which was copied to a separate directory as vectors.300.d.bin.

Is there anything I can do about this? I would rather not fiddle around with the binary file too much. If all else fails, could I simply load the text version and import the vectors manually, one by one, into the vocabulary?
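For reference, the manual route could look roughly like the sketch below. It is an illustration rather than a tested recipe: the helper names are hypothetical, the only spaCy call relied on is nlp.vocab.set_vector(), and it assumes the GloVe vectors have the same width as the vectors already in the model.

```python
def iter_glove_text(path):
    """Yield (word, [floats]) pairs from a GloVe-style text file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            yield parts[0], [float(x) for x in parts[1:]]


def import_text_vectors(vocab, path, to_array):
    """Call vocab.set_vector(word, vector) for every row in the text file.

    `to_array` converts the list of floats into whatever the vocab expects
    (for spaCy, a numpy array). Returns the number of vectors that were set.
    """
    count = 0
    for word, values in iter_glove_text(path):
        vocab.set_vector(word, to_array(values))
        count += 1
    return count


# Hypothetical usage with spaCy and numpy installed:
# import numpy
# import spacy
# nlp = spacy.load("en_core_web_md")
# import_text_vectors(
#     nlp.vocab,
#     "vectors.txt",
#     lambda values: numpy.asarray(values, dtype="float32"),
# )
```

Going through the text file sidesteps any question about the precision of the binary file, at the cost of a slower load.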

check the size of the vocabulary with len(nlp.vocab)

Indeed, I have more words than vectors by a factor of ~62. This might become a problem later on, but I would first like to fix the “data doesn’t load properly” issue and then try to improve performance, possibly by trimming down the existing dataset.

Regarding this, the documentation mentions that:

If your instance of Language already contains vectors, they will be overwritten.

Just to confirm, this doesn’t seem to be a complete overwrite (?). It is more like an “update”: if from_glove returns words that are in the current vocabulary, these are updated, new ones are created, and everything else remains untouched (?). If that is the case, I would try to reduce the number of terms carried over from the model that I am trying to extend (en_core_web_md). Any better alternatives?

GloVe outputs in either double or single precision, with the d or f in the filename indicating which. I thought I had logic to check this, but maybe it’s not working. Either way, that seems like the answer!

Thank you, I will try this shortly. Here is where I think the problem is at the moment, as I am also checking the result of each step:

  1. The vector is loaded “flat” and I followed your recommendation from here to reshape it to what it is supposed to be. (This has been incorporated in the latest version…but…)

  2. returns float32 when it should be float64 (the default output of GloVe according to this note) (?).

  3. According to (github dot com slash) explosion/spaCy/blob/master/spacy/vectors.pyx#L311 (sorry, it seems I can’t post two links in a post just yet), everything seems to be cast to float32 if the dtype is not float32 (?).

  4. len( returns exactly my file size (in bytes) divided by 8 which suggests to me that it has read the file in as float64.

  5. The other thing that I noticed is that the width passed to reshape has to be N+1, where N is the vector size parameter of GloVe. Otherwise, reshape fails to fit the data in a “frame”. Alternatively, this may be a byproduct of the above data type mismatch.

  6. I also tried to rename the file with an .f. instead of a .d. but that was giving errors at the point of loading the model from the disk.

From all of this, I am more inclined to believe that GloVe outputs float64 indeed and this could possibly be taken into account in from_glove (?).
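The checks in points 2 to 5 can be reproduced without spaCy. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, it uses the standard library array module rather than numpy, and it assumes the file is a headerless flat dump of floats in native byte order. The N+1 width comes from the observation above, not from anything verified about the GloVe format here.

```python
from array import array


def read_flat(path, typecode):
    """Read a whole binary file as a flat array.

    typecode "f" interprets the bytes as float32, "d" as float64.
    """
    values = array(typecode)
    with open(path, "rb") as f:
        values.frombytes(f.read())
    return values


def reshape_rows(values, width):
    """Split a flat array into rows of `width` values.

    Raises ValueError when the data does not divide evenly, which is the
    same symptom as a reshape call failing to fit the data.
    """
    if len(values) % width != 0:
        raise ValueError(
            "%d values cannot be split into rows of %d" % (len(values), width)
        )
    return [values[i:i + width] for i in range(0, len(values), width)]


# Hypothetical usage, with N = 300 as in vectors.300.d.bin:
# as_double = read_flat("vectors.300.d.bin", "d")
# rows = reshape_rows(as_double, 300 + 1)
# print(len(rows), list(rows[0][:5]))
```

Reading the same file with "f" instead of "d" returns twice as many values, none of which match vectors.txt, so comparing the first row against the text file is a quick way to tell which precision the file was written in.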

In the meantime, I can use the above code to proceed with what I am trying to do. Hope this helps anyway.

I missed this edit yesterday, possibly while compiling my last post. Yes, I think you are right: there probably is some sort of mismatch in the way the binary data is loaded.

If you come up with a short-term fix, please let me know. In the meantime, in the code that you provided yesterday there does not seem to be an add_vector. Is addition done in two steps (one, add the key; two, add the vector for that key)?

Sorry – the method is nlp.vocab.set_vector(), not add_vector

Yep, worked that out in the end. A gist with my current workflow is available here, in case it is of any further use to anyone else.

Many thanks for the prompt replies and extensive help.

(Retraining helped, but not to the expected extent, BTW. The new “similarities” are much better, but the terms still seem to revolve around the subject at “great distances”.)
