I am working on a use case that is very similar to the insults classifier. However, I had to augment the en_core_web_md model to include terminology that was missing from its vocabulary.
To do this, I ran GloVe on an extract of approximately 140k documents containing the terminology I am after.
I then followed spaCy’s documentation to create a new “Vectors only” package, which installs and loads successfully.
Now, when I move over to Prodigy, I supply a set of seed terms, some of which I know exist in my corpus. I have confirmed this both by examining vectors.txt (the word is there) and from “within the module”, by importing it and calling get_vector on the language’s vocabulary.
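For anyone who wants to repeat the vectors.txt part of that check, here is a minimal sketch. It assumes the usual GloVe text layout (one entry per line, the word first, then space-separated floats); the helper names `load_glove_vocab` and `missing_seeds` are made up for illustration and are not part of spaCy or Prodigy.

```python
from typing import Iterable, List, Set


def load_glove_vocab(path: str) -> Set[str]:
    """Return the set of words found in a GloVe-style text file.

    Assumption: each line looks like "word 0.12 -0.34 ...", so the word is
    everything before the first space.
    """
    vocab: Set[str] = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ", 1)
            # Skip blank or malformed lines that have no vector part.
            if len(parts) == 2 and parts[0]:
                vocab.add(parts[0])
    return vocab


def missing_seeds(seeds: Iterable[str], vocab: Set[str]) -> List[str]:
    """Return the seed terms that have no exact-match entry in the vocab."""
    return [seed for seed in seeds if seed not in vocab]
```

Calling `missing_seeds(my_seeds, load_glove_vocab("vectors.txt"))` lists the seeds without an entry. The comparison is an exact string match, so a seed that differs from the file only in casing, or that consists of several words, will be reported as missing.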
However, Prodigy doesn’t start anywhere near those terms. I have answered with a large number of "don’t care"s so far, but the suggestions still don’t get near the sort of topic I need to be tagging.
Do you have any ideas on how to improve this?
- Could it be that I have omitted something during compilation and packaging of the new vocabulary?
 - Should I completely substitute the previous vocabulary?
 - I noticed that the Language class encodes a large amount of language-specific code. In my case, the only thing I modify is the vocabulary of an existing model. My assumption is that the rest of the information is not wiped out. Is that correct?
 ), everything seems to be “casted” to