I am working on a use case that is very similar to the insults classifier. However, I had to augment the en_core_web_md
model to include terminology that was missing from its vocabulary.
To do this, I trained GloVe vectors on an extract of approximately 140k documents containing the terminology I am after.
I followed spaCy’s documentation to create a new “Vectors only” package, which installs and loads successfully.
Now, when I move over to Prodigy, I supply a set of seed terms, some of which I know exist in my corpus. I have confirmed this both by examining vectors.txt (the word is there) and from “within the module”, by importing it and then calling get_vector on the language’s vocabulary.
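For reference, the two checks I describe above amount to roughly the following sketch. The function names are my own illustration, not part of spaCy or Prodigy, and I am assuming that vectors.txt uses the usual GloVe text layout (one token per line, followed by its space-separated vector values) and that `nlp` is the pipeline returned by loading the new package.

```python
from pathlib import Path


def terms_in_glove_file(path, terms):
    """Return the subset of `terms` that occur as a token in a GloVe-style text file.

    Assumption: each line is "<token> <v1> <v2> ...", so the token is
    everything before the first space.
    """
    wanted = set(terms)
    found = set()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            token = line.split(" ", 1)[0].strip()
            if token in wanted:
                found.add(token)
    return found


def terms_without_vector(nlp, terms):
    """Return the terms for which the loaded vocabulary reports no vector.

    `nlp` is assumed to be a loaded spaCy pipeline; only `nlp.vocab.has_vector`
    is used here.
    """
    return [term for term in terms if not nlp.vocab.has_vector(term)]
```

Both checks compare the seed terms exactly as written, so a seed term whose casing differs from the token in vectors.txt would be reported as missing.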
However, Prodigy doesn’t even start anywhere near those terms. I have entered a large number of "don’t care"s so far, but it still doesn’t get near the sort of topic I need to tag.
Do you have any ideas on how to improve this?
- Could it be that I have omitted something during compilation and packaging of the new vocabulary?
- Should I completely substitute the previous vocabulary?
- I noticed that the Language class encodes a large amount of language-specific code. In my case, the only thing I modify is the vocabulary of an existing model. My assumption is that the rest of the information is not wiped out. Is that correct?