I suspect that I’m doing something wrong, but can’t figure out what it is. I’m trying to produce seed terms for textcat training. I’m working with Prodigy 1.3.0 and spaCy 2.0.9.
Here’s what I’ve done.
python -m spacy download en_vectors_web_lg
prodigy terms.teach pet_seeds en_vectors_web_lg --seeds "dog, cat, fish, bird"
And I get Initialising with 4 seed terms: dog, cat, bird, fish
When I start the teaching session in the web browser, I am presented with very unusual words. Here’s a screenshot with the most recent set of words.
I’ve tried with the en_core_web_md model, with the same results. I’ve also tried with different initial seed values. No matter what I try I get uncommon words. Any ideas where my error is?
BTW, the recipes documentation does not include the --seeds parameter for the terms.teach recipe. I also tried omitting that but was given an ‘unrecognized arguments’ error.
I decided to try a fresh install into a new virtual environment, and that seems to have resolved the issue. I’m getting much better results with that. I wonder what in my previous virtual environment could have caused this issue.
Thanks for updating with your solution – that’s very mysterious
Maybe spaCy didn’t upgrade or install cleanly… but then again, there shouldn’t be any differences in the model compatibility across 2.x versions and v1.x wouldn’t have worked at all.
If you previously had spaCy v2.0.4 installed and the upgrade didn’t work properly, this might be a possible explanation: v2.0.5 (which was released a day after) fixes a bug that could cause vectors to be set to None. If there are no vectors, all entries in the vocab will be just as similar to your seed terms, and terms.teach will suggest whichever vocab entries it comes across first… which might explain the random words you were seeing. (I might be completely wrong, though – this was just the first idea that came to mind.)
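To illustrate the reasoning above, here is a minimal sketch in plain Python (not Prodigy's or spaCy's actual code). The toy vocab, the `cosine` helper and the choice to return 0.0 for empty vectors are all assumptions made for the example. It only shows that when every vector is empty, every candidate ties, so a ranking falls back to whatever order the vocab happens to be in.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm (assumption)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Hypothetical toy vocab in which all vectors were lost (all zeros).
broken_vocab = {"zyzzyva": [0.0, 0.0], "kitten": [0.0, 0.0], "quux": [0.0, 0.0]}
target = [0.0, 0.0]  # the seed vector is empty too in this scenario

scores = {word: cosine(vec, target) for word, vec in broken_vocab.items()}

# sorted() is stable, so tied entries keep their original vocab order.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

With real vectors, "kitten" would be expected to outrank the other two for pet-related seeds; here nothing distinguishes the entries.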
I was trying to reuse a virtual environment that I had set up for work with AllenNLP, but it was also managed by conda. Doing a pip install of prodigy into a conda environment may have been my problem :-).
Hi, I suspect I’m having a related problem to this, but the proposed solution isn’t exactly working.
I’m using Prodigy 1.4.1 and spacy 2.0.12 in a fresh virtual environment. When I try to do a basic test case scenario prodigy terms.teach test en_core_web_md -se 'cat, dog, rabbit', it seems to work fine, giving terms like “tabby”, “kitten” or related things.
However, I also have a custom language model made from loading gensim pre-trained vectors into a blank spaCy 'en' model. The behavior works as expected inside of spaCy: nlp('aspirin').similarity(nlp('ibuprofen')) -> 0.708 and nlp('donut').similarity(nlp('ibuprofen')) -> 0.01.
When I try to use this model in prodigy, all I get back as suggestions to my seed terms list (about 15 drugs and chemical names) are short, unrelated terms:
“ol, not, pm, ll, gon, sha, does, ta…” It presents about 20 of these (all rejected), loops through the same terms again and then runs out of examples.
Do you have any suggestions as to why this might be happening?
To find other terms, the terms.teach recipe will iterate over the entries in the model's vocab. So maybe your custom model doesn't actually have the words present in its vocabulary?
This would explain why the only terms you see are the seed terms (which were added to the target Doc and are then part of the vocab) and why it works when you use the model manually (because words you process are then added to the vocabulary). You can test this by looking at len(nlp.vocab) – the number should be roughly the number of words you've added vectors for.
In your code, make sure to use the vocab.set_vector method to also add the word to the vocab. Alternatively, you could also use vocab.strings.add to add strings to the vocabulary directly.
Ah, thanks @ines I think that’s exactly the problem, I guess I was setting the vocab/vectors incorrectly.
So it does not suffice to just say: nlp.vocab.vectors = spacy.vocab.Vectors(data=gensim_vectors.vectors, keys=gensim_vectors.index2word), because it does not actually add the strings to the vocabulary until I try to process a doc with those strings in them I suppose.
I tried using for word in gensim_vectors.index2word: nlp.vocab.strings.add(word.encode('utf-8'))
after loading in vectors as above, but it didn’t affect the result of len(nlp.vocab), so maybe that’s also an incorrect usage. And then, I suppose the issue with using for i, word in enumerate(gensim_vectors.index2word): nlp.vocab.set_vector(word, gensim_vectors.vectors[i]) is that it takes a very long time (there are about 2 million tokens in the vocabulary). But if the second situation is what’s required I can wait.
[EDIT] Actually I checked len(nlp.vocab.strings) after adding the strings directly with nlp.vocab.strings.add and it seems to be correct, but nlp.vocab is still small. So maybe that’s fine, I’ll see if it produces the expected behavior in prodigy.
There are a few levels at which a word can be associated with a word vector. The motivation for the multi-level approach is efficiency, especially in memory usage.
The Vectors class holds a numpy array with the current vectors, and then holds a mapping from uint64 keys to int32 values, where the integer value indicates the row of the vector. This design allows us to have multiple keys mapped to the same vector, which is great because word vectors tend to have lots of very similar rows for near synonyms. The md models make use of this by storing rows for the 20,000 most common words, and then mapping all other keys to the closest vector. So we get word vectors for lots of keys, without a huge data requirement.
We can easily find the uint64 key for any string, as it’s just a hash. But we often also want the reverse: we want to know what string some key corresponds to. This information is owned by the StringStore.
Finally, we also have Lexeme objects which hold other vocabulary information, such as the word probability, cluster, cached lexical attributes, etc. These live in the Vocab object. We store lots of data in these Lexeme objects, so they’re a bit bigger than the strings.
In summary:
You don’t always need a unique word vector for every key. You can map multiple keys to the same vector.
If you know you don’t need the strings, you might not need to add those, saving some space.
Even if there’s no entry in the vocab, you can still retrieve the vector.
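The three levels described above can be modelled with a toy sketch in plain Python. This is not spaCy's implementation: `ToyVectors`, `hash_string` and the plain dicts standing in for the StringStore and the Vocab's lexemes are hypothetical names made up for illustration, and the hash is a stand-in for spaCy's own hash function.

```python
import hashlib

def hash_string(text):
    """Stand-in 64-bit hash (assumption; spaCy uses its own hash function)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")

class ToyVectors:
    """Vector rows plus a key -> row mapping, so several keys can share a row."""

    def __init__(self):
        self.rows = []
        self.key2row = {}

    def add(self, key, vector=None, row=None):
        # Either store a new row, or point the key at an existing row.
        if row is None:
            self.rows.append(vector)
            row = len(self.rows) - 1
        self.key2row[key] = row
        return row

    def get(self, key):
        return self.rows[self.key2row[key]]

vectors = ToyVectors()
strings = {}  # key -> string (the StringStore's job)
lexemes = {}  # key -> lexical attributes (the Vocab's job)

cat_key = hash_string("cat")
cat_row = vectors.add(cat_key, vector=[0.9, 0.1])
strings[cat_key] = "cat"
lexemes[cat_key] = {"is_alpha": True, "is_lower": True}

# "kitty" reuses the row of "cat" and gets no string or lexeme entry.
kitty_key = hash_string("kitty")
vectors.add(kitty_key, row=cat_row)

print(vectors.get(kitty_key))            # vector is retrievable without a lexeme
print([strings[key] for key in lexemes])  # only "cat" is visible via the lexemes
```

In this toy model, anything that iterates over the lexemes only ever sees "cat", even though a vector can be looked up for "kitty" as well.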
My question was mainly in regards to how the terms.teach recipe utilizes the existing vocab to present decisions. I know that in typical spacy usage, I can retrieve the vector for the strings I type in, or calculate the similarity between two strings. But it didn’t seem to work the same way in prodigy for me, or at least what I was being presented didn’t seem to make sense to me. So my original question was about how I am supposed to properly build a spacy language model using pre-trained vectors for use in prodigy, because nothing I was trying seemed to work (actually I’m still having the same issues.)
The source of the built-in recipes is included with Prodigy, so you can actually have a look at how the terms.teach works. To find the location of your Prodigy installation, you can print(prodigy.__file__).
The stream is based on the lexemes in the vocab:
lexemes = [lex for lex in nlp.vocab if lex.is_alpha and lex.is_lower]
The stream will loop over the lexemes and score them according to the target vector. The Probability sorter then decides whether to suggest a term or not. Accepted suggestions are added to the accept_doc, and rejected suggestions to the reject_doc. Those are used to score new incoming vocab entries, based on how similar they are to the accept_doc and how dissimilar they are to the reject_doc.
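As a rough sketch of that loop in plain Python (not the recipe's actual source): the toy vocab, the use of str.isalpha() and str.islower() as stand-ins for lex.is_alpha and lex.is_lower, and the scoring formula "similarity to accepted minus similarity to rejected" are all assumptions for illustration. The real recipe also uses a sorter to decide what to show, which is left out here.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def mean_vector(vectors):
    """Average a list of equal-length vectors, column by column."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical toy vocab: word -> 2-dimensional vector.
toy_vocab = {
    "dog": [1.0, 0.0],
    "puppy": [0.9, 0.1],
    "car": [0.0, 1.0],
    "Truck": [0.1, 0.9],  # filtered out: not lowercase
    "x2": [0.5, 0.5],     # filtered out: not alphabetic
}

# Same idea as the lexeme filter: keep lowercase, alphabetic entries only.
candidates = [word for word in toy_vocab if word.isalpha() and word.islower()]

accept_vector = mean_vector([toy_vocab["dog"]])  # accepted terms so far
reject_vector = mean_vector([toy_vocab["car"]])  # rejected terms so far

def score(word):
    """Assumed formula: close to accepted terms, far from rejected ones."""
    vector = toy_vocab[word]
    return cosine(vector, accept_vector) - cosine(vector, reject_vector)

ranked = sorted(candidates, key=score, reverse=True)
print(ranked)
```

The point to take away is that only entries that pass the filter and are actually present in the vocab can ever be scored and suggested.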
Hi! Not sure if creating a different topic for my question is necessary (sorry if it is), so I'm using this thread as it was the first one related to my issue. Is there a reason I always get single-word matches when using seeds for ner.teach? Examples of my seeds are: it consulting, software development process, web development, etc. Maybe I have to format them differently?
Hi! Do you mean ner.teach or terms.teach? If you mean terms.teach, the reason is that it uses word vectors, and those only contain vectors for single tokens. Since the suggestions are all based on the entries in the vectors table, it means that all there is to suggest are single tokens.
If you want to query vectors for multi-word expressions, you might want to check out sense2vec, which also includes recipes for Prodigy (sense2vec.teach)
Hi @ines! Yes! Thank you so much for your quick response. I found the sentence "only contain vectors for single tokens" in a different entry and it was ... OMG yessss! of course hehe