Why did Prodigy move away from word vectors?


I'm a bit confused about the difference between what's demonstrated in the tutorial video and the recent usage advice. In the insults textcat video here https://prodi.gy/docs/text-classification/#active-learning you suggest using word vectors to find related sentences, but now we're using a patterns JSONL file.

If my patterns aren't enough, how will Prodigy locate the right sentences? For example, if I'm labelling sentences relating to risk and my pattern only has a lemma for "loss", it would miss synonyms like "challenges" or "issues". Also, word vectors would only find single words, whereas patterns can also cover two-word phrases. I feel that removing the word-vectors approach has reduced the space in which Prodigy can locate the appropriate sentences.

Hi! I'm not 100% sure I understand the question correctly, because the general suggestion here has always been the same. One idea for bootstrapping a text classifier is to use matches to pre-select examples that contain certain words and mix them in with the model's suggestions. This can often help move the model in the right direction.

The patterns don't have to cover everything because you're still teaching the model to generalise and annotating other examples without matches. But they can help select more relevant examples, especially if you have lots of raw data and very imbalanced classes. It might be less useful for smaller datasets with more balanced classes.
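To make this concrete, here's a minimal sketch of a patterns file for the "risk" example above. The label name, file name and the specific terms are just placeholders, but the JSONL token-pattern format (one object per line, with a `pattern` that's a list of token attribute dicts) is what Prodigy expects, and a pattern can span multiple tokens:

```python
import json

# Hypothetical patterns for a "RISK" label. Each token in the pattern
# is its own dict, so a two-token phrase is simply two dicts.
patterns = [
    {"label": "RISK", "pattern": [{"lemma": "loss"}]},
    {"label": "RISK", "pattern": [{"lower": "risk"}]},
    # Multi-token pattern: matches "credit risk"
    {"label": "RISK", "pattern": [{"lower": "credit"}, {"lower": "risk"}]},
]

# Write one JSON object per line (the JSONL format Prodigy reads)
with open("risk_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

You'd then pass the file to a recipe, e.g. something along the lines of `prodigy textcat.teach risk_data en_core_web_sm news.jsonl --label RISK --patterns risk_patterns.jsonl` (dataset and source names here are made up).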

There are different ways to create patterns. One is to use word vectors to suggest words similar to some seed terms you come up with (this will only give you single words). There are also approaches to training vectors for multi-word expressions – for example, what we did in sense2vec: https://github.com/explosion/sense2vec. The role of the vectors here is only to help you find similar words. You can also use other resources, like existing terminology lists.
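As a toy illustration of that bootstrapping idea, here's a stdlib-only cosine-similarity lookup over a hand-made vector table. The words and values are invented for the sketch – in practice the vectors would come from a real source like a spaCy model with vectors or sense2vec:

```python
import math

# Toy stand-in for real word vectors (values are made up)
vectors = {
    "loss":      [0.9, 0.1, 0.0],
    "losses":    [0.85, 0.15, 0.05],
    "challenge": [0.7, 0.3, 0.1],
    "banana":    [0.0, 0.1, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, n=3):
    """Rank all other words in the table by similarity to a seed term."""
    seed = vectors[word]
    scored = [(w, cosine(seed, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda item: -item[1])[:n]

# Starting from the seed "loss", related terms like "losses" and
# "challenge" rank far above the unrelated "banana".
print(most_similar("loss", n=2))
```

The top suggestions then become candidate seed terms or pattern entries you review yourself – the vectors only propose, you decide.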

The use of word vectors to bootstrap terminology lists shouldn't be confused with using vectors as embeddings to initialise a model – the idea there is to start with more meaningful token embeddings and boost your accuracy. Transformer embeddings like BERT fulfil a similar role.
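To make the distinction concrete, here's a stdlib-only sketch of that second use: seeding an embedding table from pretrained vectors where available, and falling back to small random values otherwise. All names and numbers here are made up – this is the general idea, not any library's actual implementation:

```python
import random

# Hypothetical pretrained vectors (would come from trained word vectors)
pretrained = {
    "loss": [0.9, 0.1, 0.0],
    "risk": [0.8, 0.2, 0.1],
}
vocab = ["loss", "risk", "synergy"]  # "synergy" has no pretrained vector
dim = 3

embedding = {}
for word in vocab:
    if word in pretrained:
        # Warm start: copy the meaningful pretrained vector
        embedding[word] = list(pretrained[word])
    else:
        # Cold start: small random initialisation, to be learned in training
        embedding[word] = [random.uniform(-0.1, 0.1) for _ in range(dim)]
```

The model then fine-tunes these rows during training; the pretrained rows just give it a more meaningful starting point than random noise.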