Why did Prodigy move away from word vectors?

Hi

I am a bit confused about the difference between what's demonstrated in the tutorial video and the recent usage advice. Here https://prodi.gy/docs/text-classification/#active-learning, in the insults textcat video, you suggest using word vectors to search for related sentences, but now we are using a patterns JSONL file.

If my patterns are not enough, how will Prodigy easily locate the right sentences? For example, if I am labelling sentences relating to risk and my pattern only has a lemma for "loss", it will miss synonyms like "challenges" or "issues". Also, word vectors would only find single words, but patterns can also include two-word phrases. I feel that removing the word vectors approach has reduced the space in which Prodigy can locate the appropriate sentences.
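Just to illustrate, my patterns file is basically only something like this right now (give or take the exact label name):

```
{"label": "RISK", "pattern": [{"lemma": "loss"}]}
```

So sentences talking about "challenges" or "issues" never get matched.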

Hi! I'm not 100% sure I understand the question correctly, because the general suggestion here has always been the same. One idea for bootstrapping a text classifier is to use matches to pre-select examples that contain certain words and mix them in with the model's suggestions. This can often help move the model in the right direction.
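For example, with the built-in `textcat.teach` recipe you can pass in a patterns file via `--patterns`, and the pattern matches will be mixed in with the model's suggestions as you annotate. Roughly like this, with placeholder names for the dataset, source file and patterns file:

```
prodigy textcat.teach risk_textcat en_core_web_sm ./sentences.jsonl --label RISK --patterns ./risk_patterns.jsonl
```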

The patterns don't have to cover everything, because you're still teaching the model to generalise and annotating other examples without matches. But they can help select more relevant examples, especially if you have lots of raw data and very imbalanced classes. They might be less useful for smaller datasets with more balanced classes.
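For your risk example, the patterns file also isn't limited to a single lemma: you can list several synonyms and multi-token phrases side by side. A sketch, with the label and terms just made up for illustration:

```
{"label": "RISK", "pattern": [{"lemma": "loss"}]}
{"label": "RISK", "pattern": [{"lemma": "challenge"}]}
{"label": "RISK", "pattern": [{"lemma": "issue"}]}
{"label": "RISK", "pattern": [{"lower": "downside"}, {"lower": "risk"}]}
```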

There are different ways to create patterns. One is to use word vectors to suggest similar words for some seed terms you come up with (this will only give you single words). There are also different approaches to training vectors for multi-word expressions, for example what we did in sense2vec (see explosion/sense2vec on GitHub). The role of the vectors here is only to help you find similar words. You can also use other resources, like existing terminology lists.
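This is also what the insults video shows: the `terms.teach` recipe uses a model with word vectors to suggest terms similar to your seeds, and `terms.to-patterns` converts the accepted terms into a patterns file. Roughly like this, where the dataset name and seeds are placeholders and the exact `terms.to-patterns` arguments can differ a bit between Prodigy versions:

```
prodigy terms.teach risk_terms en_core_web_lg --seeds "loss, risk, challenge"
prodigy terms.to-patterns risk_terms ./risk_patterns.jsonl --label RISK
```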

The use of word vectors to bootstrap terminology lists shouldn't be confused with the use of vectors as embeddings to initialise a model – the idea there is to start with more meaningful token embeddings and boost your accuracy. Transformer embeddings like BERT fulfil a similar role.
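If you do want to use vectors in that second sense, the idea is just to make sure the pipeline you train starts out with static vectors included. As a rough sketch of the relevant spaCy v3 config sections (assuming the default textcat ensemble architecture; the exact section paths depend on your config):

```
[initialize]
vectors = "en_core_web_lg"

[components.textcat.model.tok2vec.embed]
include_static_vectors = true
```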
