Prodigy batch train and contextual weights

If I take standard GloVe vector embeddings and use Prodigy's text classification batch train feature, does the batch train algorithm adjust the word weights based on the contextual meaning of the words? Or do I have to run NER batch train first and then use text classification in Prodigy to take that into account?

There are two model architectures available for text classification:

  • The low_data=False architecture uses two convolutional layers after the word vectors, so the model can learn from phrases up to three words long.

  • The low_data=True architecture does not use convolutional layers, so it can only learn from single words.

By default, low_data=True is enabled if you have fewer than 1000 examples. You can also pass low_data=False explicitly when creating the TextCategorizer object, to make sure you are getting contextual meanings. You can find the two architectures defined within spaCy here: https://github.com/explosion/spaCy/blob/master/spacy/_ml.py#L469
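As a rough illustration of the difference (this is not spaCy's implementation), the sketch below lists the word spans each architecture can draw on, following the description above: single words for low_data=True, and phrases of up to three words for low_data=False. The helper names `default_low_data` and `visible_spans` are made up for this example, and the 1000-example threshold simply mirrors the default mentioned above.

```python
def default_low_data(n_examples, threshold=1000):
    """Mirror the default described above: low_data is on below 1000 examples."""
    return n_examples < threshold


def visible_spans(tokens, low_data):
    """List the word spans a model of each kind could learn from.

    Illustration only: low_data=True means single words,
    low_data=False means phrases of up to three words.
    """
    max_len = 1 if low_data else 3
    spans = []
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            spans.append(" ".join(tokens[start:start + size]))
    return spans


tokens = "no evidence of infection".split()
print(default_low_data(500))                 # small dataset: low_data on
print(visible_spans(tokens, low_data=True))   # single words only
print(visible_spans(tokens, low_data=False))  # also includes "no evidence of"
```

With only single words available, "no" and "infection" are seen independently; with three-word windows, a phrase such as "no evidence of" becomes something the model can pick up on.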

I have more than 1 million pre-classified texts for a specific label, so when I use Prodigy batch train, I assume it will learn the contextual meaning, and the resulting spaCy model will contain weights adjusted to the healthcare-specific data. Once I have the spaCy model, can I then use it as a generalized model for any classification problem, not just for the label it was trained on?

Additionally, does text classification perform standard tokenization, lemmatization, stop word removal, etc. during training? I assume yes (this might be a dumb question).

The text classifier tokenizes, but it doesn’t lemmatize or remove stop words. Those processes aren’t always helpful, and the CNN benefits from the presence of function words to learn from non-compositional phrases.
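To see why dropping function words can hurt, here is a small sketch. It uses a plain whitespace split as a stand-in for spaCy's tokenizer and a made-up stop list (`STOP_WORDS` is an assumption for this example, not spaCy's actual list).

```python
# Hypothetical stop list for illustration; not spaCy's actual stop words.
STOP_WORDS = {"is", "out", "of", "the"}


def tokenize(text):
    """Whitespace tokenizer used as a simple stand-in for a real tokenizer."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


ruled_out = tokenize("Surgery is out of the question")
open_issue = tokenize("Surgery is the question")

# The two sentences mean opposite things, but after stop word removal
# they collapse to the same token sequence.
print(remove_stop_words(ruled_out))
print(remove_stop_words(open_issue))
```

The phrase "out of the question" is non-compositional: its meaning lives in the function words, so a model that never sees them cannot tell the two sentences apart.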

Once I build a spaCy model, can I use it as a generalized model?
My understanding is that it might not be effective, since the Attend layer would optimize the weights in vector space for the label currently under consideration. For a healthcare-specific use case, would it be better if I:

  1. Take a corpus of a million events (docs) and do NER batch training using GloVe or Word2Vec as the starting point to develop a base model (applicable to all use cases), or develop a healthcare-specific word2vec model if we have more than 10 million events?
  2. Use Prodigy, with either manual annotation or pre-annotated data, to develop a spaCy model for a specific use case (using the word2vec step above).
  3. Use the model to predict the outcome for a given text.

Any thoughts on the above?

I might be misunderstanding the question here. Let me know if it seems so.

The choice of doing NER training or text classification training mostly depends on your end goal. In some situations you can do sentence classification instead of NER to achieve some purpose; in other cases you definitely want entity recognition. It depends on the task, and what sort of output you need.

If per-sentence labels can solve your business need, I would definitely try to train the text classifier first. Sentence labels are generally easier for the model to predict, and also faster to annotate. Labelling specific sequences of text introduces a lot of difficult decisions your task might not care about. These decisions slow down learning because if the model is off by one token, it must consider its answer incorrect. Similar considerations apply during annotation.
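The off-by-one point can be made concrete with a small sketch of exact-match span scoring. The function name, the token indices, and the CONDITION label are all made up for this example; it only illustrates the idea that a span prediction counts as correct when its boundaries and label match exactly.

```python
def exact_span_matches(gold, predicted):
    """Count predicted (start, end, label) spans that match a gold span exactly."""
    return len(set(gold) & set(predicted))


tokens = ["Patient", "has", "type", "2", "diabetes", "mellitus"]

# Span annotation: token offsets with an exclusive end index.
gold_spans = [(2, 6, "CONDITION")]       # "type 2 diabetes mellitus"
predicted_spans = [(2, 5, "CONDITION")]  # "type 2 diabetes", one token short

# The prediction is nearly right, but exact matching gives it no credit.
print(exact_span_matches(gold_spans, predicted_spans))  # 0

# Sentence-level annotation: one label for the whole text,
# so there are no boundaries to get wrong.
gold_label = "MENTIONS_CONDITION"
predicted_label = "MENTIONS_CONDITION"
print(gold_label == predicted_label)  # True
```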

Let me put it a different way. I have two text classification tasks to perform on the same set of millions of documents. I have pre-annotated data for task 1 but not for task 2. If I load the data for task 1 and run Prodigy batch train, it generates a spaCy model. Can I use it as a generalized model for both task 1 and task 2, or is it only applicable to task 1?

If the output model is for task 1 only, then my question is: how do I create a generic spaCy vector model for healthcare data, one that is specific to the domain rather than to a task, using Prodigy or spaCy?