I am newish to both NLP and spaCy/Prodigy, and am in the process of getting acquainted with all the functionality. Some of my questions are really noobish, so feel free to point me in the direction of blogs etc. if answers already exist there.
What difference does it make to training a model in Prodigy using en_core_web_lg vs en_vectors_web_lg? Likewise, what difference would it make to teaching?
Follow up - for pretraining, what difference does the --use-vectors flag make?
I'm a little confused about the workflow for teaching and training models in Prodigy. To teach, I wouldn't start with a blank model, right? But to train, depending on the use case - perhaps a new entity - it may be better to start with a blank model. If so: 1) is it advised to always use the same model for teaching and training? 2) How would I leverage a model pretrained using my raw texts when starting from a blank model?
This is maybe a really simple question - but when I use a model to train, is it automatically updated, or only when I save it? So every time I run textcat.teach, for example, and pass in the model - am I passing in an updated model, or do I have to save the model after every session? Similarly, if I start with a blank model, do I have to save it afterwards?
I know there are a lot of questions here, and they are basic, sorry about that!
These are all great questions. Some of them relate to issues we'd love to find a clearer way of describing --- the vocabulary around this stuff is kind of difficult (everything is "pretrained", everything's a vector, etc).
The en_core_web_lg model has weights for the parser, tagger and NER components, while the en_vectors_web_lg model doesn't. The vectors in the two packages are basically the same, both from the common crawl GloVe data. The en_vectors_web_lg has a slightly larger vocabulary, but the extra words are all quite rare. If you don't want previous NER weights in your model, it's easiest to use the en_vectors_web_lg.
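If you want to see the difference for yourself, the spacy info command prints each package's meta, including the pipeline components and the vectors (this assumes you have both packages installed):

```
python -m spacy info en_core_web_lg
python -m spacy info en_vectors_web_lg
```

The first should list the tagger, parser and ner components alongside the vectors, while the second has an empty pipeline and just the (slightly larger) vectors table.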
spaCy's models build word representations with several features. One option is whether to include a static vectors table as a feature. That's the difference between spaCy's en_core_web_sm and en_core_web_lg models. The --use-vectors flag tells spacy pretrain whether you want to use the static vectors table as a feature in the input or not. Using the static vectors usually improves accuracy a bit, but it makes the model take longer to load, and can be a bit inconvenient if you need to serialize to and from bytes a lot.
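As a rough sketch of what that looks like on the command line (the argument order and available options can differ a bit between spaCy versions, and texts.jsonl is just a placeholder for your raw text file):

```
# Pretrain the tok2vec layer on raw text; --use-vectors also feeds the
# static vectors from en_vectors_web_lg in as an input feature.
python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained --use-vectors
```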
It might be helpful to remember that while you're annotating, the model is seeing data stream in, and it's trying to update itself to be more helpful in assisting the annotation. But once you want to train a model to actually use, you're solving a different sort of problem. Once the data's all there, you may as well shuffle it all up and make several passes over it, just as you would in normal model training.
Prodigy's active learning recipes like ner.teach and textcat.teach therefore don't output a model by default. What you do instead is run textcat.teach, and once the data is there, you run textcat.batch-train and use that model. When you're using textcat.teach, you want to start with a prior model if you have one --- because why not start with a model that knows stuff, as an annotation helper? But when you're running textcat.batch-train, you're often better off starting from a blank slate, and just using the static vectors and/or weights from spacy pretrain.
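Concretely, the workflow might look roughly like this (the dataset name, label and paths are placeholders, and the exact recipe arguments can vary between Prodigy versions):

```
# Annotate with a model in the loop to help suggest examples
prodigy textcat.teach reviews_redflag en_core_web_lg reviews.jsonl --label REDFLAG

# Once the data is collected, train a fresh model on the whole dataset
prodigy textcat.batch-train reviews_redflag en_vectors_web_lg --output ./redflag-model
```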
Imagine you were training in the normal sort of situation, like running a simple experiment on the MNIST corpus. You wouldn't train a checkpoint on MNIST and then resume from it --- that's not helpful, and it just makes your experiment difficult to repeat. Similarly, if you started out with 10k samples of MNIST and trained one model, when you received your next 10k samples you'd rather just start again and train on all 20k together. The same logic applies with Prodigy.
The main difference is that we're not able to provide you the training data for spaCy's NER, parser and tagger models. So in those situations, there can be an advantage to training on top of the previous model. However, you do have to be a bit careful. It's only really helpful if your annotations fit the prior schema well. Otherwise it can be better to start fresh.
There's a setting for this. In recipes that support it, it's the -t2v flag on the command line.
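So, assuming you ran spacy pretrain as in the sketch above and it wrote its weights to ./pretrained, you'd point the batch-train recipe at one of the saved weight files (the exact filename depends on how many epochs you ran):

```
prodigy textcat.batch-train reviews_redflag en_vectors_web_lg -t2v ./pretrained/model999.bin --output ./redflag-model
```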
I hope my answers above have clarified this --- but if not, please do let me know.
Btw, a final thing to note: Sometimes the textcat.manual and ner.manual modes are actually better to use than the active learning recipes, textcat.teach and ner.teach, especially at the start of training. The model's behaviour can be difficult to reason about on a new problem, so sometimes it's better to keep it simple.
This applies especially to the ner.teach recipe, which uses a very aggressive strategy (binary annotation) that only really works when you have a fairly good initial model, or if your problem can be captured fairly well by matcher patterns.
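In case it helps, a patterns file for either ner.teach or textcat.teach is just JSONL with one match pattern per line, using either token patterns or exact strings --- for instance, a patterns.jsonl along these lines (the label and terms are only illustrative):

```
{"label": "REDFLAG", "pattern": [{"lower": "food"}, {"lower": "poisoning"}]}
{"label": "REDFLAG", "pattern": "rip-off"}
```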
Thanks so much for your great support and detailed responses, Honnibal!
That really clears things up a lot. Just to make sure I get it, and a few things I've noticed since this post:
I did begin using teach for a simple classification task - but the label I'm trying to classify is a minority class in the texts (a red-flags classifier for reviews - aggression, insults, racism, food poisoning, etc.). I tried seeding patterns to start, but the teaching model didn't pick up on them, even after I changed the sorting method to prefer_low_scores, as per another suggestion on the forum. I eventually just used the seeds with spaCy to pull out a sub-portion of the data and fed that into textcat.teach... not sure if there is a better way to deal with this?
I now plan to train a model with the labels I've created... After this, I would like to load the full dataset again into teach or eval, this time with the trained model and perhaps the seed terms (?), to create more data. Will this now work better for picking up these minority cases?
Also, should I add the weights to teach/eval, or only to train? The language in the reviews is country- and domain-specific, so I really think the predictions may benefit from pretrained vectors. Or... am I already making full use of the pretrained vectors by including them in training? I think I get a bit confused about what I have to keep completely consistent between recipes.
Finally, what if I change my mind about a label and would like to relabel some texts I rejected previously?
Unrelated question:
If I want to train a multilabel classifier, and I start with manual to create the choice format, can I later use that multilabel format with teach or eval? Those work on accept/reject rather than choice... so I think I'd need a custom solution to change the accept answer into the multiple-choice answer(s)?
I initialized a blank model for teaching and hoped this would give the seeded patterns priority, but this doesn't seem to be the case... is there any way to customize teach to prioritize the patterns for a session?
Follow-up question: I am changing some of my labels. I've loaded the same dataset, with a blank model and seed terms that will bring up a lot of the instances I want to relabel. I know the dataset works with hashes that can't be duplicated - but I just wanted to check that there won't be duplicates from the relabeling?
Similarly, the current data I have is a subset of a larger dataset. If I load the larger dataset and exclude this one, it won't automatically pick up the duplicate texts, right? Rather, I have to remove the duplicates outside of Prodigy and then load the new files.