Awesome! I had forgotten about that option too. I saw it in the docs and thought it might help (so I decided to mention it just in case).
Yes! They can. If you want to test, try `en_core_web_lg` first. You'll need the vectors, which are in the `md` and `lg` models. You may not see a big improvement, but it also shouldn't add much in compute time or memory.
There's sometimes a tendency to jump straight to transformers (`en_core_web_trf`), but they come with challenges (speed, memory, handling the GPU). The speed and simplicity of the spaCy models early on can help you find problems in your annotation scheme, which can sometimes improve your model more than architecture choices (like vectors) or hyperparameters. In a 2018 talk, Matt called it the foundation of the "ML Hierarchy of Needs": essentially, "categories that will be easy to annotate consistently, and easy for the model to learn."
Once you get promising results with your annotation scheme and performance, then you can test `en_core_web_trf`. You could also experiment with different `textcat` architectures.
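If you do want to experiment with architectures, one route (a sketch, assuming spaCy v3 configs; the exact architecture each preset picks can vary by version) is to generate a starter config and pass it to training. `--optimize efficiency` typically gives you the fast bag-of-words text classifier, while `--optimize accuracy` gives the ensemble that uses vectors:

```
# Generate a starter config for an English textcat pipeline
python -m spacy init config config.cfg --lang en --pipeline textcat --optimize accuracy

# Train with that config (config.cfg and the output path are just example names)
python -m prodigy train ./textcat_output --textcat textcat_data --config config.cfg
```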
Here's a related discussion (it was on `ner`, but the same speed/accuracy trade-off for base models applies):
Last idea:
Also, I would recommend using the `textcat.correct` recipe. Don't worry so much about annotating a large volume; focus on getting a feel for how your model performs and where its blind spots are. Even better, correct any mistakes it's making and retrain.
If your current annotations are in `textcat_data` and your model is `my_textcat_model`, you can load that dataset as your source by prefixing `dataset:`:

```
python -m prodigy textcat.correct correct_data my_textcat_model dataset:textcat_data ...
```
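Once you've corrected a batch, you can retrain on both datasets; a quick sketch (Prodigy's `train` recipe accepts comma-separated dataset names, and `./textcat_output` is just an example path):

```
python -m prodigy train ./textcat_output --textcat textcat_data,correct_data
```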
I think you'll uncover some insights by correcting examples (and improve your model, too!).
Let me know if you make progress!