Prodigy vs DistilBERT model

Hello everyone!

I have observed that Prodigy is a really good tool to start with for building strong baselines for text classification/NER tasks. However, when it comes to production-level tasks where both accuracy and F1-score are crucial, transformer-based models do a better job than the models I train through Prodigy.

However, whereas spaCy models (when trained from blank:en) are as small as 8-10 MB, transformer-based models are as large as 700-800 MB. For the text classification task I am running, spaCy gives an F1-score of 83%, whereas DistilBERT gives me an F1-score of 94%.

How can I obtain a better trade-off between model size and model performance? Is there anything that Prodigy or spaCy offers that could help me get good accuracy without having to increase my model size too much?

Thanks & Regards,
Vinayak.

PS: I tried hyperparameter tuning with the existing spaCy/Prodigy models, including dropout, learning rate, train-valid split ratio etc., but that didn't yield any substantially better results...

Hi! When you're training based on blank:en, you're essentially training a model from scratch with no pretrained embeddings or anything as features. So it's not surprising that this model is performing worse than a model that's initialised with pretrained embeddings like DistilBERT etc. The model without embeddings is going to be much smaller (because there are no embeddings) and also much faster (because it doesn't have to encode anything using the embeddings), but it also doesn't have embeddings that it can use as features and take advantage of.

Tuning hyperparameters or experimenting with learning rates etc. can often give you a final 1 or 2% boost in accuracy, but it isn't going to move the needle compared to using pretrained embeddings.

So it sounds like in this case, it'll come down to finding pretrained embeddings that give you the best trade-off between size and speed, and accuracy. spaCy v3 makes it easy to run experiments with different transformer embeddings – you can just swap out the name of the transformer weights to initialise the model with in the config and compare the results. Maybe you can find some smaller BERT-like weights you can use?
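
For example, the relevant part of the training config might look like the sketch below. The architecture string and settings follow the spacy-transformers docs as far as I remember them, so double-check the version suffix against your installed version; the model names are just examples of weights available on the Hugging Face hub.

```
# Sketch of the embedding component in config.cfg – swapping the "name"
# value is all it takes to experiment with different pretrained weights.

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# Try smaller weights here, e.g. "distilbert-base-uncased" or a compact
# BERT variant, and compare size vs. accuracy after retraining.
name = "distilbert-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```

Retraining with `python -m spacy train config.cfg --output ./output` for each candidate and comparing the evaluation scores gives you a quick size-vs-accuracy comparison.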

If you're working with very domain-specific text, you could also experiment with training very specific word vectors and using them as features. If you prune the vectors well, you can end up with a very small vectors table, and very low overhead at runtime (see the notes on optimizing vectors coverage here).
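
A minimal sketch of that vectors workflow, assuming you've already trained word vectors on your domain text (with gensim, fastText, etc.) and exported them to a text format spaCy can read; the pruning threshold here is just an illustrative number:

```
# Build a small vectors-only pipeline, keeping the 20k most frequent vectors
# and remapping the remaining entries to their nearest remaining neighbour.
python -m spacy init vectors en ./domain_vectors.txt.gz ./vectors_model --prune 20000

# Then point the training config at it, e.g. in config.cfg:
#   [initialize]
#   vectors = "./vectors_model"
```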

Alternatively, spaCy's pretrain command lets you pretrain the token-to-vector embedding layer on raw text, using a language modelling objective. The available objectives are variations of the cloze task introduced for BERT and predict an approximation of each word, e.g. its first and last n characters, or its word vector. This means that the resulting embeddings are generally more lightweight, both in size and in terms of runtime cost. You can read more about this here: https://spacy.io/usage/embeddings-transformers#pretraining-details
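
Roughly, that workflow looks like the sketch below. The commands and section names are from the spaCy v3 docs as best I recall, and the file names (raw_text.jsonl, the pretrained weights file) are placeholders, so treat this as a starting point rather than copy-paste:

```
# Add a [pretraining] section (with the characters or vectors objective)
# to an existing training config.
python -m spacy init fill-config base_config.cfg config.cfg --pretraining

# Pretrain the tok2vec layer on raw text (JSONL with a "text" field per line).
python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text ./raw_text.jsonl

# Initialise the real training run from one of the saved weight files
# (the exact filename depends on the epoch you pick, e.g. model4.bin).
python -m spacy train config.cfg --output ./output --paths.init_tok2vec ./pretrain_output/model4.bin
```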