Improving a NER model with transformers (model size issue)

Hi,
I’m trying to improve a NER model for Ancient Greek, a low-resource language, with a transformer (a spaCy model trained on 4,000 NER annotations produced with Prodigy). I tried to pretrain my own transformer on the largest corpus I could build (1.7 GB), but that model hurts the overall accuracy of the spaCy pipeline: it drops from 85 to 40. (I don’t know why this happens: too little pretraining data?)
So I have turned to xlm-roberta-base, which brings my model’s accuracy to 91%, but the trained spaCy model is huge, twice as large as en_core_web_trf. This must be the result of the size of xlm-roberta-base. Which other multilingual transformer could I use, other than bert-base-multilingual, which does not perform as well as RoBERTa in my case? I could not find a distilled version of xlm-roberta-base.

Or is there a way to reduce the size of the spaCy transformer model?

Thanks for the interesting question! The largest part of the XLM-RoBERTa base model is its vocabulary. Since the vocabulary has ~250,000 pieces, 732 MiB of the model's parameters are embeddings.
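That figure is just the embedding matrix: roughly 250k pieces × 768 dimensions × 4 bytes per float32 weight. A quick back-of-the-envelope check (the exact vocabulary size of 250,002 comes from the model config):

```python
# Rough size of xlm-roberta-base's embedding matrix in float32.
pieces, hidden_size, bytes_per_weight = 250_002, 768, 4
print(pieces * hidden_size * bytes_per_weight / 1024**2)  # ~732.4 MiB
```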

Did you try the bert-base-greek-uncased-v1 model? Since its vocabulary consists of only 35,000 pieces, this model is considerably smaller than XLM-RoBERTa-base.
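If you want to compare candidate encoders before training anything, you can read their vocabulary and hidden sizes straight from the Hugging Face configs without downloading the weights. A small sketch; I'm assuming the Greek BERT model is published on the Hub as nlpaueb/bert-base-greek-uncased-v1:

```python
# Compare embedding-matrix sizes of candidate encoders from their configs only.
from transformers import AutoConfig

for name in ("xlm-roberta-base", "nlpaueb/bert-base-greek-uncased-v1"):
    cfg = AutoConfig.from_pretrained(name)
    mib = cfg.vocab_size * cfg.hidden_size * 4 / 1024**2  # float32 bytes -> MiB
    print(f"{name}: {cfg.vocab_size:,} pieces, ~{mib:.0f} MiB of embeddings")
```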

Another solution, which would require some implementation work, would be to prune the vocabulary of your fine-tuned XLM-RoBERTa model. For example, you could run the model over a larger unannotated Greek corpus, keep track of which pieces are used, and then remove the embeddings for the pieces that are never used. You would then map the piece identifiers from the tokenizer to the new embedding matrix.
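A minimal sketch of that idea, assuming a fine-tuned checkpoint on disk and a plain-text corpus file (both paths below are hypothetical placeholders); wiring the pruned matrix and the ID remapping back into the spaCy pipeline is left out:

```python
# Sketch: find which pieces a Greek corpus actually uses and keep only
# those rows of the embedding matrix. Paths are hypothetical placeholders.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("path/to/finetuned-xlm-roberta")

# 1. Collect the pieces used on a large unannotated Greek corpus,
#    always keeping the special tokens (<s>, </s>, <pad>, <unk>, <mask>).
used_ids = set(tokenizer.all_special_ids)
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        used_ids.update(tokenizer(line, add_special_tokens=False)["input_ids"])

# 2. Keep only the embedding rows for pieces that were seen.
old_embeddings = model.get_input_embeddings().weight.detach()  # (250002, 768)
kept_ids = sorted(used_ids)
new_embeddings = old_embeddings[kept_ids].clone()              # (n_kept, 768)

# 3. Map old piece ids to rows of the pruned matrix, so tokenizer output
#    can be translated before it reaches the model.
old_to_new = {old: new for new, old in enumerate(kept_ids)}

print(f"kept {len(kept_ids):,} of {old_embeddings.shape[0]:,} pieces")
```

The remaining implementation work is replacing the model's embedding layer with the pruned matrix and applying the `old_to_new` remapping to every batch of tokenizer output before it reaches the model.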

Hi Daniel,

thanks for your response.

I will try both suggestions. I did not use bert-base-greek because it was trained on a Modern Greek corpus, whereas xlm-roberta seems to have included (by mistake, I guess) the Ancient Greek texts that are on the web. But I will give bert-base-greek a try and then see if I can prune xlm-roberta. I will also experiment with pretraining tok2vec on the corpus I have and see if I get comparable results.

Thanks