Hi all!
I'm currently trying to "convert" four spaCy 2.3 multilabel textcat models to spaCy 3.1 pipelines by retraining them from scratch with spaCy 3.1.
All four models were originally trained with spaCy 2.3 on corpora that contain between 350,000 and 530,000 news articles.
The labels used are topics used in the news business.
Two of the models have been trained for English, two were trained for German.
One model per language contains colloquial topics (Sports, Business, Politics, ...).
The other one uses the IPTC Mediacodes as labels.
The corpora contain between 600 and 1,100 distinct labels.
The average document length is 350 words, with a median of 250 words.
The training machine has an Intel Xeon W-2255 CPU @ 3.70GHz with 10 cores, 128 GB of RAM and an NVIDIA GeForce RTX 3090 with 24 GB of memory.
The OS is Ubuntu Server 20.04.
Training the models with spaCy 2.3 for 12 iterations took 2.5 days per model on average.
The scores per label range from 0.75 for the less frequent topics to 0.95 for the very common ones.
The models are deployed in production and have successfully classified several hundred thousand articles per day.
Kudos to Ines and Matt for making spaCy such a great product. Well done.
Now that spaCy 3.1 has been released, I'm afraid that sooner or later spaCy 2.3 will reach end of life.
So I've decided to switch to the latest version.
As there's no way to simply convert the old models to spaCy 3.1, I started to retrain them from scratch.
So far with no success.
First of all, spaCy 3.1 can't handle the huge corpora, so I split them into several smaller ones.
Even then spaCy quits processing them when the number of articles exceeds 300,000.
So the next plan was to train the models sequentially in chunks of 100,000 articles.
Unfortunately it will then process the articles at a speed of 1.2 articles per second.
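For illustration, splitting a corpus into fixed-size chunks for sequential training can be sketched roughly as follows. This is only a minimal, stdlib-only sketch: `iter_chunks` is a hypothetical helper name, and a small `range` of dummy IDs stands in for the real articles. It does not involve spaCy itself.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most chunk_size items (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        # islice pulls the next chunk_size items without loading the whole corpus.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for the real corpus: 250 dummy article IDs, chunked by 100
# (the real setup would use chunks of 100,000 articles).
articles = range(250)
chunk_sizes = [len(chunk) for chunk in iter_chunks(articles, 100)]
print(chunk_sizes)
```

Because the helper consumes an iterator lazily, only one chunk has to be held in memory at a time.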
I've tried further reducing the corpus and tried out the new streaming mechanism.
I've fiddled around with every setting in the config.
Tried a config for CPU. Another for GPU. Another for efficiency. Another for accuracy.
I've tried several batching configurations as well.
Tried the default batcher spacy.batch_by_words.v1 with different settings. Compounding or static batch sizes.
Switched to spacy.batch_by_sequence.v1. Tried several settings here as well.
But in no way could I persuade spaCy to run any faster than 1.3 articles per second.
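To make the "compounding or static" distinction concrete, here is a minimal pure-Python sketch of the two kinds of batch size schedule. It is an assumption-laden illustration, not spaCy's or Thinc's own implementation: the function names are hypothetical, and the start, stop and compound values are chosen only to keep the example short.

```python
from itertools import islice
from typing import Iterator


def compounding_sizes(start: float, stop: float, compound: float) -> Iterator[float]:
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    limit = float(stop)
    while True:
        yield min(value, limit)
        value *= compound


def static_sizes(size: float) -> Iterator[float]:
    """Yield the same batch size forever."""
    while True:
        yield float(size)


# Illustrative values only (assumed, not taken from any real config).
compound_preview = list(islice(compounding_sizes(100, 1000, 2.0), 6))
static_preview = list(islice(static_sizes(1000), 6))
print(compound_preview)
print(static_preview)
```

A compounding schedule starts with small batches and grows toward the cap, while a static schedule uses the cap from the first step, so the two mainly differ early in training.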
When training on spaCy 2.3.1 I've never had issues with the memory and I've never had issues with training speed on this machine and with these corpora.
I've actually trained part of a model again using spaCy 2.3.1 and the full 530,000 articles corpus to see if there's a problem with the machine.
spaCy 2.3.1 starts at a processing rate as low as the one seen on spaCy 3.1, but then slowly climbs to around 23 articles per second after about an hour and stays there.
550,000 articles minus 20 percent for evaluation leaves 440,000 articles.
440,000 articles at a processing speed of 1.2 articles per second means roughly 100 hours for a training epoch to finish.
Supposing that spaCy 3.1 needs 12 epochs, like spaCy 2.3.1 did, to reach the scores that I've gotten with the production models, it would take about 1,200 hours to retrain one model.
1,200 hours = 50 days. And I've got three more models that need converting, plus a few more planned to support further languages.
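The estimate above can be reproduced with a few lines of Python. This is just a sketch of the arithmetic, using the figures from this post; the function name and dictionary keys are made up for illustration.

```python
def estimate_training_time(total_articles: int, eval_percent: int,
                           articles_per_second: float, epochs: int) -> dict:
    """Estimate training time from corpus size and observed throughput."""
    # Integer arithmetic keeps the train/eval split exact.
    train_articles = total_articles * (100 - eval_percent) // 100
    hours_per_epoch = train_articles / articles_per_second / 3600
    total_hours = hours_per_epoch * epochs
    return {
        "train_articles": train_articles,
        "hours_per_epoch": hours_per_epoch,
        "total_hours": total_hours,
        "total_days": total_hours / 24,
    }


# Figures from the post: 550,000 articles, 20 percent held out,
# 1.2 articles per second, 12 epochs.
estimate = estimate_training_time(550_000, 20, 1.2, 12)
print(f"{estimate['train_articles']:,} training articles")
print(f"{estimate['hours_per_epoch']:.1f} hours per epoch")
print(f"{estimate['total_hours']:.0f} hours, {estimate['total_days']:.1f} days in total")
```

Without rounding the epoch to 100 hours, this comes out at roughly 102 hours per epoch and about 51 days in total, so the figures above are slightly on the optimistic side.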
If nobody has a hint as to what else I could try to get this task done, I think I'll have to start looking for an alternative.
Or can I rest assured that spaCy 2.3.1 will be around for another few years and stick with it?