Slow training on multilabel textcats

Hi all!

I'm currently trying to "convert" four spaCy 2.3 multilabel textcat models to spaCy 3.1 pipelines by retraining them from scratch with spaCy 3.1.
All four models were originally trained with spaCy 2.3 on corpora that contain between 350,000 and 530,000 news articles.
The labels are topics used in the news business.
Two of the models have been trained for English, two were trained for German.
One model per language contains colloquial topics (Sports, Business, Politics, ...).
The other one uses the IPTC Mediacodes as labels.
The number of labels contained in the corpora ranges from 600 to 1,100 distinct labels.
The average word count of the documents is 350 words, with a median of 250 words.
The machine used for training is an Intel Xeon W-2255 CPU @ 3.70GHz with 10 cores equipped with 128 GB RAM and a NVIDIA GeForce RTX 3090 with 24 GB RAM.
OS is Ubuntu server 20.04.

Training the models with spaCy 2.3 with 12 iterations took 2.5 days on average per model.
The scores per label range from 0.75 for the less frequent topics to 0.95 for the very common ones.
The models are deployed in production and have successfully classified several hundred thousand articles per day.

Kudos to Ines and Matt for making spaCy such a great product. Well done.

Now that spaCy 3.1 has been released, I'm afraid that sooner or later spaCy 2.3 will reach end of life.

So I've decided to switch to the latest version.

As there's no way to simply convert the old models to spaCy 3.1 I started to retrain them from scratch.
So far with no success.
First of all, spaCy 3.1 can't handle the huge corpora, so I split them into several smaller ones.
Even then spaCy will quit processing them when the number of articles exceeds 300,000.
So the next plan was to train the models sequentially using chunks of 100,000 articles.
Unfortunately it will then process the articles at a speed of 1.2 articles per second.

I've tried further reducing the corpus and tried out the new streaming mechanism.
I've fiddled around with every setting in the config.
Tried a config for CPU. Another for GPU. Another for efficiency. Another for accuracy.
I've tried several batching configurations as well.
Tried the default batcher spacy.batch_by_words.v1 with different settings, compounding or static.
Switched to spacy.batch_by_sequence.v1. Tried several settings here as well.
But no way could I persuade spaCy to run any faster than 1.3 articles per second.

When training on spaCy 2.3.1 I never had issues with memory or with training speed on this machine and with these corpora.

I've actually trained part of a model again using spaCy 2.3.1 and the full 530,000 articles corpus to see if there's a problem with the machine.
spaCy 2.3.1 starts at a processing rate equally low to the one seen on spaCy 3.1, but then slowly climbs to around 23 articles per second after about an hour and stays there.
550,000 articles minus 20 percent for evaluation leaves 440,000 articles.
440,000 articles at a processing speed of 1.2 articles per second means roughly 100 hours for one training epoch to finish.
Assuming that spaCy 3.1 needs 12 epochs, like spaCy 2.3.1 did, to reach the scores I've gotten with the production models, it would take 1,200 hours to retrain one model.
1,200 hours = 50 days. And I've got three more models that need converting, plus another few planned to support further languages.
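For reference, the arithmetic behind this estimate (a rough sketch using the figures quoted above):

```python
# Back-of-the-envelope check of the training-time estimate above.
train_examples = 550_000 * 0.8            # 20% held out for evaluation
rate = 1.2                                # articles per second observed
hours_per_epoch = train_examples / rate / 3600
total_hours = 12 * hours_per_epoch        # assuming 12 epochs as in v2.3
days = total_hours / 24                   # roughly 50 days per model
```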

If there's nobody around with a hint what else I could try to get this task done, I think I'll have to start looking for an alternative.

Or can I rest assured that spaCy 2.3.1 will be around for another few years and stick with it?

Not really an answer to your question, but have you tried alternatives like fasttext? Given the description of your task, it sounds like fasttext (or another library) might be a better fit.

Edited to add: another suggestion we would make for large text classification problems is vowpal wabbit.

Hi Adriane,

thanks for the reply.

To be honest, I'd rather stick with spaCy than start to experiment with new technologies. But thanks for pointing me to them as an alternative.

What I don't understand is why something that worked extremely well in spaCy 2.3 wouldn't work in spaCy nextGen.

If I weren't afraid of spaCy 2.3 reaching end of life, I'd just stay with it. So what about Explosion's future plans? How long will spaCy 2.3 be supported?

On the spacy side of things, the memory usage while training is definitely higher in v3 (due to otherwise advantageous Example objects, which are helpful for lots of types of annotation but not really for cats). I wouldn't have expected such large speed differences while training, but the textcat architectures did see a number of changes in v3 and it's possible that some of the differences are only noticeable with large numbers of labels.

In terms of the training corpus size in RAM, I think that streaming the training corpus should help: https://spacy.io/usage/training#custom-code-readers-batchers. You'd have to set your custom corpus reader up to shuffle and provide the examples multiple times in order to have the same number of epochs as before. During training, a streamed corpus will look like one long epoch 0 in the output. The dev corpus still needs to be finite and fit in memory.
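A minimal sketch of the looping, shuffling idea in plain Python (the actual spaCy registration and Example construction are omitted; all names here are hypothetical):

```python
import random

def looping_reader(examples, n_epochs=None, seed=0):
    # Yield the corpus endlessly (or for n_epochs passes), reshuffling
    # on every pass. In a real spaCy v3 setup this logic would live in a
    # function registered via @spacy.registry.readers(...) that yields
    # Example objects built from your (text, cats) pairs.
    rng = random.Random(seed)
    epoch = 0
    while n_epochs is None or epoch < n_epochs:
        shuffled = list(examples)
        rng.shuffle(shuffled)
        yield from shuffled
        epoch += 1

# Three passes over four items -> a stream of twelve examples.
items = list(looping_reader(["a", "b", "c", "d"], n_epochs=3))
```

With n_epochs=None the generator never terminates, which is what a streamed training corpus looks like to spaCy: one long epoch 0.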

But I am not sure about the speed issue, unless your server is always thrashing while training due to the corpus size and that is responsible for the speed difference.

Thank you again for the quick reply Adriane,

About the speed issue:

It might be that I'm mistaken and that we need to clarify a few technical terms first. :grinning:

When training with spaCy 2.3 I used prodigy as a "frontend" to spaCy with a slightly customized version of the included train recipe.

The basic train recipe accepted an iterations argument and I always assumed that iterations actually meant epochs.

During training, the training progress view would show a row for each iteration/epoch with a progress bar that went from example 0 to the total number of training examples in the corpus. Additionally, the progress bar showed the rate of examples processed per second. That's the source of my statement that spaCy 2.3 processed 23 examples per second.

Since I first posted my issues with multilabel textcat training, I've run more experiments. And I'm not sure anymore that the information spaCy 3.1 shows in the progress bar is comparable to what spaCy 2.3 showed. My impression now is that what spaCy 3.1 shows as the processing rate is not examples per second but processed batches per second. Additionally, the figures under the # column don't seem to show the number of processed examples but rather the number of processed batches.

If this is not the case, I have no idea why spaCy 3.1 starts the next epoch before it has finished the previous one. And yes, the values in the config for max_steps (0 = infinite) and patience are set high enough that it shouldn't stop processing an epoch at an early stage. So the figures under the # column must be the count of completed (mini-)batches. Or should I say steps?

If this is the case, the processing speed must be measured in steps/batches per second as well, because dividing the figure in the # column by the number of items processed per second gives the time it took the progress bar to reach its end.

Now: if the (mini-)batch/step size is set to compounding, this would clarify why I don't see an increase in processing speed. Actually, a slow decrease of the processing speed from the very beginning of training to the point where the compounding function reaches its maximum could easily be explained, as the (mini-)batches/steps get bigger and thus take longer to process. (Time is relative. :grinning: )

And if my assumptions so far are right, and we assume a maximum (mini-)batch/step size of 16 examples and multiply that by the 1.27 it/s I'm currently seeing, we end up at a processing rate of 16 * 1.27 = 20.32 examples per second, which isn't far from the 23 examples per second I used to see with spaCy 2.3.

16 examples per (mini-)batch/step is only an assumption, because that was the value the old spaCy/prodigy versions used as a default when setting the batch size to -1.

As the default config for multilabel text classification now uses a spacy.batch_by_words.v1 batcher with compounding(100, 1000, 1.001), and my average document size is 330 words, I have no idea how big the (mini-)batches/steps really are. As the discard_oversize parameter is set to false and tolerance is set to 20 percent, I doubt that spaCy will ignore any of the examples, and I really have my doubts whether the batch_by_words batcher makes any sense at all for textcat training.

Yes, I had seen this and already tried it. Due to the speed issues I had given up on it, but I am currently retraining the German model for the colloquial news topics with the 550,000-example corpus. Currently about 25% of my 124 GB of RAM is in use. What a waste of space. :grinning:

With training running over the weekend, by the beginning of next week I'll have more insight into how things went.

If my assumptions about what is what in spaCy's information view are correct, I'll unfortunately never know how many epochs it took and how many examples were processed to reach the point the model (pipeline) will have reached by then. I had thought about implementing some kind of logging in the custom corpus reader, but decided that this might slow down the data retrieval.

And as training will now run in an infinite loop and never come to an end, I'll never see the scores per topic. Guess I'll have to ask the meta.json to know how good or bad we're doing.

Anyhow: Sorry for another lengthy post. Thanks for bearing with me. And could you please confirm whether my findings about what spaCy displays in the training info view are correct?

TIA and enjoy the weekend. Bye for now.

The default prodigy train output is now closer to spacy train and a bit different from spacy v2. The E column is epochs and the # column is steps, where each step is one batch. The eval lines then appear every N steps rather than being tied to the epoch count, and don't necessarily correspond to a particular point at the beginning or end of an epoch.

Because the batch size typically increases during training (by default it increases from 100 to 1000), the number of epochs (or partial epochs) between eval lines increases as training goes on. You can see in the example here that the first couple of eval lines cover less than one epoch, but towards the end there's more than one epoch between evals: https://prodi.gy/docs/recipes#train. The epoch number also just means that this eval line fell somewhere within that epoch; it doesn't tell you where within the epoch it was.

The training output doesn't show the text count directly. You can set max_steps or max_epochs if you know exactly how long you'd like to train it for (well, max_epochs obviously won't work for a streamed corpus), but usually we'd recommend using patience, which is an early stopping option that waits for N steps after the point where the overall score stops improving. Here are all the training settings: https://spacy.io/api/data-formats#config-training

For our internal spacy pipeline training we set a fairly high max_steps and then patience stops it before it gets to max_steps in nearly all cases.

Does that help?

Hi Adriane,

Does that help?

yes it does. Over the weekend I implemented a logger in my custom corpus reader that writes a log entry after every 500th line read. The results from this log confirmed my findings and your expertise. :grinning:
It also helped explain why textcat training in spaCy 3 is still slower than it was in spaCy 2.
When you use the config widget on the spaCy website with the following settings:

language: German
components: text
text classification: exclusive categories unchecked
Hardware: CPU
Optimize: Efficiency

it'll generate a config that contains the spacy.batch_by_words.v1 batcher with the following settings:

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

In my case, where one example contains 330 words on average, this means that before compounding reaches its maximum of 1,000 words, the training batches will on average contain only one example. Once compounding reaches the maximum, a batch will contain 3 documents on average.

Apparently the time it takes a batch to get processed is more or less fixed at a little under a second. So in order to speed up training, the batches need to be bigger here.

So I raised the start value for compounding to 350 and the stop value to 5,500 words.
So we'd start with approximately one example per batch and after about 2,750 steps would run with 16 examples per batch.
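For reference, the number of steps the compounding schedule needs to reach its maximum can be computed directly (a sketch, assuming the 330-word average document length mentioned above):

```python
import math

def steps_to_max(start, stop, compound):
    # compounding.v1 multiplies the batch size by `compound` each step,
    # so it reaches `stop` after log(stop/start) / log(compound) steps.
    return math.ceil(math.log(stop / start) / math.log(compound))

default_steps = steps_to_max(100, 1000, 1.001)   # default config
raised_steps = steps_to_max(350, 5500, 1.001)    # the raised settings
docs_per_batch = 5500 / 330                      # ~16 docs of 330 words
```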


[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 350
stop = 5500
compound = 1.001
t = 0.0

Additionally, I set patience to 90,000 (assuming 16 examples per batch, this should equal about three epochs) and eval_frequency to 30,000 (about one epoch).
max_epochs is set to -1 (streaming examples) and max_steps to 0 (infinite).

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 90000
max_epochs = -1
max_steps = 0
eval_frequency = 30000
frozen_components = []
annotating_components = []
before_to_disk = null
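As a quick sanity check on these values (assuming roughly 16 examples per batch once compounding has maxed out, and the 440,000-example training split):

```python
# patience = 90,000 steps and eval_frequency = 30,000 steps, with
# ~16 examples per step once the batch size has maxed out.
train_examples = 550_000 * 0.8
steps_per_epoch = train_examples / 16            # ~27,500 steps
patience_in_epochs = 90_000 / steps_per_epoch    # a bit over 3 epochs
eval_every_epochs = 30_000 / steps_per_epoch     # a bit over 1 epoch
```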

After two epochs I then looked at the cats scores per type in the meta.json of the best model. Interestingly, I only found 36 types with non-zero values. The other 556 types showed only zeros for precision, recall and F-score.

I then stopped the training and started again with the spacy.batch_by_sequence.v1 batcher and the following settings:

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 1
stop = 16
compound = 1.001
t = 0.0

I assume that this is equivalent to the batching that prodigy used with spaCy 2 when setting the batch size to -1?

So far only one epoch has finished and the results in meta.json show the same effect.

So I really don't know how to continue and I've got some new questions for you:

Will the other classes get score results in the next epochs?
Or are the batch sizes now too large, and is that why training only learns the most frequent classes?
Or do we need more than one evaluation during an epoch?

Hope your patience is at a high enough value to find me some more answers. :slight_smile:

Thank you again.

P.S.:

Had meant to ask this before.

E    #       LOSS TEXTC...  CATS_SCORE  SCORE
---  ------  -------------  ----------  ------
  0       0         148.25       38.10    0.38
Epoch 1:   1%|█                                                                                                                                                       | 209/30000 [02:39<6:23:50,  1.29it/s]
/home/kamiwa/.pyenv/versions/prodigy3/lib/python3.8/site-packages/thinc/backends/ops.py:576: 
RuntimeWarning: overflow encountered in exp
  return cast(FloatsType, 1.0 / (1.0 + self.xp.exp(-X)))
Epoch 1:  38%|███████████████████████████████████████████████████████▊                                                                                            | 11309/30000 [2:33:03<4:12:05,  1.24it/s]

I have seen this warning in another post here on the forum as well.
Is it something we should worry about?

I don't think the RuntimeWarning should affect the results.
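To illustrate why: the warning comes from the logistic (sigmoid) activation, and even when exp(-x) overflows, the result simply saturates to the mathematically correct limit. A small sketch using the stdlib math module (which raises OverflowError where numpy would warn and return inf):

```python
import math

def sigmoid(x):
    # 1 / (1 + exp(-x)); for very negative x, exp(-x) overflows --
    # this is what the thinc RuntimeWarning is reporting.
    try:
        e = math.exp(-x)
    except OverflowError:
        e = math.inf  # numpy warns and returns inf here
    return 1.0 / (1.0 + e)

low = sigmoid(-1000.0)   # overflow path: saturates to 0.0
high = sigmoid(1000.0)   # exp underflows to 0.0: result is 1.0
```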

With the streaming corpus, I suspect that the component isn't getting initialized with all the labels, since by default it only peeks at the first 100 examples. Try using a list of labels in the [initialize] block with require = true, see: https://spacy.io/api/top-level#read_labels

You probably don't want to process your whole corpus with init labels, so it would be easier to run it over a small subset to get the format right (not all components are this simple, but for textcat I think it's just a JSON list) and then add all your labels by hand.
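As a minimal sketch of that format (the path and label names here are placeholders): for textcat, the labels file is just a JSON array.

```python
import json
import os
import tempfile

# Placeholder labels -- in practice this would be your full label set.
labels = ["Sports", "Business", "Politics"]

path = os.path.join(tempfile.gettempdir(), "textcat_multilabel.json")
with open(path, "w", encoding="utf8") as f:
    json.dump(labels, f, ensure_ascii=False, indent=2)

# Reading it back shows the round trip is lossless.
with open(path, encoding="utf8") as f:
    loaded = json.load(f)
```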

Sorry for not mentioning it before, but I already added a reader for the labels in my config.
And as you said, it's just a JSON file containing a list of the labels. I've just checked again and it is complete. So this is not the culprit.

[initialize.components]

[initialize.components.textcat_multilabel]

[initialize.components.textcat_multilabel.labels]
@readers = "spacy.read_labels.v1"
path = "/opt/python/prodigy3/pportal_train_data/labels/textcat_multilabel.json"
require = true

What about the batch sizes and the evaluation frequency?

If evaluation only takes place at the end of an epoch, is that enough? Or do we need several evaluation points during an epoch?

How do batch sizes affect the training result? What is the recommended maximum batch size for multilabel textcat training?

Hi Kai,

Apologies for the late response - I think the last questions you posed are difficult to answer in general, and are often use-case and dataset dependent. Either way, if you've ensured that the labels are properly initialized, it should be OK. If there are many labels and a class imbalance though, the classifier might just get too little information (relatively speaking) about some of the less frequent labels, and might not perform well for those. We're aware of this limitation in spaCy and it's why Adriane initially recommended trying out other libraries like fasttext.

That said, it should still technically work, and you'd assume that after several epochs you would at least get non-zero scores for most of the classes. It's difficult to say what the ideal batch size should be; it might take a little experimentation...