Textcat results seem worse in new Prodigy version

I just noticed that v1.11 yields worse results than v1.10.8. I know the accuracy is crazy high, but the task is quite easy. Looking at the following outputs, I'm a little concerned about using the new training script. Any comments? Actually, I'm not even sure the same scores are being used?

I only have one label currently, but each document can have multiple labels in the future.

v1.10.8

❯ prodigy train textcat tags-earnings blank:en
✔ Loaded model 'blank:en'
Created and merged data for 30340 total examples
Using 24272 train / 6068 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 0.573

=========================== ✨  Training the model ===========================

#    Loss       F-Score 
--   --------   --------
1    90.53      0.998                                                                                                                
2    0.05       0.999                                                                                                                
3    0.04       0.999                                                                                                                
...

v1.11.1

❯ prodigy train model --textcat-multilabel tags-earnings --base-model blank:en --gpu-id 0
ℹ Using GPU: 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
[2021-08-19 10:59:23,965] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 24850 | Evaluation: 6212 (20% split)
Training: 24348 | Evaluation: 6162
Labels: textcat_multilabel (1)
[2021-08-19 10:59:44,940] [INFO] Pipeline: ['textcat_multilabel']
[2021-08-19 10:59:44,942] [INFO] Created vocabulary
[2021-08-19 10:59:44,942] [INFO] Finished initializing nlp object
[2021-08-19 11:04:03,859] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 24850 | Evaluation: 6212 (20% split)
Training: 24348 | Evaluation: 6162
Labels: textcat_multilabel (1)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  -------------  ----------  ------
  0       0           0.25       23.22    0.23
.venv/lib/python3.9/site-packages/thinc/backends/ops.py:575: RuntimeWarning: overflow encountered in exp
  return cast(FloatsType, 1.0 / (1.0 + self.xp.exp(-X)))
  0     200          29.42       26.12    0.26
  0     400          24.75       47.24    0.47
  0     600          14.49       82.22    0.82
  0     800           9.31       94.85    0.95
  0    1000           9.09       95.81    0.96
  0    1200           6.06       95.97    0.96
  0    1400           8.99       93.77    0.94
  0    1600           5.46       98.40    0.98
  0    1800           4.66       96.37    0.96
  0    2000           6.60       97.91    0.98
  0    2200           6.03       98.20    0.98
  0    2400           1.88       99.06    0.99
  0    2600           4.84       99.30    0.99
  0    2800           5.86       99.04    0.99
  0    3000           2.18       98.79    0.99
  0    3200           1.81       96.60    0.97
  0    3400           6.08       98.86    0.99
  0    3600           5.32       97.38    0.97
  0    3800           1.12       98.75    0.99
  0    4000           2.05       99.10    0.99
  0    4200           6.30       99.47    0.99
  0    4400           2.72       98.94    0.99
  0    4600           1.83       98.82    0.99
  0    4800           1.85       97.64    0.98
  0    5000           4.61       98.63    0.99
  0    5200           2.04       98.21    0.98
  0    5400           2.85       99.18    0.99
  0    5600           3.72       97.34    0.97
  0    5800           2.62       98.71    0.99
✔ Saved pipeline to output directory
model/model-last

Also, what do E and # denote in the header?

"E" refers to the number of epochs that you've trained, with one epoch representing one pass over all the data. Typically, one epoch consists of multiple batches or steps, which is what's denoted with "#".

I think Prodigy 1.11 has changed the console output formatting slightly to be more in line with how spaCy presents results, and the previous format (like what you're showing for 1.10.8) denoted epochs with "#", which is obviously confusing.

It does look like the results might be the same though, just presented differently. The 1.11 training breaks off before the first epoch is finished, because the performance is already so good (0.99) and not increasing any further.

You mentioned "worse" results, do you mean 0.999 versus 0.99?

Yeah, Prodigy now essentially calls into spacy train, so the output you see is identical to what spacy train produces. It looks like the main difference here is that spaCy v3 shows you a more detailed breakdown of the intermediate steps and only 2 digits (which is usually enough).

Otherwise, the results seem identical and, as @SofieVL mentioned, almost better in a way, because spaCy v3 gets to 0.99 quicker, after only one epoch instead of needing two.

Thanks for the replies.

Are you sure that E is the epoch in the new output while # is the epoch in the old one? I'm asking because the run in the new setup took a lot longer than in the old setup. I don't have the runtimes, but I can investigate. So when I mentioned worse results, I meant that the new version seemed to require a lot more work to achieve similar results. You're saying it's the other way around, and I'd be convinced if it weren't for the slow runtime.

Also, can I force spacy train to run for a specific number of epochs, to make sure it doesn't stop early even though it thinks it's good enough?

Hey,

prodigy train creates a default config that you have little control over, so I'd recommend switching to running prodigy data-to-spacy first, which creates the data and config files, and then running spacy train (as sketched after the list below). In between those steps you can fiddle with the config file to control the printing parameters and the stopping criteria. Specifically, these fields in the [training] block of the config would be important to you:

  • eval_frequency: set to 200 by default. Every evaluation over the dev set takes some time, so frequent evaluations can make the whole process slower. If you increase the value, you'll get fewer lines printed and training should speed up.
  • patience: this controls the "early stopping" behaviour you noticed. Its default value is 1600, which means training will stop after 1600 steps if no improvement was seen. Set this to 0 to disable early stopping altogether.
  • max_epochs: max number of epochs to train for. This defaults to 0 ("indefinite"), but you'll want to set a value here if you set patience to 0.
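
Put together, the workflow could look roughly like this. The ./corpus and ./model directories, the override values and the GPU ID are just placeholders, and you can equally edit the generated config.cfg directly instead of passing --training.* overrides on the command line:

❯ prodigy data-to-spacy ./corpus --textcat-multilabel tags-earnings
❯ python -m spacy train ./corpus/config.cfg --output ./model \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy \
    --gpu-id 0 --training.eval_frequency 1000 --training.patience 0 --training.max_epochs 10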

To really measure whether the textcat is worse or slower, you'd ideally train two models for the same number of epochs, with the same number of evaluations (lines printed), and then measure the performance of both models on an independent test set.
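
For that last step, spacy evaluate can score a trained pipeline against a held-out set in .spacy format; the model and test file paths below are just examples:

❯ python -m spacy evaluate ./model/model-best ./test.spacy --gpu-id 0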