We are not able to get a comparable F-score after upgrading Prodigy and spaCy

Upgrade of spaCy from version 2.3.7 to 3.2.3
Upgrade of Prodigy from version 1.10.8 to 1.11.7


We have been using Prodigy v1.10.8 (spaCy v2) to annotate data and train an NER model.
Our 33 entity types are all "new", so we have been starting from a blank model (`blank:en`). For example, we annotated the following: NB_SUBJECTS, SEX, TESTED_CONCENTRATION, APPLICATION_QUANTITY, NB_WITHDRAWALS, STUDY_START_DATE, …

Our inputs are PDF pages preprocessed to text. Annotations were made at the page level, so one record in our JSONL file represents one page.
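For reference, one page-level record is shaped roughly like this (the values below are invented for illustration; the field layout follows Prodigy's JSONL span-annotation format):

```python
import json

# One page-level record (invented example values; the layout follows
# Prodigy's JSONL span-annotation format):
record = {
    "text": "The study included 24 subjects (12 male, 12 female) ...",
    "spans": [
        {"start": 19, "end": 21, "label": "NB_SUBJECTS"},
    ],
    "answer": "accept",
}
print(json.dumps(record))
```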

We have annotated 7,661 pages. However, roughly 78% of them do not contain any entities at all.

When training under spaCy v2 with the `spacy train` command, we obtained an F-score of 70.7%.

We would like to upgrade to Prodigy v1.11.7 (spaCy v3), so we ran some tests to verify that our performance would not decrease.

  1. First, we trained our NER model on the 7,661 annotated pages (converted to spaCy's format and split into training and evaluation sets) under spaCy v3, using the default config file (NER, CPU) from the quickstart at https://spacy.io/usage/training#quickstart. This gave bad results (around 20%). We also tried a transformer model with spaCy v3 on GPU, again with the default config file, but ran into a memory error; adjusting the batch size parameters did not resolve it.
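For context, our JSONL-to-spaCy conversion is conceptually like the following sketch (simplified; as far as we understand, Prodigy v1.11's `data-to-spacy` command handles the same job, including the train/eval split):

```python
import spacy
from spacy.tokens import DocBin

def records_to_docbin(records, lang="en"):
    """Convert Prodigy-style span records into a spaCy v3 DocBin
    (the binary .spacy training format)."""
    nlp = spacy.blank(lang)
    db = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])
        ents = []
        for s in record.get("spans", []):
            # alignment_mode="contract" silently drops spans that do
            # not line up with token boundaries instead of raising
            span = doc.char_span(s["start"], s["end"], label=s["label"],
                                 alignment_mode="contract")
            if span is not None:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    return db

# records_to_docbin(train_records).to_disk("train.spacy") would then
# write the training corpus referenced by the config file.
```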

  2. Second, since many of our pages contained no entities, we resampled them so that only 40% of the dataset consists of pages with no entity. This gave us a dataset of 5,334 pages. Training under spaCy v2 gave an F-score of 72.1%; under spaCy v3, again with the default config file (NER, CPU) and the patience parameter set to 0, we got 57.6%, well below the spaCy v2 performance. The spaCy v3 transformer again gave a memory error. Note that these F-scores are not comparable with the other strategies, because the evaluation sets differ.
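The resampling step itself is straightforward; a minimal sketch (assuming `records` holds the parsed JSONL pages, and treating a page with an empty `"spans"` list as containing no entities):

```python
import random

def resample_empty(records, empty_ratio=0.4, seed=0):
    """Keep every page that has entities, and downsample the empty
    pages so they make up at most `empty_ratio` of the result."""
    with_ents = [r for r in records if r.get("spans")]
    empty = [r for r in records if not r.get("spans")]
    # n_empty / (n_with + n_empty) <= empty_ratio  ->  solve for n_empty:
    n_keep = int(empty_ratio * len(with_ents) / (1 - empty_ratio))
    random.Random(seed).shuffle(empty)
    return with_ents + empty[:n_keep]
```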

  3. Third, we split our annotated pages into chunks, suspecting that very long texts could be a problem. This produced a lot of chunks containing no entities, so we again resampled so that only 40% of chunks have no entities. As a result, we trained our model on 5,252 chunks (split into training and evaluation sets) under spaCy v2, spaCy v3 on CPU, and spaCy v3 transformer on GPU. The spaCy v2 model performed better than before, and we were finally able to get results from the spaCy v3 transformer, but they were not as good as spaCy v2's; the table below summarizes the results. Again, these results are not comparable with the other strategies.
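For reference, the chunking is conceptually as follows (a simplified sketch that cuts at the last newline before a size limit and shifts span offsets into each chunk; real code would also need to avoid cutting through an entity):

```python
def chunk_record(record, max_chars=1000):
    """Split one page-level record into chunk-level records,
    shifting span offsets so they are relative to each chunk."""
    text, spans = record["text"], record.get("spans", [])
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last newline in the window
            nl = text.rfind("\n", start, end)
            if nl > start:
                end = nl + 1
        chunk_spans = [
            {**s, "start": s["start"] - start, "end": s["end"] - start}
            for s in spans
            if s["start"] >= start and s["end"] <= end
        ]
        chunks.append({"text": text[start:end], "spans": chunk_spans})
        start = end
    return chunks
```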

  4. Finally, in order to have comparable results, we split the 5,334 pages used in our second test into chunks and trained the model again. The evaluation set here consisted of entire pages (the same as in strategy 1), not chunks, and was not resampled (1,533 pages). Again, the results under spaCy v3 CPU were not very good, and the spaCy v3 transformer on GPU could not be run.
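In case it is relevant to the memory errors: as far as we understand, the transformer's GPU memory use is governed mainly by the config sections below (the values shown are illustrative guesses, not settings we have verified):

```ini
# Span getter: how many wordpieces each transformer window sees.
# Smaller window/stride values mean smaller activations on the GPU.
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 64
stride = 48

# Batcher: cap the padded size of each batch and drop outliers.
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
size = 500
buffer = 256
discard_oversize = true
get_length = null
```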

| Dataset | V2 F-score | V3 CPU F-score | V3 GPU transformer F-score |
| --- | --- | --- | --- |
| 1. 7,661 pages | 70.7% | ~20% (default config file) | Memory error; unsuccessful tests on batch size parameters |
| 2. 5,334 pages (7,661 pages resampled at 40%) | 72.1% | 57.6% (patience parameter at 0) | Memory error; unsuccessful tests on batch size parameters |
| 3. 5,252 chunks (pages split into chunks, then resampled) | | (patience parameter at 0) | (patience parameter at 0) |
| 4. Training: 17,233 chunks; eval: 1,533 pages (resampling and chunking on training set only) | | (patience parameter at 0) | Memory error; unsuccessful tests on batch size parameters |

F-scores for strategies 2 and 3 are not comparable with the other strategies.

Do you know what could cause this decrease in F-score? Is there a parameter in the config file that we could change to get an F-score similar to what we had under spaCy v2? And what would be the best strategy for training our model on annotations that come from whole pages?

Thank you in advance for your answer,
