Why is the prodigy train result different from the spacy train result?


I have created an annotated NER dataset using Prodigy. After finishing the annotation, I exported the dataset using Prodigy's data-to-spacy CLI.

Then I ran the training process via the CLI in both spaCy and Prodigy. After training finished, I found that they produce different results: the best f-score from the Prodigy training is 81.792, while the spaCy training gives 82.323. Why is this happening? From what I know, they should produce the same results.

For your information, I am using Prodigy v1.10.8 and spaCy v2.3.5. Here are the steps I followed:

1. python -m prodigy data-to-spacy ./train-70.json ./dev-30.json --lang id --ner my-dataset --eval-split 0.3
2. prodigy train ner my-dataset blank:id --output ./prodigy-model --eval-split 0.3 --n-iter 10
3. python -m spacy train id ./spacy-model ./train-70.json ./dev-30.json -p ner -n 10

The config file for both processes is the same:


Thank you

Hi! Older versions of Prodigy (v1.10 and below) used their own training loop implementation with its own default settings, so small differences in dropout, learning rate or batching can easily account for a difference in accuracy of +/- 1%.

This was actually one of the main reasons we standardised the training process in v1.11+ to call into spacy train directly. It's also a good example of why the config system in spaCy v3 is useful for reproducible experiments: it prevents hidden defaults.
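For example, in spaCy v3 every training hyperparameter that used to be a hidden default is spelled out in the config file, so two runs with the same config behave the same. A rough illustration of the relevant block (the values here are just placeholders, not recommendations):

```ini
[training]
dropout = 0.1
max_epochs = 10
patience = 1600
eval_frequency = 200

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
```

Because dropout, the optimizer, the learning rate and the evaluation schedule are all explicit, there's no separate training loop that can silently apply different defaults.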

Hi Ines

Thank you. So, is it okay to use the data exported by the data-to-spacy command in spaCy v3?

Yes, that's the recommended workflow once you're serious about training your model :slightly_smiling_face:
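Concretely, in Prodigy v1.11+ the data-to-spacy command exports a ready-to-use corpus directory, including a generated config, which you can then train with spacy train directly. Roughly like this (the dataset name and paths are carried over from the example above):

```shell
python -m prodigy data-to-spacy ./corpus --ner my-dataset --eval-split 0.3
python -m spacy train ./corpus/config.cfg --output ./spacy-model \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```

This way Prodigy and spaCy are guaranteed to train on exactly the same data with exactly the same settings.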

Hello Ines

My purpose in using the spacy train CLI is to run experiments varying the number of iterations, like with the prodigy train CLI, and observe the metric scores. Is this possible? I have already tried spacy train in spaCy v3, but I cannot find how to do this.

Thank you

I'm not sure I understand what you're trying to do here. What do you mean by the score of the metrics? If you're looking for per-label stats, you can get the same results and more using spacy evaluate.
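In case it helps, a minimal spacy evaluate call looks something like this (the model and corpus paths are assumed from the steps above): it prints overall precision, recall and f-score plus a per-label breakdown, and --output saves the metrics as JSON.

```shell
python -m spacy evaluate ./spacy-model/model-best ./corpus/dev.spacy \
    --output ./metrics.json
```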

Hi Ines

What I have done with Prodigy and spaCy v2 is run several experiments varying the train/eval split and the number of iterations: for example, a 70:30 split with 50, 75, 100, and 500 iterations, plus an 80:20 split with the same iteration counts. From those experiments, I observed the overall f-score, recall, and precision across all entities to determine the best model (the highest f-score). After that, I look at the per-entity metrics in more detail.

I want to migrate those experiments to spaCy v3.
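One way to run that kind of sweep in spaCy v3 is to override config settings on the command line instead of editing the config for each run. A sketch, assuming the corpus directory exported by data-to-spacy, looping over the epoch counts from the experiments above:

```shell
for epochs in 50 75 100 500; do
    python -m spacy train ./corpus/config.cfg \
        --output ./model-${epochs} \
        --paths.train ./corpus/train.spacy \
        --paths.dev ./corpus/dev.spacy \
        --training.max_epochs ${epochs}
done
```

Each run logs precision, recall and f-score as it trains, and you can compare the resulting model-best directories with spacy evaluate afterwards. For the different train/eval splits, you'd re-export with data-to-spacy using a different --eval-split value.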