This is an interesting situation with spaCy. It gives 3 different accuracy numbers for exactly the same training and test data, trained at different points in time. Training was done with ner.batch-train, always with --no-missing. The vectors and lexemes (and all other files) inside the resulting models have exactly the same size.
You mean you’re rerunning the same command and seeing different results?
Probably the difference is due to different random seeds – if the model is being initialised differently, or the data is shuffled differently, you can end up at pretty different solutions when the data set is small.
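If you want to rule seeding out when calling spaCy directly, here's a minimal sketch of pinning the seeds before training (this only covers code you run yourself, not any shuffling a recipe does internally):

from spacy.util import fix_random_seed

fix_random_seed(0)  # seeds Python's random module, numpy and (if installed) cupy
# ... build the pipeline and run the training loop after this point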
The same data is stored in two different Prodigy datasets, so only the dataset name changes in the command, but both contain exactly the same gold data created with Prodigy. The corpus size is ~3000, which I hope is big enough. The label distribution also reflects real-world data (though it is heavily skewed: the most frequent label covers almost 50% of the examples and the rarest only 2%). With different shuffling, is it really possible for precision to change from 66 to 72? That seems like a big difference.
How are you evaluating the accuracy? Are you using a dedicated evaluation set? If you don’t provide your own evaluation set, the batch-train recipes will hold back a portion of the training data (20% for larger sets, 50% for small sets) for evaluation. This happens after shuffling, so depending on which examples end up in the training vs. evaluation set, this can easily explain differences in accuracy.
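Roughly speaking, the held-back evaluation works like the sketch below (simplified; the loader and the exact size threshold are just for illustration):

import random

examples = load_annotations()   # hypothetical loader for your annotated examples
random.shuffle(examples)        # shuffled first, so the split can differ between runs
eval_split = 0.5 if len(examples) < 1000 else 0.2   # illustrative cutoff, not the recipe's exact logic
n_eval = int(len(examples) * eval_split)
eval_examples = examples[:n_eval]
train_examples = examples[n_eval:]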
The training set is a manually corrected gold set, and the test set is a separate gold set that doesn't overlap with the training data. I have used the same test set to evaluate all 3 models, using the spaCy Scorer.
Hmm. If you run prodigy db-out for the two datasets and diff the exported files, have you verified that they're indeed exactly the same?
I’m having trouble seeing how you might be getting different results here, if the datasets really contain the same data. It’s not obvious to me what could be different.
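For example, something along these lines (a quick sketch using the database API; the dataset names are placeholders) would compare the two datasets directly in Python, ignoring their order in the database:

import json
from prodigy.components.db import connect

db = connect()
examples_a = db.get_dataset("gold_data_a")   # placeholder dataset names
examples_b = db.get_dataset("gold_data_b")
print(len(examples_a), len(examples_b))

set_a = {json.dumps(eg, sort_keys=True) for eg in examples_a}
set_b = {json.dumps(eg, sort_keys=True) for eg in examples_b}
print("identical:", set_a == set_b)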
We are getting a different F1-score on the same data when we train our NER model with the same set-up, architecture and hyper-parameters on different machines, although the score remains unchanged for every epoch of training on the same machine. The Prodigy command is:
I trained an NER model in Prodigy via train ner, using annotated data I created via manual NER annotation. Then I converted my Prodigy training set into a JSON training set to train with spaCy directly (without the Prodigy wrapper). To my surprise, the results didn't match. Here is what I got:
prodigy train ner my_training_set en_vectors_web_lg --output mymodel
Loaded model 'en_vectors_web_lg'
Created and merged data for 7713 total examples
Using 6171 train / 1542 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Best F-Score 81.435
Now convert to JSON:
prodigy data-to-spacy train.json eval.json --ner my_training_set
Now train with spaCy:
spacy train en ./spacy_train_model train.json eval.json -v en_vectors_web_lg
My best F-score now is 55.004.
What am I doing wrong? The models should match in accuracy, right?
@AK5 I've merged your question onto this thread, since it relates to the same topic. The contrast in your case (81 vs. 55) does seem pretty significant, though, so I wonder if there's something else going on here as well. Are you sure that the evaluation examples produced by the eval split are the same, and can you reproduce the same results if you're training with a dedicated evaluation set passed in as --eval-id, instead of letting the train command do the split?
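For example, with a dedicated evaluation dataset (the name here is a placeholder), that would look something like:

prodigy train ner my_training_set en_vectors_web_lg --eval-id my_eval_set --output mymodel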
Thanks for the reports. There are periodic spaCy bugs that do cause non-repeatability, and we need better tests for stability.
That said, as @ines points out, the difference you're seeing is surprisingly large, so it could be down to other factors.
If it does come down to training variation, it's likely that the current cause is different from the original one in January 2019, as there are lots of ways the variation can be introduced. Here's the current issue on spaCy: https://github.com/explosion/spaCy/issues/5551
Does spaCy report an average F-score? If so, how can I get P/R/F per NER entity type?
The figure for Prodigy I pointed out is Best F-Score 81.435. However, if I compute the average over my 10 NER categories, it comes out to 64.16. That's totally fine, because it's around 80 for the 3 entity types I actually care about.
The average F-score reported is the micro-average, rather than the macro-average. The spacy train command does report the per-entity scores in the accuracy.json file, or you can run spacy evaluate.
Prodigy's training recipes by default are a thin wrapper around spaCy, and they use the same Scorer object, which you can find documented here: https://spacy.io/api/scorer
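To make the micro vs. macro distinction concrete, here's a rough sketch using the spaCy v2 API (it assumes an nlp object and a list of eval_examples in (text, annotations) form, which aren't defined here):

from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, examples):
    # examples: list of (text, {"entities": [(start, end, label), ...]}) tuples
    scorer = Scorer()
    for text, annotations in examples:
        gold = GoldParse(nlp.make_doc(text), entities=annotations["entities"])
        scorer.score(nlp(text), gold)
    return scorer.scores

scores = evaluate(nlp, eval_examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])   # micro-average over all entities
per_type = scores["ents_per_type"]                            # per-label P/R/F
macro_f = sum(s["f"] for s in per_type.values()) / len(per_type)
print(macro_f)                                                # macro-average: unweighted mean over labels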
I did a few more experiments and found a simple culprit. It turns out that if I replace the default pipeline of [tagger, parser, ner] with just ner (-p ner), everything goes back to normal: I get F-scores that closely match between the spaCy and Prodigy training sessions.
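For reference, restricting the pipeline to NER in the spaCy command above would look something like this (same paths and data as before):

spacy train en ./spacy_train_model train.json eval.json -v en_vectors_web_lg -p ner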
We are getting a different F1-score on the same data when we train our NER model with the same set-up, architecture and hyper-parameters on different machines, although the score remains unchanged for every epoch of training on the same machine. The Prodigy command is:
@Mayank It looks like you already posted this before (see comment above) and I merged your posts onto this thread to keep the discussion in one place. Also see the comments above for details.
@kapilok If I'm reading the thread correctly, @honnibal did answer above? The underlying issue is in spaCy, not Prodigy, and you can follow it in the spaCy issue linked above.