Different accuracy numbers for *same* training set.

This is an interesting situation with spaCy: it gives three different accuracy numbers for exactly the same training and test data, trained at different points in time. The models were trained with ner.batch-train, all with --no-missing, and the vectors, lexemes and all other files inside the models have exactly the same size.

train_1:
{'uas': 0.0, 'las': 0.0, 'ents_p': 66.46754281162883, 'ents_r': 66.6001596169194, 'ents_f': 66.53378513055611, 'tags_acc': 0.0, 'token_acc': 100.0}
train_2:
{'uas': 0.0, 'las': 0.0, 'ents_p': 72.7313769751693, 'ents_r': 64.28571428571429, 'ents_f': 68.24825248887947, 'tags_acc': 0.0, 'token_acc': 100.0}
train_3:
{'uas': 0.0, 'las': 0.0, 'ents_p': 73.49397590361446, 'ents_r': 63.28810853950518, 'ents_f': 68.01029159519724, 'tags_acc': 0.0, 'token_acc': 100.0}

Runs 2 and 3 are at least close enough, but run 1 is a long way off, even though the model otherwise seems stable. Is there a reason why this is happening?

You mean you’re rerunning the same command and seeing different results?

Probably the difference is due to different random seeds – if the model is being initialised differently, or the data is shuffled differently, you can end up at pretty different solutions when the data set is small.
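If you want to rule that out, here's a minimal sketch of pinning the random state before a spaCy v2-style NER training loop. The seed value, the toy TRAIN_DATA and the loop settings are placeholders, and I'm assuming spacy.util.fix_random_seed is available in your spaCy version:

```python
import random

import spacy
from spacy.util import fix_random_seed, minibatch, compounding

fix_random_seed(0)  # seed the RNGs spaCy/Thinc use for model initialisation
random.seed(0)      # seed Python's RNG, used for the shuffle below

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Placeholder data in spaCy v2's (text, annotations) training format
TRAIN_DATA = [
    ("Apple is buying a U.K. startup", {"entities": [(0, 5, "ORG")]}),
]
for _, annots in TRAIN_DATA:
    for start, end, label in annots["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for itn in range(10):
    random.shuffle(TRAIN_DATA)  # same seed -> same shuffle order every run
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
    print(itn, losses)
```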


The same data is stored in two different Prodigy databases, so only the name of the database changes in the command, but both contain exactly the same gold data created with Prodigy. The corpus size is ~3000, which I hope is big enough. The label distribution also reflects real-world data, although it is quite skewed: one label covers almost 50% and the rarest about 2%. With different shuffling, is it really possible for precision to change from 66 to 72? That seems like a big difference.

How are you evaluating the accuracy? Are you using a dedicated evaluation set? If you don’t provide your own evaluation set, the batch-train recipes will hold back a portion of the training data (20% for larger sets, 50% for small sets) for evaluation. This happens after shuffling, so depending on which examples end up in the training vs. evaluation set, this can easily explain differences in accuracy.
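To make that concrete, here is a rough illustration (not Prodigy's actual code) of why an unseeded shuffle before the hold-out split can move the score between runs, with a made-up corpus size:

```python
import random

examples = [f"example_{i}" for i in range(3000)]

def split(examples, eval_portion=0.2, seed=None):
    """Shuffle, then hold back a portion for evaluation. Returns (train, eval)."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # seed=None -> different order each run
    n_eval = int(len(data) * eval_portion)
    return data[n_eval:], data[:n_eval]

train_a, eval_a = split(examples)
train_b, eval_b = split(examples)
overlap = len(set(eval_a) & set(eval_b))
print(f"{overlap} of {len(eval_a)} evaluation examples overlap between the two runs")
```

With the split done this way, each run is evaluated on a different 20% of the data, so the scores aren't directly comparable unless you fix the seed or pass in a dedicated evaluation set.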


The training set is a manually corrected gold set, and the test set is also a gold set, separate from the training set. I used the same test set to evaluate all three models, with the spaCy Scorer.
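For reference, the evaluation I ran looks roughly like this; a minimal sketch assuming spaCy v2's Scorer and GoldParse, with a placeholder model path and toy TEST_DATA instead of the real test set:

```python
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, test_data):
    scorer = Scorer()
    for text, annots in test_data:
        gold_doc = nlp.make_doc(text)  # same tokenizer as the predictions
        gold = GoldParse(gold_doc, entities=annots["entities"])
        pred = nlp(text)
        scorer.score(pred, gold)
    return scorer.scores  # includes ents_p, ents_r, ents_f (and ents_per_type)

nlp = spacy.load("./train_1")  # placeholder path to one of the trained models
TEST_DATA = [
    ("Apple hired John Smith.", {"entities": [(0, 5, "ORG"), (12, 22, "PERSON")]}),
]
print(evaluate(nlp, TEST_DATA))
```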

Hmm. If you do prodigy db-out for the two files, and diff them, have you verified that they’re indeed exactly the same?

I’m having trouble seeing how you might be getting different results here, if the datasets really contain the same data. It’s not obvious to me what could be different.
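One quick way to check that programmatically, as a sketch with placeholder file names for the two db-out exports:

```python
import json

def load_records(path):
    # Normalise each JSONL line so key order and line order don't matter
    with open(path, encoding="utf8") as f:
        return sorted(
            json.dumps(json.loads(line), sort_keys=True)
            for line in f
            if line.strip()
        )

a = load_records("dataset_a.jsonl")
b = load_records("dataset_b.jsonl")
if a == b:
    print("The two exports contain exactly the same records")
else:
    print(f"{len(set(a) ^ set(b))} records differ between the exports")
```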


Yes, in fact I did that to confirm that they both contain exactly the same data… there is no difference in the diff.

I can send the dataset and the models if required (I couldn't understand much from the models), but I can't post them here because the data is sensitive.

Okay, best to follow this up over email. Looking forward to getting to the bottom of this!

Hi Matthew

We are getting a different F1-score on the same data when we train our NER model with the same set-up, architecture and hyperparameters on different machines, although the score stays the same for every epoch when training on the same machine. The Prodigy command is:

prodigy train ner temp_data,final_data en_vectors_web_lg --init-tok2vec ../pre_trained/tok2vec_cd8_model289.bin --output ./final_model --eval-split 0.2

I've attached the screenshots:

Please advise me
Thanks

Hi,

I trained an NER model in Prodigy via train ner, using annotated data I created with manual NER annotation. Then I converted my Prodigy training set into a JSON training set so it could be trained with spaCy directly (without the Prodigy wrapper). To my surprise, the results didn't match. Here is what I got:

prodigy train ner my_training_set en_vectors_web_lg --output mymodel

✔ Loaded model 'en_vectors_web_lg'
Created and merged data for 7713 total examples
Using 6171 train / 1542 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Best F-Score 81.435

Now convert to JSON:
prodigy data-to-spacy train.json eval.json --ner my_training_set
Now train with spaCy:
spacy train en ./spacy_train_model train.json eval.json -v en_vectors_web_lg

My best F-score is now 55.004.

What am I doing wrong? The models should match in accuracy, right?

Thank you,
AK

@AK5 I've merged your question onto this thread, since it relates to the same topic. The contrast in your case (81 vs. 55) does seem pretty significant, though, so I wonder if there's something else going on here as well. Are you sure that the evaluation examples produced by the eval split are the same, and can you reproduce the same results if you're training with a dedicated evaluation set passed in as --eval-id, instead of letting the train command do the split?

Thanks for the reports. There are periodic spaCy bugs that do cause non-repeatability, and we need better tests for that stability.

That said, as @ines points out, the difference you're seeing is surprisingly large, so it could be down to other factors.

If it does come down to training variation, it's likely that the current cause is different from the original one in January 2019, as there are lots of ways the variation can be introduced. Here's the current issue on spaCy: https://github.com/explosion/spaCy/issues/5551

Does spaCy report an average F-score? If so, how can I get P/R/F per NER entity?
The figure for Prodigy I pointed out is Best F-Score 81.435. However, if I compute the average across my 10 NER categories, it comes out to 64.16. That's totally fine, because it's around 80 for the 3 entities I care about.

The average F-score reported is the micro-average, rather than the macro-average. The spacy train command does report the per-entity scores in the accuracy.json file, or you can run spacy evaluate.
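The distinction matters a lot when the label distribution is skewed. A quick sketch with made-up per-entity counts:

```python
# Made-up counts for two entity types, one frequent and one rare
per_type = {
    "ORG":    {"tp": 80, "fp": 20, "fn": 10},
    "PERSON": {"tp": 5,  "fp": 15, "fn": 20},
}

def f_score(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Micro-average: pool the counts across labels, then compute one F-score.
# This is what the headline ents_f / "Best F-Score" corresponds to.
tp = sum(c["tp"] for c in per_type.values())
fp = sum(c["fp"] for c in per_type.values())
fn = sum(c["fn"] for c in per_type.values())
print("micro F:", round(f_score(tp, fp, fn), 2))

# Macro-average: compute F per label, then average. Rare, poorly-predicted
# labels drag this number down even if the frequent labels do well.
print("macro F:", round(sum(f_score(**c) for c in per_type.values()) / len(per_type), 2))
```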

Prodigy's training recipes by default are a thin wrapper around spaCy, and they use the same Scorer object, which you can find documented here: https://spacy.io/api/scorer

I did a few more experiments and found a simple culprit. It turns out that if I replace the default pipeline of [ner, parser, tagger] with just ner (-p ner), everything goes back to normal: I get closely matching F-scores between the spaCy and Prodigy training sessions.

Hi team Prodigy

We are getting a different F1-score on the same data when we train our NER model with the same set-up, architecture and hyperparameters on different machines, although the score stays the same for every epoch when training on the same machine. The Prodigy command is:

prodigy train ner temp_data,final_data en_vectors_web_lg --init-tok2vec ../pre_trained/tok2vec_cd8_model289.bin --output ./final_model --eval-split 0.2

I've attached the screenshots:

Please advise me
Thanks

@Mayank It looks like you already posted this before (see comment above) and I merged your posts onto this thread to keep the discussion in one place. Also see the comments above for details.

@Ines @honnibal

Since we didn't get a response to the query by @Mayank, we thought it would be better to do a specific post.

What should be our next step?
We're getting > 1% variation between team members on different machines on the exact same experiment.

Please advise.

Thanks,
Kapil

@kapilok If I'm reading the thread correctly, @honnibal did answer above? The underlying issue is in spaCy, not Prodigy, and you can follow the spaCy issue linked above for updates 🙂