Lesser annotations for training despite having more annotations in database

ElisonSherton · September 24, 2020, 8:40am

Hello guys!

I have been using prodi.gy and spaCy for a while now and I must thank you for building such an awesome product! It has really made coding, debugging and annotating a lot easier and more organised than ever for NLP tasks.

I have a small issue with training NER models using the train recipe. Have a look at the screenshot below:

The training runs smoothly without any issues. However, look at the stats of this dataset. I have 1000 annotations, all of which have the answer key "accept". Yet when I start building a model, I can see that there are only 837 samples available for training. What happens to the remaining 163 samples? Are they never used in training? And if they're being dropped, why?

Thanks & Regards,
Vinayak.

ines · September 25, 2020, 7:42am

Hi! If you look at the examples in your dataset, are there any duplicate annotations, e.g. annotations on the same text? Or did your data end up with examples that have the same hashes? This would cause examples to be merged before training, so you end up with a lower number than what's in your actual dataset.
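
One way to check for this is to count how many distinct inputs the dataset actually contains. The snippet below is only a rough sketch, not Prodigy's own merging logic: it assumes the annotations have been loaded as a list of dicts (for example from a JSONL export of the dataset), that each record carries an `_input_hash` key, and it falls back to the raw `"text"` if that key is missing. The function name `summarize_duplicates` and the sample records are hypothetical.

```python
from collections import Counter


def summarize_duplicates(examples, key="_input_hash"):
    """Count total records, distinct inputs, and records sharing an input.

    Assumption: records with the same value under `key` (or the same
    "text" when the key is absent) refer to the same input.
    """
    counts = Counter(eg.get(key, eg.get("text")) for eg in examples)
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "merged_away": total - unique}


# Hypothetical records standing in for an exported dataset.
examples = [
    {"text": "Apple is a company", "_input_hash": 1, "answer": "accept"},
    {"text": "Apple is a company", "_input_hash": 1, "answer": "accept"},
    {"text": "Berlin is a city", "_input_hash": 2, "answer": "accept"},
]

summary = summarize_duplicates(examples)
print(summary)
```

If the `unique` count for the real dataset comes out close to the number of training samples reported, duplicate inputs are a likely explanation for the difference.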
