Which number of training labels should I trust

nvasil · November 8, 2022, 7:46am


I see this pattern

[ner] Training: 1738 | Evaluation: 326 (20% split)
Training: 352 | Evaluation: 88

Why is there a discrepancy? Which one should I trust?
When I run prodigy stats, the dataset has 1738 annotations, and they are all accepted.

ryanwesslen · November 10, 2022, 7:14pm

hi @nvasil!

Have you seen this related post?

I suspect you either have duplicates or you have merged entity spans of annotations on the same data. In the second case, if you’ve accepted/rejected several entities on the same text, those will be combined into one example.
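To make the second case concrete, here is a minimal sketch of how merging can shrink the count. It is not Prodigy's actual implementation: the `merge_by_text` helper is hypothetical, it groups on the raw text rather than on Prodigy's input hash, and it only handles accepted answers. The `text`, `spans` and `answer` keys follow the usual Prodigy task format, and the sample data is made up.

```python
from collections import defaultdict


def merge_by_text(annotations):
    """Hypothetical helper: group accepted annotations by text and union their spans."""
    merged = defaultdict(set)
    for eg in annotations:
        if eg.get("answer") != "accept":
            continue  # this sketch ignores rejected/ignored answers
        spans = merged[eg["text"]]
        for span in eg.get("spans", []):
            # A set drops exact duplicate spans on the same text.
            spans.add((span["start"], span["end"], span["label"]))
    return [
        {
            "text": text,
            "spans": [
                {"start": start, "end": end, "label": label}
                for start, end, label in sorted(spans)
            ],
        }
        for text, spans in merged.items()
    ]


# Made-up data: four annotation records over only two distinct texts.
annotations = [
    {"text": "Apple hired Tim Cook.", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim Cook.", "answer": "accept",
     "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
    {"text": "Berlin is cold.", "answer": "accept",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
    {"text": "Berlin is cold.", "answer": "accept",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

examples = merge_by_text(annotations)
print(len(annotations), "annotations ->", len(examples), "examples")
```

Here two records on the same sentence merge into one example with two spans, and the exact duplicate collapses, so four annotations become two training examples. The same effect at a larger scale would explain a drop like 1738 to 352.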

Be sure to turn on logging with PRODIGY_LOGGING=basic, which should show the dedup step explicitly.

The final count is the one you should go with (if you're comfortable with Prodigy's default behavior of deduping/merging entities, etc.).

Let me know if this helps!
