Which number of training labels should I trust?

I see this pattern

  • [ner] Training: 1738 | Evaluation: 326 (20% split)
    Training: 352 | Evaluation: 88

Why is there a discrepancy? Which one should I trust?
When I run prodigy stats, the dataset has 1738 annotations, and they are all accepted.

hi @nvasil!

Have you seen this related post?

I suspect you either have duplicates, or you have several annotations on the same text whose entity spans are being merged. In the second case, if you’ve accepted/rejected several entities on the same text, those will be combined into one example.
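
To make the merging case concrete, here is a minimal sketch in plain Python of how several annotations on the same text can collapse into a single training example. It is not Prodigy's actual implementation: the `merge_by_input` helper and the sample data are hypothetical, and the only thing assumed from Prodigy is that each annotation carries an `_input_hash` identifying its input text.

```python
from collections import defaultdict


def merge_by_input(examples):
    """Group annotations sharing an _input_hash and union their spans.

    Hypothetical helper for illustration only; not a Prodigy API.
    """
    merged = defaultdict(lambda: {"text": None, "spans": set()})
    for eg in examples:
        entry = merged[eg["_input_hash"]]
        entry["text"] = eg["text"]
        for span in eg.get("spans", []):
            # A set drops exact duplicate spans on the same text.
            entry["spans"].add((span["start"], span["end"], span["label"]))
    return dict(merged)


# Made-up annotations: three rows, but only two distinct texts.
annotations = [
    {"_input_hash": 1, "text": "Apple hired Tim Cook", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"_input_hash": 1, "text": "Apple hired Tim Cook", "answer": "accept",
     "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
    {"_input_hash": 2, "text": "Berlin is cold", "answer": "accept",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

merged = merge_by_input(annotations)
print(len(annotations), "annotations ->", len(merged), "training examples")
```

With this toy data, three accepted annotations become two examples, which is the same kind of drop you would see if many of your 1738 annotations share the same underlying text.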

Be sure to enable logging with PRODIGY_LOGGING=basic; that should show the dedup step explicitly.

The final numbers are the ones you should go with (if you're comfortable with Prodigy's default behavior of deduplicating examples, merging entities, etc.).

Let me know if this helps!