Difference number examples dataset and batch-train

Hi!

I have a dataset with 4007 annotations:

Dataset 'energy_patterns'
Dataset       energy_patterns             
Annotations   4007               
Accept        517                
Reject        3423               
Ignore        67        

but when I train a model with ner.batch-train energy_patterns en_core_web_lg --output model_energy --eval-split 0.2, it only uses 309 examples for training and 50 for evaluation:

Loaded model en_core_web_lg
Using 20% of accept/reject examples (50) for evaluation
Using 100% of remaining examples (309) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  

Why is it not using all annotations in the dataset? I can't see where the difference comes from. Could you help me with that?

Thanks!

Hi! I wrote about this in some more detail here:

I'm not sure what's in your data or how many unique examples you have – you could check that by looking at how many unique input hashes there are:

from prodigy.components.db import connect

db = connect()
# Input hashes identify the underlying input text – if several
# annotations share a hash, they refer to the same example.
input_hashes = db.get_input_hashes(["energy_patterns"])
print(len(set(input_hashes)))

You could also set PRODIGY_LOGGING=basic to see if anything else is being skipped.
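For example, you could prepend the environment variable to the same training command you ran before – the exact log output will depend on your Prodigy version:

```shell
PRODIGY_LOGGING=basic prodigy ner.batch-train energy_patterns en_core_web_lg \
    --output model_energy --eval-split 0.2
```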

Hi! Yes, I have only 257 unique input hashes.
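That would explain the gap – multiple annotations on the same input collapse into fewer unique examples. A minimal sketch of that effect in plain Python, using invented hashes and answers purely for illustration:

```python
from collections import defaultdict

# Invented annotations as (input_hash, answer) pairs. Several
# annotations share an input hash, i.e. the same underlying text.
annotations = [
    (101, "accept"), (101, "reject"),
    (102, "reject"),
    (103, "accept"), (103, "reject"), (103, "ignore"),
]

# Group by input hash – repeated annotations on one input
# merge into a single entry.
by_input = defaultdict(list)
for input_hash, answer in annotations:
    by_input[input_hash].append(answer)

print(len(annotations))  # 6 annotations in total
print(len(by_input))     # but only 3 unique inputs
```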

Thanks!
