Difference number examples dataset and batch-train

MBSanchez · August 28, 2019, 1:48pm

Hi!

I have a dataset with 4007 annotations:

Dataset 'energy_patterns'
Dataset       energy_patterns             
Annotations   4007               
Accept        517                
Reject        3423               
Ignore        67

but when I train a model with ner.batch-train energy_patterns en_core_web_lg --output model_energy --eval-split 0.2, it uses only 309 examples for training and 50 for evaluation:

Loaded model en_core_web_lg
Using 20% of accept/reject examples (50) for evaluation
Using 100% of remaining examples (309) for training
Dropout: 0.2  Batch size: 10  Iterations: 10

Why is it not using all annotations in the dataset? I cannot see where the difference comes from. Could you help me with that?

Thanks!

ines · August 28, 2019, 2:58pm

Hi! I wrote about this in some more detail here:

I'm not sure what's in your data or how many unique examples you have. You could check that by looking at how many unique input hashes there are:

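from prodigy.components.db import connect

# Connect to the database configured in prodigy.json
db = connect()
# Collect the input hashes of all annotations in the dataset
input_hashes = db.get_input_hashes(["energy_patterns"])
# Number of distinct input texts behind those annotations
print(len(set(input_hashes)))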

You could also set PRODIGY_LOGGING=basic to see if anything else is being skipped.
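
To illustrate why the number of annotations and the number of unique inputs can differ, here is a minimal sketch using made-up records. It assumes the usual shape of Prodigy annotations, where each record carries an "_input_hash", a "_task_hash" and an "answer"; the records themselves and the summarize helper are hypothetical and only for illustration. In a binary workflow the same text is typically shown several times with different suggestions, so several annotations can share one input hash.

```python
from collections import Counter, defaultdict

# Hypothetical toy records shaped like Prodigy annotations (not real data).
# Three of them are different questions about the same input text.
examples = [
    {"text": "Solar output rose.", "_input_hash": 101, "_task_hash": 1, "answer": "accept"},
    {"text": "Solar output rose.", "_input_hash": 101, "_task_hash": 2, "answer": "reject"},
    {"text": "Solar output rose.", "_input_hash": 101, "_task_hash": 3, "answer": "reject"},
    {"text": "Wind farms expanded.", "_input_hash": 202, "_task_hash": 4, "answer": "reject"},
    {"text": "Grid demand fell.", "_input_hash": 303, "_task_hash": 5, "answer": "ignore"},
]


def summarize(examples):
    """Count annotations, answers, and distinct inputs (illustrative helper)."""
    answers = Counter(eg["answer"] for eg in examples)
    # Group annotations by the hash of the input they were made on
    by_input = defaultdict(list)
    for eg in examples:
        by_input[eg["_input_hash"]].append(eg)
    return {
        "annotations": len(examples),
        "answers": dict(answers),
        "unique_inputs": len(by_input),
        "max_per_input": max(len(group) for group in by_input.values()),
    }


summary = summarize(examples)
print(summary)
```

To run the same summary on real data, the records could be loaded with db.get_dataset("energy_patterns") instead of the toy list.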

MBSanchez · August 28, 2019, 3:51pm

Hi! Yes, I have only 257 unique input hashes.

Thanks!
