[Issue] Not using full dataset for training

Hi,
I am trying to carry out NER using Prodigy.
I have my data labelled and imported into Prodigy's database.
I am using `prodigy train` to run the training.
However, the issue is that only a fraction of my data is being used for training.
See the output below (the issue is in bold):

=========================== Initializing pipeline ===========================
[2021-10-15 19:08:44,570] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
• [ner] Training: 12311 | Evaluation: 1224 (from datasets)
**Training: 2241 | Evaluation: 1224**
Labels: ner (13)
[2021-10-15 19:08:55,148] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-10-15 19:08:55,148] [INFO] Created vocabulary
[2021-10-15 19:08:55,148] [INFO] Finished initializing nlp object
[2021-10-15 19:09:02,445] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
• [ner] Training: 12311 | Evaluation: 1224 (from datasets)
**Training: 2241 | Evaluation: 1224**
Labels: ner (13)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE


As you can see, only 2241 of the 12311 samples I am providing are being used.
I would really appreciate some help with this.
Thanks!

Hi! What's in your data, and do you have multiple annotations on the same text? When the examples are merged before training, Prodigy will combine all annotations on the same text into one example to update the model with. So if you have multiple annotations, e.g. one annotation per label or duplicate annotations, you end up with fewer total examples at the end.
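To illustrate the merging behaviour, here's a minimal sketch (not Prodigy's actual implementation, just the idea) showing how several annotations on the same text collapse into a single training example. The texts and spans are made up; the field names mirror Prodigy's JSONL export format:

```python
from collections import defaultdict

# Hypothetical annotation records, one dict per accepted annotation,
# e.g. from annotating the same text once per label.
annotations = [
    {"text": "Apple hired Tim Cook.", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim Cook.", "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
    {"text": "Berlin is in Germany.", "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

# Group by text and combine the span lists: all annotations on the
# same text become one merged example to update the model with.
merged = defaultdict(list)
for ann in annotations:
    merged[ann["text"]].extend(ann["spans"])

print(f"Raw annotations: {len(annotations)}")  # 3
print(f"Merged examples: {len(merged)}")       # 2
```

So 3 raw annotations become 2 training examples here. If your 12311 annotations contain many duplicate texts (or one annotation per label), ending up with 2241 merged examples is expected. You can check by exporting the dataset with `prodigy db-out` and counting the unique `text` values.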