[Issue] Not using full dataset for training

Hi,
I am trying to carry out NER using Prodigy.
I have my data labelled and imported into Prodigy's database.
I am using `prodigy train` to run the training.
However, the issue is that only a fraction of my data is being used for training.
See the output below (the issue is in bold):

=========================== Initializing pipeline ===========================
[2021-10-15 19:08:44,570] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
• [ner] Training: 12311 | Evaluation: 1224 (from datasets)
**Training: 2241 | Evaluation: 1224**
Labels: ner (13)
[2021-10-15 19:08:55,148] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-10-15 19:08:55,148] [INFO] Created vocabulary
[2021-10-15 19:08:55,148] [INFO] Finished initializing nlp object
[2021-10-15 19:09:02,445] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
• [ner] Training: 12311 | Evaluation: 1224 (from datasets)
**Training: 2241 | Evaluation: 1224**
Labels: ner (13)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE


As you can see, only 2241 of the 12311 samples I am providing are being used.
I would really appreciate some help with this.
Thanks!

Hi! What's in your data, and do you have multiple annotations on the same text? When the examples are merged before training, Prodigy will combine all annotations on the same text into one example to update the model with. So if you have multiple annotations, e.g. one annotation per label or duplicate annotations, you end up with fewer total examples at the end.
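To illustrate the merging behaviour, here's a minimal sketch (not Prodigy's actual implementation, just the idea) showing how several annotations on the same text collapse into a single training example. The texts and spans are made up; the field names mirror Prodigy's JSONL export format:

```python
from collections import defaultdict

# Hypothetical annotation records, one dict per accepted annotation,
# e.g. from annotating the same text once per label.
annotations = [
    {"text": "Apple hired Tim Cook.", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim Cook.", "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
    {"text": "Berlin is in Germany.", "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

# Group by text and combine the span lists: all annotations on the
# same text become one merged example to update the model with.
merged = defaultdict(list)
for ann in annotations:
    merged[ann["text"]].extend(ann["spans"])

print(f"Raw annotations: {len(annotations)}")  # 3
print(f"Merged examples: {len(merged)}")       # 2
```

So 3 raw annotations become 2 training examples here. If your 12311 annotations contain many duplicate texts (or one annotation per label), ending up with 2241 merged examples is expected. You can check by exporting the dataset with `prodigy db-out` and counting the unique `text` values.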