Question about example data during ner.batch-train

Hello,

When running ner.batch-train, I’ve noticed several log lines reporting different example counts. For example:

 MODEL: Merging entity spans of 10420 examples
04:12:11 - MODEL: Using 10186 examples (without 'ignore')
04:12:19 - RECIPE: Temporarily disabled other pipes: ['tagger', 'parser']
04:12:19 - RECIPE: Initialised EntityRecognizer with model /prodigy/models/base_model
04:12:19 - PREPROCESS: Splitting sentences
04:13:15 - PREPROCESS: Skipping mismatched tokens
04:13:15 - PREPROCESS: Skipping mismatched tokens
04:18:00 - PREPROCESS: Splitting sentences
04:19:25 - MODEL: Merging entity spans of 11141 examples
04:19:28 - MODEL: Using 11141 examples (without 'ignore')
04:19:38 - MODEL: Evaluated 10207 examples
04:19:38 - RECIPE: Calculated baseline from evaluation examples (accuracy 0.00)

  0%|          | 0/43982 [00:00<?, ?it/s]
...

My question is: how/why does the count go from 10,420 examples to 10,186, then 11,141, then 10,207, and finally to 43,982 once the training iterations start?

Likewise, what does “Skipping mismatched tokens” mean?

Thanks!

There are several steps here that change the total number of individual records:

  1. Merging entity spans of annotations on the same data. So if you've accepted/rejected several entities on the same text, those will be combined into one example (see the sketch after this list).
  2. Splitting sentences. If you don't set --unsegmented, long texts will be split into sentences.
  3. Filtering out ignored examples with "answer": "ignore".
  4. Filtering out duplicates or otherwise invalid annotations.
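
To make steps 1 and 3 concrete, here's a minimal sketch, assuming Prodigy-style annotation dicts with "text", "spans" and "answer" keys. The helper names are hypothetical, not part of Prodigy's API, and it's simplified: the real recipe also has to handle rejected spans as negative examples.

    from collections import defaultdict

    def merge_spans(examples):
        # Group annotations by text and combine their accepted spans
        # into a single example per text.
        by_text = defaultdict(list)
        for eg in examples:
            by_text[eg["text"]].append(eg)
        merged = []
        for text, group in by_text.items():
            spans = [span for eg in group if eg["answer"] == "accept"
                     for span in eg.get("spans", [])]
            answers = {eg["answer"] for eg in group}
            answer = "accept" if "accept" in answers else group[0]["answer"]
            merged.append({"text": text, "spans": spans, "answer": answer})
        return merged

    def drop_ignored(examples):
        # Filter out examples the annotator marked as "ignore".
        return [eg for eg in examples if eg["answer"] != "ignore"]

    examples = [
        {"text": "I like Berlin", "spans": [{"start": 7, "end": 13, "label": "GPE"}], "answer": "accept"},
        {"text": "I like Berlin", "spans": [{"start": 0, "end": 1, "label": "PERSON"}], "answer": "reject"},
        {"text": "asdf qwer", "spans": [], "answer": "ignore"},
    ]
    merged = merge_spans(examples)  # 2 examples: the two "Berlin" annotations are combined
    usable = drop_ignored(merged)   # 1 example: the ignored text is dropped
    print(len(examples), len(merged), len(usable))  # 3 2 1

So each step can shrink (merging, filtering) or grow (sentence splitting) the count, which is why the numbers in your log keep changing.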

As for "Skipping mismatched tokens": this means that the tokenization of the model doesn't match the entity offsets annotated in the data. For example, if your entity is "hello" in "hello-world" but the tokenizer doesn't produce a separate token for "hello", the entity doesn't map to valid tokens and the model can't be updated in a meaningful way.
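
You can reproduce the same alignment check with spaCy directly (not exactly what Prodigy does internally, but the same idea): doc.char_span() returns None when character offsets don't line up with token boundaries.

    import spacy

    nlp = spacy.blank("en")
    doc = nlp("helloworld")
    print([token.text for token in doc])       # ['helloworld'] – a single token
    print(doc.char_span(0, 5, label="WORD"))   # None: "hello" isn't a whole token
    print(doc.char_span(0, 10, label="WORD"))  # helloworld – offsets match a token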

Did you import annotations created with a different process / model / tokenization? If your data was created with different tokenization, you can always provide your own "tokens" object on the data (see the README for format details, and the sketch below). If you set PRODIGY_LOGGING=verbose, Prodigy will also show you the spans it's skipping. In your case it seems to be only 2, which isn't significant given your dataset size.
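
For illustration, a pre-tokenized task could look something like this. This is a sketch of the idea, not the authoritative spec – check the README for the exact fields, as this assumes "text", "start", "end" and "id" per token and "token_start" / "token_end" on the spans:

    # Hypothetical task dict: the "tokens" list pins down the tokenization,
    # so the span offsets are guaranteed to map onto whole tokens.
    task = {
        "text": "hello-world",
        "tokens": [
            {"text": "hello", "start": 0, "end": 5, "id": 0},
            {"text": "-", "start": 5, "end": 6, "id": 1},
            {"text": "world", "start": 6, "end": 11, "id": 2},
        ],
        "spans": [
            {"start": 0, "end": 5, "label": "GREETING",
             "token_start": 0, "token_end": 0},
        ],
    }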

Great, that’s super helpful. Thanks!