Question about example data during ner.batch-train

kbarresi · July 26, 2019, 3:32pm

Hello,

When running ner.batch-train, I’ve noticed that there are several outputs regarding example data. For example:

 MODEL: Merging entity spans of 10420 examples
04:12:11 - MODEL: Using 10186 examples (without 'ignore')
04:12:19 - RECIPE: Temporarily disabled other pipes: ['tagger', 'parser']
04:12:19 - RECIPE: Initialised EntityRecognizer with model /prodigy/models/base_model
04:12:19 - PREPROCESS: Splitting sentences
04:13:15 - PREPROCESS: Skipping mismatched tokens
04:13:15 - PREPROCESS: Skipping mismatched tokens
04:18:00 - PREPROCESS: Splitting sentences
04:19:25 - MODEL: Merging entity spans of 11141 examples
04:19:28 - MODEL: Using 11141 examples (without 'ignore')
04:19:38 - MODEL: Evaluated 10207 examples
04:19:38 - RECIPE: Calculated baseline from evaluation examples (accuracy 0.00)

  0%|          | 0/43982 [00:00<?, ?it/s]
...

My question is how/why does it go from 10,420 examples, to 10,186, to 11,141, to 10,207, and then to 43,982 during the training iterations itself?

Likewise, what does “Skipping mismatched tokens” mean?

Thanks!

ines · July 28, 2019, 4:09pm

There are several steps here that change the total number of individual records:

- Merging entity spans of annotations on the same data. So if you've accepted/rejected several entities on the same text, those will be combined into one example.
- Splitting sentences. If you don't set --unsegmented, long texts will be split into sentences.
- Filtering out ignored examples with "answer": "ignore".
- Filtering out duplicates or otherwise invalid annotations.
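
To make the changing counts more concrete, here's a rough sketch in plain Python. It is not Prodigy's actual implementation: the helper names (merge_spans, drop_ignored, split_sentences), the merge key and the naive regex sentence splitter are all assumptions made for illustration. It only shows how merging and filtering shrink the number of records while sentence splitting grows it.

```python
import re

def merge_spans(examples):
    """Combine annotations made on the same text with the same answer (hypothetical helper)."""
    merged = {}
    for eg in examples:
        key = (eg["text"], eg["answer"])
        if key not in merged:
            merged[key] = {"text": eg["text"], "answer": eg["answer"], "spans": []}
        merged[key]["spans"].extend(eg.get("spans", []))
    return list(merged.values())

def drop_ignored(examples):
    """Remove records answered with "ignore" (hypothetical helper)."""
    return [eg for eg in examples if eg["answer"] != "ignore"]

def split_sentences(examples):
    """Naive split on sentence-final punctuation; span offsets are not remapped in this sketch."""
    out = []
    for eg in examples:
        for sent in re.split(r"(?<=[.!?])\s+", eg["text"].strip()):
            if sent:
                out.append({"text": sent, "answer": eg["answer"]})
    return out

# Made-up toy data: two annotations on the same text, one ignored, one rejected.
examples = [
    {"text": "Apple hired Ann. She starts Monday.", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Ann. She starts Monday.", "answer": "accept",
     "spans": [{"start": 12, "end": 15, "label": "PERSON"}]},
    {"text": "Unclear text here.", "answer": "ignore", "spans": []},
    {"text": "Bob lives in Rome.", "answer": "reject",
     "spans": [{"start": 0, "end": 3, "label": "ORG"}]},
]

merged = merge_spans(examples)      # same-text annotations collapse into one record
kept = drop_ignored(merged)         # ignored records are removed
sentences = split_sentences(kept)   # multi-sentence texts become several records
counts = (len(examples), len(merged), len(kept), len(sentences))
print(counts)
```

The count goes down after merging and filtering and up again after splitting, which is the same kind of movement you see in the log output.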

“Skipping mismatched tokens” means that the tokenization of the model doesn't match the entity offsets annotated in the data. For example, if your entity is "hello" in "hello-world", but the tokenizer doesn't produce a separate token for "hello", the entity doesn't map to valid tokens and the model can't be updated in a meaningful way.
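
Here's a minimal sketch of that alignment check, assuming a simple whitespace tokenizer as a stand-in for the model's real one. The function names are made up for this example and aren't part of Prodigy or spaCy.

```python
import re

def tokenize(text):
    """Whitespace tokenizer returning (start, end) character offsets (a stand-in, not spaCy's tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def span_is_aligned(text, start, end):
    """A span is usable only if it starts at a token start and ends at a token end."""
    tokens = tokenize(text)
    starts = {s for s, _ in tokens}
    ends = {e for _, e in tokens}
    return start in starts and end in ends

text = "say hello-world now"
aligned_whole = span_is_aligned(text, 4, 15)  # "hello-world" is exactly one token
aligned_part = span_is_aligned(text, 4, 9)    # "hello" ends in the middle of a token
print(aligned_whole, aligned_part)
```

With this tokenizer, "hello-world" stays a single token, so a span covering only "hello" has no valid token boundary at its end and would be skipped.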

Did you import annotations created with a different process / model / tokenization? If your data was created with different tokenization, you can always provide your own "tokens" object on the data (see the README for format details). If you set PRODIGY_LOGGING=verbose, Prodigy will also show you the spans that it's skipping. In your case, it seems to be only 2, so it doesn't seem very significant given your dataset size.
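
As a rough illustration of providing your own "tokens", the sketch below builds a record with a token list from a simple regex tokenizer. The tokenizer, the add_tokens helper and the GREETING label are made up for this example, and the token keys used here ("text", "start", "end", "id") are how I'd expect the format to look, so please check them against the README before relying on this.

```python
import re

def add_tokens(record):
    """Attach a "tokens" list to a record using a toy tokenizer (words and single punctuation marks)."""
    tokens = []
    for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", record["text"])):
        tokens.append({"text": m.group(), "start": m.start(), "end": m.end(), "id": i})
    return {**record, "tokens": tokens}

# Hypothetical record: the entity "hello" inside "hello-world".
record = {
    "text": "hello-world",
    "spans": [{"start": 0, "end": 5, "label": "GREETING"}],
}
with_tokens = add_tokens(record)
token_texts = [t["text"] for t in with_tokens["tokens"]]
print(token_texts)
```

Because this tokenizer splits on the hyphen, "hello" gets its own token and the span's offsets line up with token boundaries.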

kbarresi · July 29, 2019, 12:55pm

Great, that’s super helpful. Thanks!
