Question about example data during ner.batch-train

There are several steps here that change the total number of individual records:

  1. Merging entity spans of annotations on the same data. So if you've accepted/rejected several entities on the same text, those will be combined into one example.
  2. Splitting sentences. If you don't set --unsegmented, long texts will be split into sentences.
  3. Filtering out ignored examples with "answer": "ignore".
  4. Filtering out duplicates or otherwise invalid annotations.

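To make the count difference more concrete, here's a minimal sketch (not Prodigy's internal code) of how merging, filtering ignored answers and dropping duplicates shrink a dataset. It assumes each annotation is a dict with `"text"`, `"spans"` and `"answer"` keys, and it doesn't show sentence splitting:

```python
from collections import defaultdict

def merge_and_filter(annotations):
    # Drop ignored and exact-duplicate annotations first
    seen = set()
    kept = []
    for eg in annotations:
        if eg.get("answer") == "ignore":
            continue
        key = (eg["text"], tuple((s["start"], s["end"], s["label"]) for s in eg.get("spans", [])))
        if key in seen:
            continue
        seen.add(key)
        kept.append(eg)

    # Merge all annotations on the same text into one example
    # (the real recipe also distinguishes accepted vs. rejected spans;
    # this sketch just pools them to illustrate the count change)
    by_text = defaultdict(list)
    for eg in kept:
        by_text[eg["text"]].extend(eg.get("spans", []))
    return [{"text": text, "spans": spans} for text, spans in by_text.items()]

annotations = [
    {"text": "Apple buys U.K. startup.", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"},
    {"text": "Apple buys U.K. startup.", "spans": [{"start": 11, "end": 15, "label": "GPE"}], "answer": "accept"},
    {"text": "Some text to skip.", "spans": [], "answer": "ignore"},
]
print(len(annotations), "->", len(merge_and_filter(annotations)))  # 3 -> 1
```
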
The skipped spans, on the other hand, usually mean that the tokenization of the model doesn't match the entity offsets annotated in the data. For example, if your entity is "hello" in "hello-world", but the tokenizer doesn't produce a separate token for "hello", the entity doesn't map to valid tokens and the model can't be updated in a meaningful way.
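
If you want to check a span yourself, one quick way (assuming you're working with a spaCy model) is `doc.char_span()`, which returns `None` when character offsets don't line up with token boundaries:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("hello-world")
print([t.text for t in doc])  # the tokenizer may or may not split on the hyphen

# char_span returns None if (0, 5) doesn't map onto whole tokens
span = doc.char_span(0, 5, label="GREETING")
if span is None:
    print("Span (0, 5) doesn't align with the tokenization - it would be skipped.")
else:
    print("Span maps to tokens:", [t.text for t in span])
```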

Did you import annotations created with a different process / model / tokenization? If your data was created with different tokenization, you can always provide your own "tokens" object on the data (see the README for format details). If you set PRODIGY_LOGGING=verbose, Prodigy will also show you the spans it's skipping. In your case, only 2 spans seem to be affected, which isn't very significant given your dataset size.
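
As a rough illustration of what a pre-tokenized task could look like (please double-check the exact field names against the README; the structure below follows the usual pattern but is written from memory):

```python
# Hypothetical task with its own "tokens", so Prodigy doesn't have to
# re-tokenize and the span offsets are guaranteed to align
task = {
    "text": "hello-world",
    "tokens": [
        {"text": "hello", "start": 0, "end": 5, "id": 0},
        {"text": "-", "start": 5, "end": 6, "id": 1},
        {"text": "world", "start": 6, "end": 11, "id": 2},
    ],
    "spans": [
        {"start": 0, "end": 5, "label": "GREETING", "token_start": 0, "token_end": 0},
    ],
}
```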