There are several steps here that change the total number of individual records:
- Merging entity spans of annotations on the same data. So if you've accepted/rejected several entities on the same text, those will be combined into one example (see the sketch after this list).
- Splitting sentences. If you don't set `--unsegmented`, long texts will be split into sentences.
- Filtering out ignored examples with `"answer": "ignore"`.
- Filtering out duplicates or otherwise invalid annotations.
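To illustrate the first point about merging, here's a simplified sketch with made-up data (it only shows a few of the task fields, not the exact internal logic):

```python
# Two separate annotations collected on the same text (hypothetical data)
annotation_1 = {
    "text": "Apple is opening a store in Berlin",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
    "answer": "accept",
}
annotation_2 = {
    "text": "Apple is opening a store in Berlin",
    "spans": [{"start": 28, "end": 34, "label": "GPE"}],
    "answer": "accept",
}

# After merging, both accepted spans end up on a single training example,
# so two records in the dataset become one example for the model
merged = {
    "text": "Apple is opening a store in Berlin",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 28, "end": 34, "label": "GPE"},
    ],
}
```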
This means that the tokenization of the model doesn't match the entity offsets annotated in the data. For example, if your entity is `"hello"` in `"hello-world"`, but the tokenizer doesn't produce a separate token for `"hello"`, the entity doesn't map to valid tokens and the model can't be updated in a meaningful way.
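Here's a small spaCy snippet to show what "doesn't map to valid tokens" means in practice. The `Doc` is constructed with a single pre-set token `"hello-world"` just to simulate a tokenizer that doesn't split on the hyphen, so the character offsets of `"hello"` don't line up with any token boundary:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Simulate a tokenizer that keeps "hello-world" as one token by
# constructing the Doc with pre-defined words
doc = Doc(nlp.vocab, words=["hello-world"])

# The annotated entity "hello" covers characters 0-5, but there's no
# token boundary at character 5, so no span can be created
span = doc.char_span(0, 5, label="GREETING")
print(span)  # None -> the entity doesn't map to valid tokens
```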
Did you import annotations created with a different process / model / tokenization? If your data was created with different tokenization, you can always provide your own `"tokens"` object on the data (see the README for format details, and the rough sketch below). If you set `PRODIGY_LOGGING=verbose`, Prodigy will also show you the spans that it's skipping. In your case, it seems to be only 2 spans, so it doesn't seem very significant given your dataset size.
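A task with pre-defined tokens could look roughly like this (field names based on the format described in the README, so double-check against your version; the text is just a toy example). Here, the tokenizer that created the data split `"hello-world"` into two tokens, so the span can reference them via `"token_start"` and `"token_end"`:

```python
example = {
    "text": "hello-world",
    "tokens": [
        {"text": "hello", "start": 0, "end": 5, "id": 0},
        {"text": "-world", "start": 5, "end": 11, "id": 1},
    ],
    "spans": [
        {"start": 0, "end": 5, "label": "GREETING", "token_start": 0, "token_end": 0},
    ],
}
```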