Missing data: TOTAL and SOURCE are inconsistent

Hi @NNN!

Thanks for the question!

Could there be any duplicates in your data?

The total count and source will be inconsistent if there are duplicates (e.g., identical text records).

For example, this file has 200 records:
nyt_text.jsonl (21.0 KB)

But if you run:

python3 -m prodigy ner.manual nyt_data blank:en nyt_text.jsonl --label ORG

After you finish labeling (i.e., you hit "No tasks available"), the final count will be only 178, because 22 of the records are duplicates:

(As a note: in the image above, you may notice "No tasks available" while the progress is still at 90%. That's because some records are still in the browser (client) and haven't been saved to the DB yet. In the example above, if you click the Save button, the progress will update to 100%. As mentioned in the link below, the percentages only update on the server side, i.e., when new answers are sent to the DB, not in real time.)

If you review the data, you'll find the 22 duplicates (the i value is in each record's meta field):

  • i=129 (1 record)
  • i=140-150 (batch of 10)
  • i=160-170 (batch of 10)
  • i=183 (1 record)

One way to check manually whether your file has duplicates:

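from prodigy.components.loaders import get_stream

file_path = "nyt_text.jsonl"
# rehash=True assigns fresh hashes; dedup=True drops duplicate records
stream = get_stream(file_path, rehash=True, dedup=True, input_key="text")

# number of records left after deduplication
len(list(stream))
# 178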
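
If you'd rather not go through Prodigy's loader, here's a minimal standard-library sketch of a similar check, using an exact match on the "text" value. It only approximates what Prodigy does (Prodigy compares hashes, not raw strings), and find_duplicate_texts and the sample lines are made up for illustration:

```python
import json
from collections import defaultdict


def find_duplicate_texts(lines, key="text"):
    """Map each repeated value of `key` to the 0-based line positions where it occurs."""
    positions = defaultdict(list)
    for pos, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        positions[record[key]].append(pos)
    # Keep only the values that occur more than once.
    return {text: idxs for text, idxs in positions.items() if len(idxs) > 1}


# Hypothetical sample data standing in for the lines of a JSONL file.
sample = [
    '{"text": "a", "meta": {"i": 0}}',
    '{"text": "b", "meta": {"i": 1}}',
    '{"text": "a", "meta": {"i": 2}}',
]
duplicates = find_duplicate_texts(sample)
# Number of records a dedup step would drop (every repeat after the first).
extra = sum(len(idxs) - 1 for idxs in duplicates.values())
print(duplicates, extra)
```

To run it on a real file, pass an open file handle instead of the sample list, e.g. find_duplicate_texts(open("nyt_text.jsonl", encoding="utf-8")).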

By default, Prodigy will dedup by task_hash.

From your screenshot, it looks like you're not defining a unique session (i.e., you're not using multi-user sessions). In that case, the task_hash acts like the input_hash, since you only have one unique task (annotator/task) per record. Can you confirm that you had only one annotator and were saving to a unique dataset for each user/round?
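
To make the input_hash vs. task_hash distinction a bit more concrete, here's a toy sketch. This is not Prodigy's actual hashing code: toy_input_hash, toy_task_hash and dedup_by are hypothetical helpers that only mimic the idea that the input hash depends on the input (here, the text), while the task hash also depends on what's being asked about it (here, a label):

```python
import hashlib
import json


def toy_input_hash(record):
    # Depends only on the raw input (here, the "text" value).
    return hashlib.sha1(record["text"].encode("utf-8")).hexdigest()


def toy_task_hash(record):
    # Depends on the input hash plus the task-specific part (here, "label").
    payload = json.dumps(
        {"input": toy_input_hash(record), "label": record.get("label")},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def dedup_by(records, hash_fn):
    # Keep the first record seen for each hash value.
    seen = set()
    kept = []
    for record in records:
        h = hash_fn(record)
        if h not in seen:
            seen.add(h)
            kept.append(record)
    return kept


# Same question about the same text: both hashes dedup identically.
same_task = [
    {"text": "Apple hires new CEO", "label": "ORG"},
    {"text": "Apple hires new CEO", "label": "ORG"},
    {"text": "Markets fall", "label": "ORG"},
]
by_task = len(dedup_by(same_task, toy_task_hash))
by_input = len(dedup_by(same_task, toy_input_hash))

# Different questions about the same text: only the input hash collides.
different_task = [
    {"text": "Apple hires new CEO", "label": "ORG"},
    {"text": "Apple hires new CEO", "label": "PERSON"},
]
mixed_by_task = len(dedup_by(different_task, toy_task_hash))
mixed_by_input = len(dedup_by(different_task, toy_input_hash))
```

With a single label and a single annotator, as in the first list, deduplicating by either hash leaves the same number of records.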

As another check, you can also run through this generated dataset of 100 unique records, where each "text" is the record number (0 to 99).

sample_dedup.jsonl (1.4 KB)
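
If you want to build a similar duplicate-free file yourself, something like the sketch below should work. The exact layout of the attached file isn't shown here, so the {"text": "<number>"} shape is an assumption, and make_unique_records and write_jsonl are hypothetical helper names:

```python
import json


def make_unique_records(n):
    # One JSON line per record; the "text" value is the record number as a string.
    return [json.dumps({"text": str(i)}) for i in range(n)]


def write_jsonl(path, lines):
    # Write one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


lines = make_unique_records(100)
# Every "text" value is distinct, so there is nothing to deduplicate.
unique_texts = {json.loads(line)["text"] for line in lines}
print(len(lines), len(unique_texts))
```

Calling write_jsonl("my_unique.jsonl", lines) (the file name is just a placeholder) gives you a file you can load with the same ner.manual command. Since it has no duplicates, you'd expect the final count to match the number of records.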

Be cautious when relying on the progress bar. For context, here's a good post that explains the progress bar and how it's calculated:

Hope this helps!