Missing data: TOTAL and SOURCE are inconsistent

Dear Prodigy team,

I'm experiencing a problem with the platform. It seems as if a portion of the source is always left out. Namely, the TOTAL shown in the GUI is less than the actual number of JSON lines in the source.jsonl file (snippet attached: the Total column refers to my source file and the Count column refers to the TOTAL in the GUI).

Additionally, sometimes the progress bar shows infinity and sometimes a percentage out of 100%. Either way, once the "No tasks available" message appears, there is still unannotated data in the source file that Prodigy never served.

Comparing the source file to the db-out file confirms this inconsistency: the source file contains additional data, and the TOTAL in the GUI matches the number of JSON lines in the db-out file.

I only just noticed this inconsistency, but I checked my previous datasets and they show the same pattern.

An interesting thing I noticed in this regard is that the proportion of discarded, unannotated JSON lines from the source is roughly the same for all users (snippet attached; see the Remaining column). It's as if Prodigy trims a constant percentage of the data from each of the sources.

Lastly, for one of the datasets, I got back more JSON lines than I had originally put in the source :man_shrugging:t2: (snippet attached).

Trying to understand the problem, I turned to the progress bar:
When it shows a percentage rather than infinity, the calculation is wrong, overestimating progress. This strengthens my suspicion that Prodigy isn't actually reading all the data in the source file to begin with.

I am attaching a snippet of my Excel monitoring of the currently running batch. User3, for instance, has 5977 JSON lines in the source file and has annotated 710 so far (the Count (#) column). That works out to 11.87%, yet the progress bar shows 13%.

Any ideas guys? :slight_smile:
Many thanks! :pray:t3:


[image: Excel monitoring snippet]

Progress bar:
[images: progress bar screenshots]

hi @NNN!

Thanks for the question!

Could there be any duplicates in your data?

The total count and source will be inconsistent if there are duplicates (e.g., identical text records).
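
Concretely, two JSONL lines like these (made up for illustration) would be collapsed into a single task, since their "text" values are identical:

{"text": "Apple is looking at buying a U.K. startup."}
{"text": "Apple is looking at buying a U.K. startup."}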

For example, this file has 200 records:
nyt_text.jsonl (21.0 KB)

But if you run:

python3 -m prodigy ner.manual nyt_data blank:en nyt_text.jsonl --label ORG

After you finish labeling (i.e., once you hit "No tasks available"), the final count will be only 178, due to 22 duplicates:

[image: screenshot showing "No tasks available" with progress at 90%]

(As a note: in the image above, "No tasks available" appears while the progress is still at 90% because there are records in the browser (client) that haven't been saved to the DB yet. In that situation, clicking the Save button updates the progress to 100%. As mentioned in the post linked below, percentages only update from the server side, i.e., when new answers are sent to the DB, not in real time.)
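
To make that server-side behavior concrete, here's a minimal sketch (not Prodigy's actual implementation; the function and variable names are made up) of why the displayed progress lags the browser:

def server_side_progress(answers_saved_to_db: int, total_estimate: int) -> float:
    # Hypothetical illustration: progress counts only answers already
    # saved to the database, not answers still sitting in the browser.
    return answers_saved_to_db / total_estimate

# 160 of 178 answers saved; the last batch is still in the browser:
print(f"{server_side_progress(160, 178):.0%}")  # 90%
# After clicking Save, all 178 answers reach the DB:
print(f"{server_side_progress(178, 178):.0%}")  # 100%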

But if you review the data, you'll find 22 duplicates (the i index is in each record's meta field):

  • i=129 (1 record)
  • i=140-150 (batch of 10)
  • i=160-170 (batch of 10)
  • i=183 (1 record)

One way to check manually whether your file has duplicates is to load it through Prodigy's stream loader with deduplication turned on:

from prodigy.components.loaders import get_stream

file_path = "nyt_text.jsonl"
# rehash records and drop duplicates based on the "text" input key
stream = get_stream(file_path, rehash=True, dedup=True, input_key="text")

print(len(list(stream)))
# 178
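
If you'd rather see which texts are duplicated (not just the deduplicated count), a small pure-Python check works too; this is a sketch that doesn't use Prodigy's loaders at all:

import json
from collections import Counter

with open("nyt_text.jsonl", encoding="utf8") as f:
    texts = [json.loads(line)["text"] for line in f if line.strip()]

counts = Counter(texts)
duplicates = {text: n for text, n in counts.items() if n > 1}
# Number of surplus records, e.g. 200 - 178 = 22 for the file above
print(len(texts) - len(counts))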

By default, Prodigy will dedup by task_hash.
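
You can inspect the hashing directly with prodigy.set_hashes: records with identical input text get the same _input_hash, and with a single session the _task_hash matches too, which is why the duplicate is dropped. A minimal sketch:

from prodigy import set_hashes

a = set_hashes({"text": "Apple buys a U.K. startup."})
b = set_hashes({"text": "Apple buys a U.K. startup."})
print(a["_input_hash"] == b["_input_hash"])  # True: same input text
print(a["_task_hash"] == b["_task_hash"])    # True: duplicate task, so it's deduped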

From your screenshot, it looks like you're not defining unique sessions (i.e., not using multi-user sessions). In that case, the task_hash acts like the input_hash, since there's only one unique annotator/task combination per record. Can you confirm that you had only one annotator and were saving to a unique dataset for each user/round?

As another check, you can also run through this generated dataset of 100 unique records, where each "text" is the record number (from 0 to 99).

sample_dedup.jsonl (1.4 KB)
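
If you want to generate an equivalent file yourself, here's a minimal sketch (assuming the same structure, where each "text" is just the record number):

import json

# Write 100 unique records whose "text" is the record number (0-99)
with open("sample_dedup.jsonl", "w", encoding="utf8") as f:
    for i in range(100):
        f.write(json.dumps({"text": str(i)}) + "\n")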

Be cautious when interpreting the progress bar. For context, here's a good post that explains the progress bar and how it's calculated:

Hope this helps!