Dear Prodigy team,
I'm experiencing a problem with the platform. It seems as if a portion of the source is always left out. Namely, the sum in the TOTAL in the GUI shows less than the actual total number of json lines in the source.jsonl file (snippet attached - the Total column refers to my source file and the Count column refers to the TOTAL in the GUI).
Additionally, sometimes the progress bar shows infinity and sometimes a percentage out of 100%. Either way, once the message "No tasks available" shows, there's still data left unannotated in the source file which isn't read by Prodigy.
When comparing the source file to the db-out file, this inconsistency is verified. That is, the source file contains additional data, and the number in the GUI's TOTAL is the same as the number of json lines in the db-out file.
I just noticed this inconsistency but checked my previous datasets and it's the same.
An interesting thing I noticed in this regard is that the portion of discarded unannotated jsonls from the source is roughly the same for all users (snippet attached - see Remaining column). It's as if Prodigy trims a constant percentage of the data from each of the sources.
Lastly, for one of the datasets, I got back more jsonls than I had originally placed in the source (snippet attached).
Trying to understand the problem, I turned to the progress bar:
When it doesn't show infinity but a percentage of progress, it gives the wrong calculation with an overestimation. This raises my suspicion that the problem might be that Prodigy isn't actually reading all the data in the source file accurately to start with.
I am attaching a snippet of my Excel monitoring of the currently running batch. User3, for instance, has 5977 jsonls in the source file and annotated 710 thus far (the Count (#) column). While that is 11.87%, the progress bar shows 13%.
Any ideas guys?
Many thanks!
Progress bar: