Number of tasks doesn't match number of items in input file

Hello, we've been running ner.manual off a JSONL file with 1046 unique entries, but Prodigy now says it has no more tasks with only 711 in the database. What might be going on? Is there some reason Prodigy might have skipped certain lines?

Hi! When you restart the server with the same data and dataset, do you get new examples? Or do you see "no tasks available"?

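If you see new tasks, one explanation could be that the app was refreshed in between: each refresh requests a new batch, and the batch that was already sent out is not sent again until the server is restarted.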

If you don't see new tasks, it means that Prodigy thinks that all unique tasks are already in the database. Or, phrased differently: the task hashes created for the incoming examples are all already in the dataset.
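To check which input lines never made it into the dataset, you could compare the input file against an export of the dataset. This is only a rough sketch: `stand_in_hash` and `missing_examples` are made-up names, the hash is a plain digest of the `"text"` field rather than Prodigy's actual task hash, and it assumes every line is a JSON object with a `"text"` key.

```python
import hashlib
import json


def stand_in_hash(example):
    """Hypothetical stand-in for a task hash: a digest of the text field."""
    return hashlib.sha1(example["text"].encode("utf8")).hexdigest()


def missing_examples(input_lines, annotated_lines):
    """Return unique input examples whose hash is not among the annotated ones."""
    annotated = {
        stand_in_hash(json.loads(line))
        for line in annotated_lines
        if line.strip()
    }
    missing = []
    seen = set()
    for line in input_lines:
        if not line.strip():
            continue  # skip blank lines in the JSONL file
        example = json.loads(line)
        key = stand_in_hash(example)
        if key not in annotated and key not in seen:
            missing.append(example)
        seen.add(key)  # duplicates in the input only count once
    return missing


input_lines = ['{"text": "a"}', '{"text": "b"}', '{"text": "a"}', '{"text": "c"}']
annotated_lines = ['{"text": "a", "answer": "accept"}']
print([eg["text"] for eg in missing_examples(input_lines, annotated_lines)])
```

If the list is empty, every unique input is already in the dataset. Otherwise the remaining examples are the ones that were sent out but never answered, or never sent at all.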

Ah, okay. We restarted and there are new tasks, so it must have been reloads that caused it. Thank you!

Followup question: when I restart Prodigy, it picks back up where it left off, but when a coworker uses the exact same command to launch it, it gives us "Total: 0". What's the reason for that?

What do you mean by "Total: 0"? The total count of annotations in the database? Maybe double-check that you're both connecting to the same database (you can set PRODIGY_LOGGING=basic to see more debugging info). The default database is a local SQLite database on disk, so your coworker might be connecting to a new database on their local machine instead of whichever database you're storing your annotations in.

That's right, the progress section in the web UI shows a total of 0 annotations in the database when my coworker launches Prodigy.

When coworker launches Prodigy:

11:43:10 - DB: Initialising database SQLite
11:43:10 - DB: Connecting to database SQLite
11:43:10 - DB: Loading dataset 'CJMinorNamesOct2019' (0 examples)

When I launch Prodigy:

11:41:09 - DB: Initialising database SQLite
11:41:09 - DB: Connecting to database SQLite
11:41:09 - DB: Loading dataset 'CJMinorNamesOct2019' (942 examples)

We are using the default database and we are launching Prodigy from the same machine. We put Prodigy in a shared location on a VM which we're RDPing into.

I figured out what was happening -- Prodigy created a separate database file in both of our personal home folders. I was able to fix that by setting PRODIGY_HOME and adding an explicit path to a database file in a shared location.
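For anyone else hitting this, here is a small sketch of why two accounts on the same machine ended up with different files. It assumes the default home directory is `.prodigy` inside each user's home folder and that the SQLite file is called `prodigy.db`. `resolve_db_path` is a made-up helper for illustration, not part of Prodigy, and the paths are placeholders.

```python
from pathlib import Path


def resolve_db_path(environ, user_home):
    """Illustrative only: where the SQLite file would live for one user.

    Assumption: PRODIGY_HOME overrides a per-user default of ~/.prodigy,
    and the database file inside it is named prodigy.db.
    """
    home = environ.get("PRODIGY_HOME") or str(Path(user_home) / ".prodigy")
    return Path(home) / "prodigy.db"


# Without PRODIGY_HOME, each account resolves to its own file.
mine = resolve_db_path({}, "/home/me")
coworker = resolve_db_path({}, "/home/coworker")
print(mine != coworker)

# With a shared PRODIGY_HOME, both accounts resolve to the same file.
shared_env = {"PRODIGY_HOME": "/srv/shared/prodigy"}
print(resolve_db_path(shared_env, "/home/me") == resolve_db_path(shared_env, "/home/coworker"))
```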

I am experiencing similar problems. I have a file with 315 examples that I'm paying someone to annotate. After 205 there are no tasks left. This has been a recurring problem that I'm trying to work around.
It seems unlikely that the annotator has refreshed the page over 100 times, so clearly something else is going on. I also have feed overlap enabled, which appears to have something to do with the issues.

I am currently using Prodigy for small to medium size tasks, and I am evaluating it for a large scale annotation platform in my company. Not being able to trust the tool to actually annotate your full dataset is a problem. Being able to break the app by refreshing the page, and this not being considered a bug, is also not reassuring.

Please address these issues.

Hi! The default batch size is 10 examples, and each request asks for a single batch – so 110 examples (assuming there are no duplicates and no existing answers in the dataset) would mean that 11 batches were sent out but didn't come back answered.

The underlying problem here is that if you have multiple annotators working on the same data and a batch is sent out, Prodigy has no way of knowing whether it's coming back or not. Maybe someone is still working on it, maybe they're offline – it's not tracking the user in the app. So by default, the server will not re-send a batch to prevent duplicate annotations.
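To make the numbers concrete, here is a toy simulation of that behaviour. It is not Prodigy's queue implementation, just a sketch assuming fixed-size batches where an unanswered batch is never sent again within the same server session. `simulate` is a made-up helper.

```python
def simulate(n_examples, batch_size, unanswered_batches):
    """Count saved annotations when some sent batches never come back."""
    examples = list(range(n_examples))
    batches = [
        examples[i:i + batch_size]
        for i in range(0, n_examples, batch_size)
    ]
    saved = []
    for index, batch in enumerate(batches):
        if index in unanswered_batches:
            continue  # sent out, never answered, and not re-sent
        saved.extend(batch)
    return len(saved), len(batches)


# 315 examples in batches of 10, with 11 full batches that never came back.
saved, total_batches = simulate(315, 10, unanswered_batches=set(range(11)))
print(saved, total_batches)
```

So it takes 11 unanswered requests, not over 100 refreshes, to end up 110 examples short.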

However, the batches are not gone or lost, and Prodigy keeps a very detailed record of the examples and annotations via the hashes it assigns. So if you restart the server, the unannotated examples are added back to the queue. Alternatively, you can also make your stream "infinite" and assume that examples that are not in the dataset yet after the first iteration should be sent out again until all hashes are in the dataset. Here's a code example that shows the idea. This works well for a finite stream that you don't necessarily need to annotate in order.
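A minimal sketch of the "infinite" stream idea in plain Python is below. The names are assumptions for illustration: `get_done_hashes` stands in for however you look up the task hashes already stored in the dataset, and `task_hash` stands in for however a hash is computed for an example. Neither is a Prodigy function. The examples need to be a list (or otherwise re-iterable), since the stream loops over them more than once.

```python
def infinite_stream(examples, get_done_hashes, task_hash):
    """Keep cycling over the examples until every hash is in the dataset."""
    while True:
        done = set(get_done_hashes())  # hashes currently in the dataset
        remaining = [eg for eg in examples if task_hash(eg) not in done]
        if not remaining:
            return  # everything has been annotated
        yield from remaining  # send out whatever is still missing


# Toy run: the "annotator" loses every fifth example the first time around.
examples = [{"text": f"doc {i}"} for i in range(25)]
saved = set()
times_seen = {}
for eg in infinite_stream(examples, lambda: saved, lambda e: e["text"]):
    key = eg["text"]
    times_seen[key] = times_seen.get(key, 0) + 1
    lost_first_time = int(key.split()[1]) % 5 == 0 and times_seen[key] == 1
    if not lost_first_time:
        saved.add(key)
print(len(saved))
```

Note that the generator only stops once every hash is saved, so if a batch is still open in someone's browser, it may be sent out a second time on the next pass.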

We'll also be adding a new feature that's a bit more complex and lets you enforce the exact ordering of batches that are sent out; see this thread for a more in-depth discussion. If this setting is enabled, the server would then always respond with the same batch until it has received the answers for it. However, the trade-off is that you can end up with duplicates if two people annotate in the same session (e.g. both accessing the app without a session name appended to the URL). So this way of handling the stream would work best if your annotators are all annotating the same data in their own separate sessions (overlapping feed) and it's important that examples go out in the exact order they're loaded in.