Missing data when

Hi,

When using sessions for a multi-user project, I notice that if a user leaves their session and then comes back (for example, by closing the page and re-opening it a few minutes later), Prodigy seems to skip a few examples. As I understand it, Prodigy works in batches of data. So when I leave the session, Prodigy drops the current batch, and when I reconnect later, it fetches a new batch, whether the old one is finished or not.

Then I tried setting the batch_size parameter to 1, but it didn't work. I had a total of 10 examples. I left the session once and came back to finish, and I ended up with 9 annotations. To get the missing example, I had to restart the workflow.

Do you have an idea how to fix this, so that people can "log out" whenever they want and pick up where they left off when they come back?

Thanks in advance :smile:

Hi! You can use the "force_stream_order": true setting in your prodigy.json to make the stream preserve the order of batches and examples and re-send the current batch until it has been annotated. So if a user closes the browser and then reopens the app later in the same session, they'll start again with the most recent unannotated example. (Otherwise, that example would be queued up again, but only after you restart the server.)

The only scenario where this wouldn't work is if you're using an active learning-powered recipe with a "dynamic" stream, or if you have multiple people accessing the same session (because then, they'd all receive the same batch and you end up with duplicates).

If you want to send examples back immediately as they're annotated, you can set "instant_submit": true in your prodigy.json. This can be helpful if you want your stream to be more responsive to the latest answer (e.g. to decide which follow-up examples to send out next). But it also means that there's no option to undo, because the answer is sent back immediately.
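As a rough sketch, the two settings mentioned above could be merged into an existing prodigy.json like this. The helper names `merge_settings` and `write_config` are hypothetical and not part of Prodigy; only the keys `force_stream_order` and `instant_submit` come from this thread, and the file path is whatever your setup uses.

```python
import json
from pathlib import Path


def merge_settings(existing: dict, overrides: dict) -> dict:
    """Return a copy of `existing` with `overrides` applied on top."""
    merged = dict(existing)
    merged.update(overrides)
    return merged


# Settings discussed in this thread. Pick the values that fit your setup.
overrides = {
    "force_stream_order": True,  # re-send the current batch until it is annotated
    "instant_submit": False,     # True sends each answer immediately (no undo)
}


def write_config(path: Path, overrides: dict) -> dict:
    """Hypothetical helper: update a prodigy.json file in place."""
    existing = {}
    if path.exists():
        existing = json.loads(path.read_text(encoding="utf8"))
    merged = merge_settings(existing, overrides)
    path.write_text(json.dumps(merged, indent=2), encoding="utf8")
    return merged
```

Calling `write_config(Path("prodigy.json"), overrides)` keeps any settings already in the file and only changes the two keys above.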

Great, it worked! Thank you!

Hi,

I am experiencing an issue where the session completes with "No tasks available", but when I export the dataset with db-out, many examples are missing.

I set my prodigy.json with force_stream_order: false, feed_overlap: false.

I am using the ner.correct recipe with a nightly 1.11.0a8 version. My dataset is small.

I observed that this issue happens whether I used a multi-user session or not (main session).

I read your answer above, but I am not sure if setting force_stream_order to true would be a solution, since I am using ner.correct. Is it a dynamic recipe?

If force_stream_order is set to false and I keep refreshing the browser without saving any tasks, will it exhaust my small dataset and then show "No tasks available"?

After I see "No tasks available", even if I restart the server, I don't see the remaining, missing examples. It still shows "No tasks available".

By "missed", do you mean that these examples weren't presented to you for annotation? Is it possible that your data has duplicates (which can easily happen if you annotate by sentence)? If the same text occurs more than once in the stream, Prodigy will skip it, so you're not labelling the same text twice. Examples that are already in the dataset will be skipped as well.
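A quick way to check the duplicate theory is to count repeated texts in the input yourself. The sketch below only compares whitespace-stripped text, so it is an approximation of Prodigy's own filtering and not a copy of it; the function name `find_duplicate_texts` is made up for illustration.

```python
from collections import Counter


def find_duplicate_texts(texts):
    """Return a dict mapping each text that occurs more than once to its count."""
    # Ignore empty lines and surrounding whitespace when comparing.
    counts = Counter(text.strip() for text in texts if text.strip())
    return {text: n for text, n in counts.items() if n > 1}
```

For a plain-text input with one example per line, you could pass it the open file object, e.g. `find_duplicate_texts(open("input.txt", encoding="utf8"))`, where `input.txt` stands in for your own file.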

This thread here is slightly older and ner.correct now forces the stream order by default. So you shouldn't have to change anything. I don't think what you describe is related to that, or to sessions.

(Btw, it's not very helpful for us if you post the same question in multiple threads like this one, because it makes it harder to keep track of things and answer everyone.)

Thanks Ines for your support. My bad for the duplicate question. I will be mindful in future.

By missing, I meant the examples really went missing. My input file has over 100 examples but the output from db-out including ignored examples has only 65 examples.

I am sure there were not that many duplicates since each task corresponds to a paragraph and I verified that it is not the case. In addition, I use -U flag for unsegmented and I label the entity for an entire paragraph.
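To see exactly which paragraphs never made it into the dataset, one option is to compare the input lines against the JSONL export from db-out. This is a minimal sketch that assumes each exported record has a "text" key whose value matches the input line after stripping whitespace; the function name `missing_texts` is hypothetical.

```python
import json


def missing_texts(input_lines, exported_jsonl_lines):
    """Return input texts with no matching "text" in the exported annotations."""
    exported = set()
    for line in exported_jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        exported.add(record.get("text", "").strip())
    # Keep the input order so the missing examples are easy to locate.
    return [
        text.strip()
        for text in input_lines
        if text.strip() and text.strip() not in exported
    ]
```

Looking at the texts this returns (very long paragraphs, unusual characters, near-empty lines) may give a hint about why they were skipped.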

If you're working with paragraphs, then duplicates are definitely less likely, yeah. (It just sometimes comes up with sentences because some sentences are surprisingly common and you may have multiple instances of "Please see attached." or something like that, depending on the data).

So just to confirm, when you re-start the server, you see "No tasks available"? This would indicate that all examples in the stream that are not skipped for some reason have already been annotated. If so, what type of input file are you using, and is it possible that it has invalid entries, like invalid JSON, or examples without a "text"? If you run Prodigy with the environment variable PRODIGY_LOGGING=basic, it should show you a log statement if examples were skipped.
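If the input is a JSONL file, a small pre-check along these lines can flag the kinds of entries mentioned above before starting the server. It is only a sketch: the function name `find_invalid_lines` is made up, and which of these cases Prodigy actually skips is an assumption to confirm against the log output.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for lines that look unusable."""
    problems = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((i, "empty line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((i, "invalid JSON"))
            continue
        # A usable example is assumed to be an object with a non-empty "text".
        if not isinstance(record, dict) or not record.get("text"):
            problems.append((i, 'missing or empty "text"'))
    return problems
```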

Yes, I should enable the logging first and check for skipped examples.

My initial input format was a .txt file with each line corresponding to a paragraph. I was also wondering about illegal characters.

Could it also be related to the limit of the history stack? I reduced it to 5, assuming it would automatically save the oldest tasks in my history.

Prodigy will save automatically in the background once a full batch of annotations (of size batch_size) is ready, in addition to the examples shown in the history, which are kept there to allow you to hit "undo". Before you stop the annotation session, you should hit save to ensure that all unsaved examples are sent back to the server.

But even if you did forget to save at the end, Prodigy should queue up the examples again when you re-start the server, because they're not in the dataset yet. So the problem you describe sounds different.