Missing data when

Kairine · March 18, 2020, 1:51pm

Hi,

When using sessions for a multi-user project, I notice that if a user leaves their session and then comes back (for example, by closing the page and re-opening it a few minutes later), Prodigy kind of "skips" a few examples. As I understand it, Prodigy works in batches of data. So when I leave the session, Prodigy drops the current batch, and when I reconnect later, it fetches a new batch, whether the old one is finished or not.

Then I tried fixing the batch-size parameter to 1, but it didn't work. I had a total of 10 examples. I left the session once and came back to finish, and I ended up with 9 annotations. To get this missing example I had to restart the workflow.

Do you have an idea how to fix this, so that people can "log out" whenever they want and pick up where they left off when they come back?

Thanks in advance

ines · March 18, 2020, 2:31pm

Hi! You can use the "force_stream_order": true setting in your prodigy.json to make the stream preserve the order of batches and examples and re-send the current batch until it has been annotated. So if a user closes the browser and then reopens the app later in the same session, they'll start again with the most recent unannotated example. (Otherwise, that example would be queued up again, but only after you restart the server.)

The only scenario where this wouldn't work is if you're using an active learning-powered recipe with a "dynamic" stream, or if you have multiple people accessing the same session (because then, they'd all receive the same batch and you end up with duplicates).

If you want to send examples back immediately as they're annotated, you can set "instant_submit": true in your prodigy.json. This can be helpful if you want your stream to be more responsive to the latest answer (e.g. to decide which follow-up examples to send out next). But it also means that there's no option to undo, because the answer is sent back immediately.
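
For anyone who prefers to script the change, here is a minimal sketch of merging one of the settings mentioned above into an existing config file. The helper name update_settings and the file location are assumptions made for illustration; only the setting names "force_stream_order" and "instant_submit" come from this thread.

```python
import json
from pathlib import Path


def update_settings(config_path, **settings):
    """Merge the given settings into a JSON config file and return the result.

    Hypothetical helper: not part of Prodigy, just plain JSON handling.
    """
    path = Path(config_path)
    # Start from the existing config (if any) so other settings are preserved.
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config.update(settings)
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config
```

Calling update_settings("prodigy.json", force_stream_order=True) would then write the setting to a prodigy.json in the working directory, assuming that is the config file your Prodigy installation reads.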

Kairine · March 18, 2020, 2:51pm

Great, it worked! Thank you!

rindranirina · May 17, 2021, 8:38pm

Hi,

I am experiencing an issue where the session completes with "No tasks available", but when I export the dataset with db-out, many examples are missing.

I set my prodigy.json with force_stream_order: false, feed_overlap: false.

I am using the ner.correct recipe with a nightly 1.11.0a8 version. My dataset is small.

I observed that this issue happens whether I used a multi-user session or not (main session).

I read your answer above, but I am not sure whether setting force_stream_order to true would be a solution, since I am using ner.correct. Is it a dynamic recipe?

If force_stream_order is set to false and I keep refreshing the browser without saving any tasks, will it exhaust my small dataset and then show "No tasks available"?

After I see No tasks available, even if I restart the server, I don't see the remaining and missing examples. It still shows No tasks available.

ines · May 18, 2021, 4:52am

By "missed", do you mean that these examples weren't presented to you for annotation? Is it possible that your data has duplicates (which can easily happen if you annotate by sentence)? If the same text occurs more than once in the stream, Prodigy will skip it, so you're not labelling the same text twice. Examples that are already in the dataset will be skipped as well.
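
A quick way to check the duplicate hypothesis is to count repeated texts in the input before loading it. This is only a rough sketch: it compares texts as exact strings after stripping surrounding whitespace, which is an assumption and may not match how Prodigy itself decides that two examples are the same. The function name repeated_texts is made up for illustration.

```python
from collections import Counter


def repeated_texts(texts):
    """Return {text: count} for every text that occurs more than once."""
    # Assumption: exact string match after stripping surrounding whitespace.
    counts = Counter(text.strip() for text in texts)
    return {text: n for text, n in counts.items() if n > 1}
```

For a plain-text input with one example per line, passing the open file object (or a list of its lines) is enough; an empty result means no exact duplicates were found.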

This thread here is slightly older and ner.correct now forces the stream order by default. So you shouldn't have to change anything. I don't think what you describe is related to that, or to sessions.

(Btw, it's not very helpful for us if you post the same question in multiple threads like this one, because it makes it harder to keep track of things and answer everyone.)

rindranirina · May 18, 2021, 6:06am

Thanks Ines for your support. My bad for the duplicate question. I will be mindful in future.

By missing, I meant the examples really went missing. My input file has over 100 examples but the output from db-out including ignored examples has only 65 examples.

I am sure there were not that many duplicates since each task corresponds to a paragraph and I verified that it is not the case. In addition, I use -U flag for unsegmented and I label the entity for an entire paragraph.
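
To see exactly which paragraphs never made it into the dataset, the input can be compared against the db-out export. The sketch below assumes a plain-text input with one paragraph per line and an export in JSONL format where each record keeps the original paragraph under "text"; the function name missing_from_export is hypothetical.

```python
import json


def missing_from_export(input_lines, export_lines):
    """Return input texts that have no matching "text" in the exported JSONL lines."""
    exported = set()
    for line in export_lines:
        if line.strip():
            # Assumption: each exported record stores the paragraph under "text".
            exported.add(json.loads(line).get("text", "").strip())
    # Keep non-empty input lines whose text does not appear in the export.
    return [
        line.strip()
        for line in input_lines
        if line.strip() and line.strip() not in exported
    ]
```

Both arguments can be open file objects, so the lists of lines do not have to be built by hand. If the returned texts share a pattern (unusual characters, very long lines, and so on), that narrows down why they were skipped.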

ines · May 19, 2021, 2:41am

If you're working with paragraphs, then duplicates are definitely less likely, yeah. (It just sometimes comes up with sentences because some sentences are surprisingly common and you may have multiple instances of "Please see attached." or something like that, depending on the data).

So just to confirm, when you re-start the server, you see "No tasks available"? This would indicate that all examples in the stream that are not skipped for some reason have already been annotated. If so, what type of input file are you using, and is it possible that it has invalid entries, like invalid JSON, or examples without a "text"? If you run Prodigy with the environment variable PRODIGY_LOGGING=basic, it should show you a log statement if examples were skipped.
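
If the input is a JSONL file, the invalid entries mentioned above can also be looked for directly. This is a rough sketch under that assumption; the function name find_invalid_lines and the reason strings are made up for illustration and do not correspond to Prodigy's own log messages.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for entries that look unusable."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        # A usable record is assumed to be an object with a non-empty "text".
        if not isinstance(record, dict) or not record.get("text"):
            problems.append((number, "no text"))
    return problems
```

Line numbers are 1-based, so they can be looked up directly in an editor.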

rindranirina · May 19, 2021, 10:58pm

Yes, I should enable the logging first and check for skipped examples.

My initial input format was a .txt file with each line corresponding to a paragraph. I was also wondering about illegal characters.

Could it also be related to the limit of the history stack? I reduced it to 5, assuming that Prodigy would then automatically save the oldest tasks in my history.

ines · May 21, 2021, 3:28am

Prodigy will save automatically in the background once a full batch of annotations (of size batch_size) is ready, in addition to the examples shown in the history, which are kept there to allow you to hit "undo". Before you stop the annotation session, you should hit save to ensure that all unsaved examples are sent back to the server.

But even if you did forget to save at the end, Prodigy should queue up the examples again when you re-start the server, because they're not in the dataset yet. So the problem you describe sounds different.
