Resuming annotations within a session (after closing the browser)

Hi Ines, Matthew and Prodigy community,

Thanks again for the fantastic tool!!

I would like to use the "ner.manual" recipe with custom user sessions for annotating a large-ish dataset. I can run the recipe and serve it successfully. The problem is: whenever a user closes the browser, the tasks are gone. I would like them to be able to return to the same session_id and resume the annotations wherever they stopped.

From what I could read from the forum, this seems to have been a design decision:

I think I have a very similar issue to the one reported above, which was addressed by using on_load() and by checking hashes in the DB. The difference is that I would like to achieve the same behavior without having to create new sessions, as I am using session_ids to identify annotators (e.g. "john", "mary"). Is there any other way to overcome this behavior of tasks disappearing after the browser is closed, and to just keep feeding the stream until all tasks for the dataset are saved in the DB?

This is my code:

import os
import prodigy

# create a small example source file to annotate
jsonl_example = '/tmp/example.jsonl'
with open(jsonl_example, 'w') as f:
    f.write(
"""{"text": "This is a sample sentence"}
{"text": "Another one"}
{"text": "and another one"}""")

# only allow the named annotator sessions, e.g. /?session=john
os.environ["PRODIGY_ALLOWED_SESSIONS"] = 'john,mary'
# serve ner.manual on my_dataset with labels A, B and C
prodigy.serve('ner.manual', 'my_dataset', 'en_core_web_sm', jsonl_example,
              None, None, ['A', 'B', 'C'], None, port=65000)

Thanks in advance!

Hi! It sounds like what you're describing is pretty similar to the feature requested in this thread (not sure how important the exact order is in your case, though).

Prodigy cannot know whether a batch of examples that was sent out is still being annotated and coming back, or whether it was discarded (e.g. because the user closed the browser). So it will only ever send out the next batch. When you restart the server, Prodigy will check against the database again and send out the examples that haven't been annotated yet. You can also write a custom "infinite" stream that keeps checking against the hashes in the dataset and makes sure all unannotated examples are re-sent.
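For instance, checking against the hashes could look roughly like this (just a sketch, assuming "my_dataset" is your dataset and stream yields task dicts):

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()  # connects to the database from your prodigy.json
seen = set(db.get_task_hashes("my_dataset"))  # hashes already annotated

def filter_seen(stream):
    # skip any example whose task hash is already saved in the dataset
    for eg in stream:
        eg = set_hashes(eg)  # make sure _task_hash is set
        if eg["_task_hash"] not in seen:
            yield eg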

For the next version, we're also planning on shipping a new stream type that lets you enforce ordering and makes sure questions are re-sent if they haven't come back, before sending out the next batch. (The only trade-off is that this can lead to duplicate questions if the user accesses the session twice simultaneously – like, in two browser tabs or something. But that should be easy to prevent.)


This sounds awesome! I'm looking forward to it :slight_smile:

Thank you very much for the quick reply, Ines!

Just to clarify, when you say that I could write a "custom infinite stream", do you mean that this stream would avoid the need to restart the server, or would this step still be required?

I tried to use filter_tasks() to achieve the behavior I wanted, but it only works when I restart the server; only after a restart are the already annotated tasks excluded from labeling:

from prodigy.components.db import connect
from prodigy.components.filters import filter_tasks

DB = connect()
# exclude examples whose task hashes are already in the dataset
stream = filter_tasks(stream, DB.get_task_hashes(dataset))

Is this what you had in mind? If not, could you please share a pointer, in case one exists? Thanks again!

+1! This is great! Looking forward to the next release :slight_smile:

Also props to @justindujardin who implemented the new stream/feed logic :raised_hands:

Sorry for the unclear phrasing! But yes, what I meant was a stream generator that just keeps looping over the data and checks against the hashes already present in the database on each iteration. This means that examples that were previously skipped (e.g. because the user hit refresh) are queued up again later. Here's an example implementation:
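Roughly, as a minimal sketch (assuming the source is a JSONL file loaded into a list so it can be iterated over more than once, and infinite_stream is just an illustrative name):

from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL

def infinite_stream(dataset, source_path):
    db = connect()  # uses the database settings from your prodigy.json
    # load everything up front so we can loop over it repeatedly
    examples = [set_hashes(eg) for eg in JSONL(source_path)]
    while True:
        # re-check the saved hashes on every pass, so examples that were
        # skipped or lost (e.g. browser closed) are queued up again
        seen = set(db.get_task_hashes(dataset))
        todo = [eg for eg in examples if eg["_task_hash"] not in seen]
        if not todo:
            break  # everything has been annotated and saved
        yield from todo

Because it's a generator, the loop only advances as the web app requests new batches. Once every task hash is in the dataset, the stream ends and the app shows "No tasks available".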

Many thanks, Justin! o/

Thank you for the pointer, Ines. It is working like a charm! :slight_smile:
