ordered tasks on "mark" recipe

Thanks, I hope this wasn't too much of an info dump. That kind of stuff is just something we had to think about a lot when planning and developing Prodigy Scale. And it turns out there are actually many different ways you could want to reconcile your annotations, all of which are totally valid depending on the project and goal. So it took us a while to get all of this right in a way that generalises well. The good news is, we're already testing those wrappers (we're calling them "feeds") in the Prodigy internals and are hoping to expose more of them in the Python API so it's easier to put together your own multi-user and multi-session streams.

Yes, it really just iterates over them in order. You can also see this if you check the source of the mark function in recipes/generic.py.
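
To illustrate the idea (this is a paraphrase, not the actual recipe source, and the function name is made up), the stream handling boils down to something like this, assuming a JSONL input file:

from prodigy import set_hashes
from prodigy.components.loaders import JSONL

def get_ordered_stream(source):
    # examples come out lazily, in the exact order they appear
    # in the input file – no shuffling or re-sorting
    for eg in JSONL(source):
        yield set_hashes(eg)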

Yes – although Prodigy would still be fetching one question in the background to keep the queue full enough, so you'd always have at least one question "in limbo". Otherwise, it'd be way too easy to hit "No tasks available" if you annotate too fast and the next batch hasn't come in from the server yet.
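
For context, the size of that queue is controlled by the "batch_size" setting, which you can lower in your prodigy.json or via a recipe's config. A rough sketch of the latter – the recipe name here is made up:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("mark-small-batch")  # hypothetical recipe name
def mark_small_batch(dataset, source):
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "text",
        # send questions to the browser in batches of 1 – Prodigy will
        # still pre-fetch to avoid "No tasks available"
        "config": {"batch_size": 1},
    }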

The "orphaned" batch is technically discarded by the web app, yes. The server doesn't know if you close the browser – maybe you were just offline for a while and still have unsent answers (which is currently no problem). The back-end will only know whether answers are missing once the process is stopped.

It can then get the hashes from the existing dataset, and compare them against the hashes of the input data. When a stream comes in that's not yet hashed, Prodigy essentially does this:

from prodigy import set_hashes

def stream(examples):
    for eg in examples:
        # add the "_input_hash" and "_task_hash" to the example
        eg = set_hashes(eg)
        yield eg

This adds an "_input_hash" and a "_task_hash" to each example. The input hash describes the input data, e.g. the "text" or "image". The "_task_hash" is based on the input data and other annotations you might be collecting feedback on (like the "spans" or the "label"). This lets you distinguish between questions about the same text but with different labels etc., which is quite common in Prodigy. (You can also customise what the hashes are based on btw – see the API docs for set_hashes.)
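
For example, if you only wanted the input hash to depend on the "text" and the task hash on the "label", you could do something along these lines (check the set_hashes API docs for the exact keyword arguments):

from prodigy import set_hashes

eg = {"text": "hello world", "label": "GREETING"}
# input hash based only on the text, task hash based on the label
eg = set_hashes(eg, input_keys=("text",), task_keys=("label",), overwrite=True)
print(eg["_input_hash"], eg["_task_hash"])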

Even if your process is still running, you could make another pass over your data once the first stream runs out, check whether each example's hash is already in the dataset and, if it isn't, send the example out again. For example:

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()  # use settings from prodigy.json
# materialise the hashed examples in a list so they can be iterated over
# twice (a plain generator would be exhausted after the first pass)
hashed_examples = [set_hashes(eg) for eg in stream]  # "stream" = your loaded examples

def stream_generator():
    for eg in hashed_examples:
        yield eg
    # all examples have been sent out, so go over them again and
    # compare against the hashes currently in the dataset
    dataset_hashes = db.get_task_hashes("your_dataset_name")
    for eg in hashed_examples:
        if eg["_task_hash"] not in dataset_hashes:
            yield eg
    # etc.
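
Wired into a custom recipe, that could look roughly like this – the recipe name and view_id are just placeholders:

import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL

@prodigy.recipe("mark-resend-missed")  # hypothetical recipe name
def mark_resend_missed(dataset, source):
    db = connect()
    hashed_examples = [set_hashes(eg) for eg in JSONL(source)]

    def stream_generator():
        for eg in hashed_examples:
            yield eg
        # second pass: re-send everything that hasn't been answered yet
        dataset_hashes = db.get_task_hashes(dataset)
        for eg in hashed_examples:
            if eg["_task_hash"] not in dataset_hashes:
                yield eg

    return {
        "dataset": dataset,
        "stream": stream_generator(),
        "view_id": "text",
    }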