The --memorize flag will exclude examples that are already in the dataset. By "different annotators running on different machines", do you mean separate Python processes? If you run separate processes with the same input data and different target dataset names in the database, the questions should be asked in the exact same order as they come in.
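For example, a setup along these lines (assuming the mark recipe – the dataset names and input file are just placeholders) would give both annotators the same questions in the same order, each saved to their own dataset:

    prodigy mark alice_dataset news_data.jsonl --view-id classification --memorize
    prodigy mark bob_dataset news_data.jsonl --view-id classification --memorize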
The only thing that's important to consider is the reloading: if you refresh the browser, Prodigy will make a call to /get_questions, which will fetch the next batch from the queue. There's no obvious way for the process to know whether questions that were previously sent out are still "in progress" (e.g. if an annotator is working on them) or if they were discarded. The app doesn't wait to receive the answers to a batch before it sends out new questions, because it needs to make sure the queue never runs out.
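To make that more concrete, here's a rough sketch of what each call to /get_questions effectively does – the function name and batch size here are just illustrative, not Prodigy's internal API:

    from itertools import islice

    def get_questions(stream, batch_size=10):
        # each call simply advances the stream generator by one batch, whether
        # or not the previously sent-out questions were ever answered
        return list(islice(stream, batch_size))

So a refresh advances the generator, and questions that were sent out but never answered won't come back unless your stream re-sends them.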
So in order to reconcile the questions and answers, there are different decisions to make, depending on whether you want every user to label the same data once, whether you want some overlap, whether you want only one answer per question, etc. That's something you'd have to decide – but a good solution is usually to have a single provider (e.g. a separate REST API or something similar) that keeps track of what to send out to multiple consumers. This could, for example, happen in the stream, which is a generator and can make a request to the provider on each batch.
Here's a pseudocode example that illustrates the idea:
    import requests

    def stream():
        while True:
            # ask the provider for the next batch on every iteration
            next_questions = requests.get('http://your-api/?session-params').json()
            if not next_questions:
                break  # stop once the provider runs out of questions
            for question in next_questions:
                yield question
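For completeness, the provider side could be as simple as something like this – a minimal sketch assuming Flask, with the file path, batch size and endpoint all made up for illustration:

    import json
    from itertools import islice
    from threading import Lock

    from flask import Flask, jsonify

    app = Flask(__name__)
    lock = Lock()
    # load the source data once – each batch is handed out exactly once
    questions = (json.loads(line) for line in open('data.jsonl'))

    @app.route('/')
    def next_batch():
        with lock:  # concurrent consumers never receive the same batch
            batch = list(islice(questions, 10))
        return jsonify(batch)

Because the provider advances its generator under a lock, no two annotators will ever see the same questions – or you could adjust it to hand out each batch to every session if you do want overlap.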
If every annotator is doing the same thing and you just want to make sure that all questions were really answered, you could also use a simpler approach: write a normal generator that yields data from your input file, and add another loop at the end of your generator that checks the hashes ("_task_hash") of the existing answers and yields out the examples from the stream that aren't in the dataset yet, based on their hash.
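Here's a rough sketch of that approach, assuming a JSONL input file – the dataset name and file path are placeholders:

    from prodigy import set_hashes
    from prodigy.components.db import connect
    from prodigy.components.loaders import JSONL

    def stream():
        examples = [set_hashes(eg) for eg in JSONL('data.jsonl')]
        # first pass: send out everything
        for eg in examples:
            yield eg
        # second pass: look up which task hashes made it into the dataset
        # and re-send anything that was sent out but never answered
        db = connect()
        answered = db.get_task_hashes('your_dataset')
        for eg in examples:
            if eg['_task_hash'] not in answered:
                yield eg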