First of all, I just want to say that I'm really excited by the latest Prodigy release! I really like instant_submit and the multi-user sessions. Thank you for all your work and receptiveness!
I just have a few questions regarding the multi-user sessions on Prodigy:
Is there any way to configure the number for feed_overlap? I see from your manual that each example can either be annotated once in total or once by every annotator. But let's say I just want each example annotated three times. How would I go about that?
After playing with the multi-user sessions in the new Prodigy, I realized that the name of the session (i.e. alex in ?session=alex) doesn't appear anywhere in the annotated JSON stored in the example database. Is there a way to include the session name in the JSON in the example database, instead of having it associated only with the dataset?
Is there any way to programmatically retrieve the name of the session while the Prodigy interface is live? I sometimes want annotators to look at their own annotations again, to double-check that what they annotated is correct. So I need to retrieve their data from the example/dataset database and feed it back into their Prodigy session. But of course, I can't retrieve their data if I can't detect which session they are currently on.
Again, great work and we appreciate all your help!
Thanks! (Btw, just to make sure you saw it: the first v1.7.0 release included a small issue with instant submit, and we released a fix in v1.7.1 shortly after.)
Not yet, but we're working on that! This is also a feature we want to have in Prodigy Scale. We're also currently investigating an issue that seems to occur with feed overlap set to false (see here).
Not at the moment, but I agree that this should be at least an available option. In the meantime, you could add this yourself via the REST API. I've outlined a solution here:
The session ID is passed to Prodigy when requesting a new batch and when the answers are sent back via the REST API – but it's not easily available in the stream you create in your recipe at the moment. I'll need to think about this some more and come up with a way to make this possible without breaking backwards compatibility or introducing too many arbitrary recipe hooks and callbacks.
Thank you for the informative reply as always! In that case, I have three more questions:
I notice that the "instant save" only happens after the next task is loaded. Is there a way to save instantly, before the next task loads?
As you know, Prodigy has an update function which we can customize, and which executes when Prodigy receives annotations. I was wondering if there's something similar for running a custom function before a task is loaded into Prodigy.
This is a question about the update function. Its input is the answers array (i.e. update(answers)). Can that answers array contain annotations from just one session, like this:
[{"session_id": "multi_server_alex"}, {"session_id": "multi_server_alex"}, {"session_id": "multi_server_alex"}]
Or from more than one session, like this:
[{"session_id": "multi_server_alex"}, {"session_id": "multi_server_kevin"}, {"session_id": "multi_server_elaine"}]
I am asking this because I basically want annotators to check their own work after they have finished annotating. So I was thinking of getting the session_id from the update function and using it to filter out any annotations that were not done by that annotator. The result is that each annotator will see and correct only his/her own annotations.
I hope this makes sense - sorry if this is not super clear.
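To make the idea concrete, here's roughly what I have in mind as a hypothetical sketch. I'm using the "session_id" key from my examples above; I'm not sure what key Prodigy actually stores (it may be prefixed with an underscore):

```python
def make_update(target_session):
    """Build an update callback that only processes answers from one session.

    Hypothetical sketch: assumes each answer dict carries a "session_id"
    key, as in the examples above.
    """
    def update(answers):
        # keep only this annotator's own answers for the review pass
        return [eg for eg in answers if eg.get("session_id") == target_session]
    return update

answers = [
    {"session_id": "multi_server_alex", "answer": "accept"},
    {"session_id": "multi_server_kevin", "answer": "reject"},
    {"session_id": "multi_server_alex", "answer": "reject"},
]
update = make_update("multi_server_alex")
own_answers = update(answers)  # only alex's two answers remain
```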
Yes, I think I know what you mean! (Also really appreciate your feedback on those super new features btw and you sharing what you have in mind for annotation quality control. That’s something we’ve been thinking about a lot for Prodigy Scale, so it’s very helpful to hear about the types of workflows users are trying to implement for this.)
Ultimately, I think a lot of what you’re trying to do would become much easier if there was an option to pass the session ID to the stream on each request. This would then allow you to do something like this:
```python
def stream_with_session_id(session_id):
    yield from examples  # your raw data
    annotations = db.get_examples(session_id)
    yield from annotations  # session annotations
```
You could then also add something like "round": 2 to each example that’s sent out again (so you can keep track of the revised examples and what the annotator edited). And there could be a third step that enqueues annotations by someone else, or maybe random duplicates to make sure that the annotators make consistent decisions.
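That second step could be sketched like this in plain Python, independent of any Prodigy internals (the "round" key is just a convention here):

```python
def requeue_with_round(annotations, round_num=2):
    """Yield copies of previously annotated examples, tagged with a
    revision round so the second pass is distinguishable from the first.
    Plain-Python sketch; not a built-in Prodigy feature."""
    for eg in annotations:
        task = dict(eg)  # shallow copy: don't mutate the stored example
        task["round"] = round_num
        yield task

stored = [{"text": "hello world", "answer": "accept"}]
revision_tasks = list(requeue_with_round(stored))
```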
The idea I outlined above sounds simple in theory, but it'll be a bit tricky to implement in practice, because there's not actually such a direct coupling between the stream and the API endpoint that sends out the questions. So we'll have to think about this some more to come up with a straightforward solution.
In the meantime, it might be easiest to implement your workflow with separate instances, if that’s an option. You can still use the get_session_id recipe component to generate named session IDs.
Is there any update on this? We are currently using separate instances, but it is a lot to manage when you have a pool of 25 labelers! Being able to access the session id from the stream generator would really help with our use case.
We require three judgements (from three different people) for each of many hundreds of data points. The problem with separate instances is that we must divide the dataset up ahead of time and so predetermine how many items each person can label. This means we can't handle the situation where some people are labeling more than the others.
For feed_overlap=true how do you keep track of which examples have been seen by each session?
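For reference, my current mental model of the per-session feed is something like the following simplified sketch, where each session tracks the task hashes it has already received. This is surely not the actual internals, just how I picture it:

```python
class SessionFeed:
    """Simplified sketch of a feed_overlap=true feed: every session sees
    every example, and each session remembers which task hashes it has
    already been sent. Illustration only, not Prodigy's implementation."""

    def __init__(self, examples):
        self.examples = examples  # each example carries a "_task_hash"
        self.seen = {}            # session_id -> set of task hashes

    def get_questions(self, session_id, batch_size=2):
        seen = self.seen.setdefault(session_id, set())
        batch = []
        for eg in self.examples:
            if eg["_task_hash"] not in seen:
                seen.add(eg["_task_hash"])
                batch.append(eg)
            if len(batch) == batch_size:
                break
        return batch

feed = SessionFeed([{"_task_hash": h, "text": f"task {h}"} for h in range(3)])
alex_batch = feed.get_questions("alex")    # alex gets tasks 0 and 1
kevin_batch = feed.get_questions("kevin")  # kevin also starts at task 0
```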
Thanks for your question. Would you be interested in testing an early version of our upcoming release of v1.12? One feature we're adding is enhanced task routing, which will give developers more flexibility on how to allocate tasks across annotators. If so, I can reach out to you via email.
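To give a flavor of what task routing makes possible, the "three judgements per example" requirement could be expressed as routing logic roughly like this. The function name and signature here are illustrative only; check the v1.12 docs for the exact task-router hook:

```python
import itertools

def make_three_way_router(annotators):
    """Routing logic for "each example gets exactly three judgements":
    cycle through the annotator pool so the workload stays balanced.
    Illustrative sketch; the real v1.12 task-router API may differ."""
    ring = itertools.cycle(annotators)

    def route(example):
        # the next three annotators in the ring get this example
        return [next(ring) for _ in range(3)]

    return route

route = make_three_way_router(["alex", "kevin", "elaine", "dana"])
first = route({"text": "example 1"})   # ["alex", "kevin", "elaine"]
second = route({"text": "example 2"})  # ["dana", "alex", "kevin"]
```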