I have set up Prodigy so that several annotators can manually classify text samples. The requirement is that every annotator must annotate every sample (though not necessarily in the same order). The dataset for this test is quite small (500 samples). I therefore used the feed_overlap: true setting with the textcat.manual recipe. The problem is that several annotators did not receive all the tasks before getting a "No tasks available." message.
A few key points for our setup:
- We have six annotators (including myself), each with a different session name.
- The first two annotators could annotate all 500 samples, the next three could annotate about 350 each before seeing "No tasks available.", and the last one only about 100.
- When I try with a completely new session, I see the same "No tasks available." message.
- Even when I change the settings to "force_stream_order": true, no tasks are shown. Note that we originally used this setting with a different dataset but stopped because we kept getting duplicated tasks.
Please note that:
- From time to time, I downloaded the results to check the ongoing annotations using prodigy db-out textcat_posts
- The annotators annotated with a significant time delay between them (hours to days)
- Prodigy version: 1.9.4 (I don't have access to new upgrades anymore, unfortunately)
I used the following command to start the server:
prodigy textcat.manual textcat_posts data/posts.jsonl --label SPAM,EVENTS,<A FEW MORE> --exclusive
The Prodigy config is as follows:
Any idea what is going on and what I could do so that the next batch of annotations is sent to everybody? I plan to start a much larger annotation project in which all annotators are required to annotate all samples, but with the current issue, I don't know if they will ever be able to see all of them.
Thanks so much for your help!!
Hi! If your goal is to annotate all examples, you could also just start separate processes for the different annotators, and save their annotations to separate datasets? This is always the cleanest solution and it makes it easiest to reason about what's going on. And there's virtually no difference in the resulting data.
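To make the "one process and one dataset per annotator" suggestion concrete, here is a minimal sketch that only builds the commands, it does not start anything. The annotator names, the dataset naming scheme (textcat_posts-NAME) and the port numbers are made up for illustration. It also assumes that the PRODIGY_PORT environment variable is honoured by your Prodigy version, so please check that against the docs for 1.9.x (a separate prodigy.json with a "port" entry per process would be an alternative).

```python
import shlex


def build_commands(annotators, base_dataset, source, labels, base_port=8080):
    """Return one (env, argv) pair per annotator.

    Each annotator gets their own dataset name and their own port.
    The naming scheme and the ports are illustrative assumptions.
    """
    commands = []
    for offset, name in enumerate(annotators):
        # Assumption: Prodigy reads the port from PRODIGY_PORT.
        env = {"PRODIGY_PORT": str(base_port + offset)}
        argv = [
            "prodigy",
            "textcat.manual",
            f"{base_dataset}-{name}",  # separate dataset per annotator
            source,
            "--label",
            ",".join(labels),
            "--exclusive",
        ]
        commands.append((env, argv))
    return commands


if __name__ == "__main__":
    # Hypothetical annotator names and a shortened label list.
    pairs = build_commands(
        ["alice", "bob"], "textcat_posts", "data/posts.jsonl", ["SPAM", "EVENTS"]
    )
    for env, argv in pairs:
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

Each printed line can then be run in its own terminal (or handed to a process manager), and every annotator gets the URL with their port instead of a session name.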
If you're on 1.9.x, your license should include all updates until v1.9.10. You can email us at firstname.lastname@example.org with your order ID, and we can send you the latest installer included with your order.
Hi @ines! Thanks a lot for your answer.
This is quite disappointing to be honest as it makes the management of different datasets and annotator processes quite painful. With that approach, each annotator would have to go to their own URL or to their own individual port on top of adding their session name.
Shouldn't feed_overlap ensure that in the first place? Right now, if someone finishes all the annotations, most of the other annotators cannot see any task, which sounds like a bug to me.
Isn't there any other option?
On another note, I could get access to version 1.9.10.
The feed_overlap setting was mainly introduced to make it easier to annotate partial streams with multiple annotators, based on what's already present in the dataset and annotated by other people. (Before that, you had to implement a custom stream that kept checking the database, which worked, but was a bit inconvenient.) We later adjusted the feed overlap mechanism a couple of times and it should now behave the same as separate instances, for consistency (in the latest version of Prodigy – although, there may be more changes to the stream mechanism in the future). But it's still not something we'd necessarily recommend because it just adds another layer of abstraction.
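For readers wondering what "a custom stream that kept checking the database" could look like, here is a rough sketch of the idea in plain Python. It is not Prodigy's internal implementation. The fetch_done_keys callable is hypothetical: it stands in for whatever query you would run against your database to get the keys of already annotated examples.

```python
def unannotated_stream(tasks, fetch_done_keys, key="text"):
    """Yield only tasks that have not been annotated yet.

    fetch_done_keys is a hypothetical callable returning the set of keys
    that already exist in the dataset. It is called again for every task,
    so annotations saved in the meantime are taken into account.
    """
    for task in tasks:
        if task[key] not in fetch_done_keys():
            yield task


if __name__ == "__main__":
    done = {"already annotated"}  # stand-in for a database lookup
    tasks = [{"text": "already annotated"}, {"text": "still open"}]
    for task in unannotated_stream(tasks, lambda: done):
        print(task["text"])
```

Querying on every task is the simplest version of this. In practice you would probably refresh the set once per batch to avoid hammering the database.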
(Tbh, I kinda regret shipping or at least documenting this so early – it's an internal API we added for Prodigy Teams and something people wanted to try, so we exposed parts of it. But it turned out to be a lot trickier to use in Prodigy Standalone.)
In this case, you wouldn't need a session name – you'd just have a unique URL instead of the same URL with an added unique session name. Under the hood, the dataset structure would be very similar, too – if you're using multi-user sessions, Prodigy will always create a separate session dataset, e.g.
If you have separate instances, your annotations would be all datasets starting with dataset_name – or you could have your recipe add to dataset_name as well, in which case, the resulting dataset structure would be identical to what you get with multi-user sessions.
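Whichever setup you pick, it can help to check how many distinct examples each annotator has actually covered. The sketch below counts that from an export created with prodigy db-out. It assumes that each exported record carries a _session_id and an _input_hash field, which is what I'd expect from named multi-user sessions, but please verify it against your own export. The file name annotations.jsonl is just an example.

```python
import json
from collections import defaultdict


def coverage_by_annotator(lines, session_key="_session_id", hash_key="_input_hash"):
    """Count distinct annotated inputs per annotator.

    lines is an iterable of JSONL strings. The field names are assumptions
    about the export format and can be overridden.
    """
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        annotator = record.get(session_key, "unknown")
        seen[annotator].add(record.get(hash_key))
    return {name: len(hashes) for name, hashes in seen.items()}


if __name__ == "__main__":
    # Example: prodigy db-out textcat_posts > annotations.jsonl
    with open("annotations.jsonl", encoding="utf8") as f:
        for name, count in sorted(coverage_by_annotator(f).items()):
            print(name, count)
```

With separate datasets per annotator, you would run the same function on each export, or concatenate the files first.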