hi @auguststapput!
Thanks for your question and welcome to the Prodigy community!
Could it be work-stealing? Did you notice these duplicates near the end of the stream?
We fixed a front-end bug in v1.11.9, which shouldn't be the cause here given you're using v1.11.11. But work stealing is still a possibility, as we outlined in our v1.11.9 announcement:
It should be noted, though, that a small number of duplicates is still expected in multi-user workflows with feed_overlap set to false. This is perfectly normal behavior and should only occur towards the end of the example stream.
These "end-of-queue" duplicates come from the work-stealing mechanism in the internal Prodigy feed. "Work-stealing" is a preventive mechanism to avoid records in the stream from being lost when an annotator requests a batch of examples to annotate, effectively locking those examples, and then never annotates them. This mechanism allows annotators that reach the end of a shared stream to annotate these otherwise locked examples that other annotators are holding on to. Essentially we have prioritized annotating all examples in your data stream at least once vs at most once while potentially losing a few.
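To make the trade-off concrete, here's a toy simulation of the idea. It is only a sketch and not Prodigy's actual feed implementation: the ToyFeed class, its methods and the two annotators are all made up for illustration.

```python
from collections import deque


class ToyFeed:
    """Toy model of a shared feed. NOT Prodigy's real implementation."""

    def __init__(self, examples, batch_size=2, allow_work_stealing=True):
        self.queue = deque(examples)
        self.batch_size = batch_size
        self.allow_work_stealing = allow_work_stealing
        self.locked = {}  # session -> examples sent out but not yet saved
        self.saved = []   # (session, example) pairs that were saved

    def get_batch(self, session):
        size = min(self.batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(size)]
        if not batch and self.allow_work_stealing:
            # The queue is empty: re-send examples another session still holds.
            for other, held in self.locked.items():
                if other != session and held:
                    batch = held[: self.batch_size]
                    break
        self.locked.setdefault(session, []).extend(batch)
        return batch

    def save(self, session, batch):
        for example in batch:
            self.saved.append((session, example))
            if example in self.locked.get(session, []):
                self.locked[session].remove(example)


def run(allow_work_stealing, alice_saves_late):
    """Alice requests a batch and sits on it while Bob works through the stream."""
    feed = ToyFeed(["a", "b", "c", "d"], allow_work_stealing=allow_work_stealing)
    alice_batch = feed.get_batch("alice")    # alice now holds a and b
    feed.save("bob", feed.get_batch("bob"))  # bob annotates c and d
    feed.save("bob", feed.get_batch("bob"))  # end of stream: bob steals a and b, or gets nothing
    if alice_saves_late:
        feed.save("alice", alice_batch)
    return sorted(example for _, example in feed.saved)


print(run(allow_work_stealing=True, alice_saves_late=False))   # everything annotated once
print(run(allow_work_stealing=True, alice_saves_late=True))    # a and b annotated twice
print(run(allow_work_stealing=False, alice_saves_late=False))  # a and b never annotated
```

With work stealing on, nothing is lost even if Alice never comes back, but if she does save late you get the end-of-queue duplicates. With it off, there are no duplicates, but the examples Alice was holding are never annotated.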
One way to check whether work stealing is the cause is to look for FEED: re-adding open tasks to stream in the logs.
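If you've captured the server output to a file, a small script can do the search for you. This is just a sketch: it assumes you ran Prodigy with logging enabled (e.g. the PRODIGY_LOGGING environment variable) and redirected the output to a file, and the file name prodigy.log and the helper names are hypothetical.

```python
# The message to look for, as quoted above.
MARKER = "FEED: re-adding open tasks to stream"


def find_work_stealing_lines(log_text):
    """Return (line_number, line) pairs for every log line containing the marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if MARKER in line
    ]


def scan_file(path):
    """Read a captured log file and print the matching lines."""
    with open(path, encoding="utf8") as f:
        hits = find_work_stealing_lines(f.read())
    for number, line in hits:
        print(f"{path}:{number}: {line}")
    return hits
```

You'd call it as scan_file("prodigy.log"). If it finds matches around the time the duplicates were created, work stealing is the likely explanation.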
One way to minimize the effect of work stealing is to inform your annotators to double-check that they save their most recent batch when they're done annotating.
In our Prodigy v1.12 alpha, one of the enhancements is more control and customization for task routing, like the ability to turn off work stealing via the allow_work_stealing configuration setting:
Extended, fully customizable support for multi-annotator workflows. You can now customize what should happen when a new annotator joins an ongoing annotation project, how tasks should be allocated between existing annotators, and what should happen when one annotator finishes their assigned tasks before others. For common use-cases, you can use the options feed_overlap, annotations_per_task and allow_work_stealing (see the updated configuration documentation for details). Custom recipes can specify session_factory and task_router callbacks for full control.
But keep in mind that if you turn this off, you open yourself up to the risk of some tasks getting lost in your stream.
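If you're fine with that trade-off, here's a minimal sketch of what the settings could look like. The option names come from the v1.12 notes quoted above. The values, and the assumption that they go into your prodigy.json like other settings, are mine, so please check the v1.12 configuration docs before relying on this.

```python
import json

# Option names as listed in the v1.12 alpha notes; the values are illustrative.
config = {
    "feed_overlap": False,         # each example goes to one annotator
    "allow_work_stealing": False,  # no end-of-queue re-sending, so no duplicates from it
}

# Print the fragment you would merge into your prodigy.json.
print(json.dumps(config, indent=2))
```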
It's worth noting that for our future v2 release, we're working to refactor our feeds, which would eliminate this trade-off.
For the v2 release, we are working on a complete redesign of the feed mechanism that will eliminate the need for the work-stealing tradeoff altogether.
In the meantime, if you want more control over task routing, I'd strongly recommend giving our v1.12 alpha a try (we'd appreciate the feedback!). Here's a preview of the task routing docs (link may change in the future).
Hope this helps!