Tasks are duplicated

ryanwesslen · June 6, 2023, 6:36pm

Thanks for your question and welcome to the Prodigy community!

Could it be work-stealing? Did you notice these duplicates near the end of the stream?

We fixed a front-end bug in v1.11.9, so since you're using v1.11.11, that fix is already included. But work stealing is still a possibility, as we outlined in our v1.11.9 announcement:

It should be noted, though, that a small number of duplicates is still expected in multi-user workflows with feed overlap set to false. This is perfectly normal behavior and should only occur towards the end of the example stream.

These "end-of-queue" duplicates come from the work-stealing mechanism in the internal Prodigy feed. "Work-stealing" is a preventive mechanism that keeps records in the stream from being lost when an annotator requests a batch of examples to annotate, effectively locking those examples, and then never annotates them. This mechanism allows annotators that reach the end of a shared stream to annotate these otherwise locked examples that other annotators are holding on to. Essentially, we have prioritized annotating all examples in your data stream at least once over annotating them at most once while potentially losing a few.
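To make the trade-off concrete, here is a toy two-annotator model of the behavior described above. It is only an illustrative sketch and not Prodigy's actual feed implementation: the function name simulate and all of its parameters are made up for this example.

```python
from collections import Counter


def simulate(tasks, batch_size, allow_work_stealing, a_saves_last_batch):
    """Toy model of a shared feed with two annotators (NOT Prodigy's real code).

    Returns (duplicates, lost): task ids saved more than once, and task ids
    that were never saved at all.
    """
    queue = list(tasks)
    saved = []  # (annotator, task) pairs that reached the database

    # Annotator A requests one batch, which locks those tasks for A.
    held_by_a = queue[:batch_size]
    queue = queue[batch_size:]

    # Annotator B works through the rest of the shared stream and saves it.
    saved.extend(("B", task) for task in queue)

    # B has reached the end of the stream while A still holds open tasks.
    # With work stealing, those open tasks are sent to B as well.
    if allow_work_stealing:
        saved.extend(("B", task) for task in held_by_a)

    # A may or may not come back and save the batch it was holding.
    if a_saves_last_batch:
        saved.extend(("A", task) for task in held_by_a)

    counts = Counter(task for _, task in saved)
    duplicates = sorted(task for task, n in counts.items() if n > 1)
    lost = sorted(task for task in tasks if counts[task] == 0)
    return duplicates, lost


# Work stealing on, A saves late: nothing lost, but tasks 0 and 1 are duplicated.
print(simulate(range(6), 2, allow_work_stealing=True, a_saves_last_batch=True))
# ([0, 1], [])

# Work stealing off, A never saves: no duplicates, but tasks 0 and 1 are lost.
print(simulate(range(6), 2, allow_work_stealing=False, a_saves_last_batch=False))
# ([], [0, 1])
```

In this toy model, duplicates only show up when the stolen batch is later saved by its original holder, which is why they cluster at the end of the stream.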

One way to check this is to look for the message FEED: re-adding open tasks to stream in the logs.
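If you have captured the log output to a file, a small script along these lines could help spot that message. This is a minimal sketch: the helper name find_work_stealing_events and the sample log lines are hypothetical, and only the marker text comes from the message quoted above.

```python
MARKER = "FEED: re-adding open tasks to stream"


def find_work_stealing_events(log_lines, marker=MARKER):
    """Return (line_number, text) for every log line containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]


# Hypothetical sample lines; real log output will look different.
sample = [
    "some earlier log line\n",
    "12:00:01: FEED: re-adding open tasks to stream\n",
    "some later log line\n",
]

for number, text in find_work_stealing_events(sample):
    print(f"line {number}: {text}")
```

To scan a real file, pass an open file handle instead of the sample list, for example `with open(path) as f: find_work_stealing_events(f)`, where path points to wherever you saved the logs.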

One way to minimize the effect of work stealing is to inform your annotators to double-check that they save their most recent batch when they're done annotating.

In our Prodigy v1.12 alpha, one of the enhancements is more control over task routing, including the ability to turn off work stealing via the allow_work_stealing configuration setting:

Extended, fully customizable support for multi-annotator workflows . You can now customize what should happen when a new annotator joins an ongoing annotation project, how tasks should be allocated between existing annotators, and what should happen when one annotator finishes their assigned tasks before others. For common use-cases, you can use the options feed_overlap , annotations_per_task and allow_work_stealing (see the updated configuration documentation for details). Custom recipes can specify session_factory and task_router callbacks for full control.
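As a rough illustration, the settings named above could be merged into an existing configuration like this. The key names feed_overlap and allow_work_stealing come from the announcement; the values, the helper name with_work_stealing_disabled and the example existing config are assumptions for this sketch, so please check the updated configuration documentation before relying on them.

```python
import json


def with_work_stealing_disabled(config):
    """Return a copy of a config dict with work stealing turned off.

    The values below are illustrative assumptions, not recommended defaults.
    """
    updated = dict(config)
    updated["feed_overlap"] = False
    updated["allow_work_stealing"] = False
    return updated


# Hypothetical existing settings, e.g. loaded from a prodigy.json file.
existing = {"batch_size": 10}

print(json.dumps(with_work_stealing_disabled(existing), indent=2))
```

The printed JSON can then be copied into your configuration file, assuming you are running a version that supports allow_work_stealing.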

But keep in mind that if you turn this off, you run the risk of some tasks in your stream getting lost.

It's worth noting that for our future v2 release, we're working on refactoring our feeds, which would eliminate this trade-off:

For the v2 release, we are working on a complete redesign of the feed mechanism that will eliminate the need for the work-stealing trade-off altogether.

In the meantime, if you want more control over task routing, I'd strongly recommend giving our v1.12 alpha a try (we'd appreciate the feedback!). Here's a preview of the task routing docs (link may change in the future).

Hope this helps!
