Tasks are duplicated

Hi,

I have version 1.11.11 and I'm experiencing problems with the same task being given to the same annotator more than once. I run the following recipe:

python -m prodigy ner.manual ner_tagging blank:da ./input_file.jsonl --label ./labels_ner.txt
Warning: filtered 41% of entries because they were duplicates. 
Only 2635 items were shown out of 4490. 
You may want to deduplicate your dataset ahead of time to get a better understanding of your dataset size.

Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

Relevant to this, my prodigy.json is set to the following:

"feed_overlap": false
"exclude_by": "input"

I have now received the following log, and one of my annotators has informed me that they have been given a task they had already completed earlier (using the same /?session=user_name):

Front End Log - 2023-06-01 09:35:27+00:00: Duplicate _task_hash found in Frontend batch.

I used the following and checked the output for any duplicates:

python -m prodigy db-out ner_tagging ./output

The output file showed duplicate lines (one of them was present six times, which corresponded to the number of times my annotator estimated they had done that task) with the exact same _task_hash and _input_hash. In fact, the entire line was identical except for the "_timestamp".
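
For anyone who wants to run the same check, here's a quick sketch (assuming db-out wrote the export to output/ner_tagging.jsonl; adjust the path if yours differs):

import json
from collections import Counter

# Count how often each (input hash, task hash) pair appears in the export.
counts = Counter()
with open("output/ner_tagging.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        counts[(eg["_input_hash"], eg["_task_hash"])] += 1

# Print any pair that occurs more than once.
for (input_hash, task_hash), n in counts.most_common():
    if n > 1:
        print(input_hash, task_hash, n)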

Importantly, the duplicated lines in the output were not duplicates in the original data - input_file.jsonl contains each of them only once - so it seems like Prodigy really is just giving the same task to the same annotator.

Is this something someone could help me with?

In advance, thanks!

hi @auguststapput!

Thanks for your question and welcome to the Prodigy community :wave:

Could it be work-stealing? Did you notice these duplicates near the end of the stream?

We fixed a front-end bug in v1.11.9, which should already be included given you're using v1.11.11. But work-stealing is still a possibility, as we outlined in our v1.11.9 announcement:

It should be noted, though, that a small number of duplicates is still expected in multi-user workflows with feed_overlap set to false. This is perfectly normal behavior and should only occur towards the end of the example stream.

These "end-of-queue" duplicates come from the work-stealing mechanism in the internal Prodigy feed. "Work-stealing" is a preventive mechanism to avoid records in the stream from being lost when an annotator requests a batch of examples to annotate, effectively locking those examples, and then never annotates them. This mechanism allows annotators that reach the end of a shared stream to annotate these otherwise locked examples that other annotators are holding on to. Essentially we have prioritized annotating all examples in your data stream at least once vs at most once while potentially losing a few.

One way to check this is to look for FEED: re-adding open tasks to stream in the logs.
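
If you don't see Prodigy's logs at all, you can enable them with the PRODIGY_LOGGING environment variable, for example:

PRODIGY_LOGGING=basic python -m prodigy ner.manual ner_tagging blank:da ./input_file.jsonl --label ./labels_ner.txt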

One way to minimize the effect of work-stealing is to ask your annotators to double-check that they save their most recent batch when they're done annotating.

In our Prodigy v1.12 alpha, one of the enhancements is more control and customization for task routing, such as the ability to turn off work-stealing via the allow_work_stealing configuration setting:

Extended, fully customizable support for multi-annotator workflows. You can now customize what should happen when a new annotator joins an ongoing annotation project, how tasks should be allocated between existing annotators, and what should happen when one annotator finishes their assigned tasks before others. For common use cases, you can use the options feed_overlap, annotations_per_task and allow_work_stealing (see the updated configuration documentation for details). Custom recipes can specify session_factory and task_router callbacks for full control.
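
For example, turning work-stealing off in the v1.12 alpha would look something like this in your prodigy.json (a sketch based on the option names above; check the updated configuration docs for the exact details):

{
  "feed_overlap": false,
  "exclude_by": "input",
  "allow_work_stealing": false
}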

But recall that if you turn this off, you open yourself up to the risk of some tasks getting lost in your stream.

It's worth noting that for our future v2 release, we're working on a complete redesign of the feed mechanism that will eliminate the need for the work-stealing trade-off altogether.

In the meantime, if you want more control over task routing, I'd strongly recommend giving our v1.12 alpha a try (we'd appreciate the feedback!). Here's a preview of the task routing docs (link may change in the future).

Hope this helps!

Thank you for the response!

Hmm, it isn't happening at the end of the stream, and it also seems weird that the same tasks are being duplicated up to six times (with the same hashes) while others are only shown once (as expected).

I asked my annotator and they are very good at making sure to press "Save" at the end of a session, so that also doesn't seem to be the problem.

That's a fair point. I'm not ruling out that there's something else going on -- task allocation is really hard, which is why we're massively overhauling it.

If you do see it again, can you confirm whether or not you saw FEED: re-adding open tasks to stream in the logs?

What would help most is a reproducible example: the recipe, prodigy.json, logs, and data (I know that may be hard).

I'm still trying to understand this, but it's hard without a fully reproducible example. Any additional info would be very helpful for us to investigate further.