Multiple annotators without personal repetition

Thanks for making Prodigy - it’s a great tool. I am aware that support for multiple annotators is on the roadmap and not yet available, but I am trying to make something work in the meantime.

It would be great if a task was never shown to the same annotator twice, but could still be shown to other annotators.

I am trying to adapt the code from the mark recipe below, and I am wondering whether there is a way to hash the task after adding the annotator field to it (task['annotator'] = annotator).
Here is the mark recipe code that I am hoping to modify:

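        for eg in stream:
            # Task was already annotated: count the stored answer and skip it
            if TASK_HASH_ATTR in eg and eg[TASK_HASH_ATTR] in memory:
                answer = memory[eg[TASK_HASH_ATTR]]
                counts[answer] += 1
            # Otherwise send the task out for annotation
            else:
                yield eg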

Thanks – always nice to see what others are building with Prodigy!

To answer your question: Yes, there’s a prodigy.util.set_hashes() helper function that does exactly that. It looks like this:

| Argument | Type | Description |
| --- | --- | --- |
| task | dict | The annotation task to hash. |
| input_keys | list / tuple | The task attributes to consider when generating the input hash. Default: ('text', 'image', 'html', 'input'). |
| task_keys | list / tuple | The task attributes to consider when generating the task hash. Default: ('spans', 'label'). |
| overwrite | bool | Overwrite already existing hashes. |
| RETURNS | dict | The annotation task with added hashes. |

For example:

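from prodigy.util import set_hashes
# Include the custom 'annotator' field when generating the task hash
task_keys = ('spans', 'label', 'annotator')
hashed_tasks = [set_hashes(eg, task_keys=task_keys, overwrite=True) for eg in tasks]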
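
To connect this back to the loop from the mark recipe, here is a rough sketch of a per-annotator filter. It is not part of Prodigy: filter_for_annotator, seen_hashes and rehash are hypothetical names, and the "_task_hash" key is an assumption about what TASK_HASH_ATTR resolves to. In practice, rehash could wrap set_hashes with the extended task_keys.

```python
def filter_for_annotator(stream, annotator, seen_hashes, rehash):
    """Yield only tasks this annotator has not answered yet.

    stream: iterable of task dicts
    annotator: identifier of the current annotator
    seen_hashes: set of task hashes already stored in the dataset
    rehash: callable that adds hashes to a task and returns it (hypothetical)
    """
    for eg in stream:
        eg = dict(eg)  # shallow copy so the original task is not modified
        eg["annotator"] = annotator
        eg = rehash(eg)  # the hash must be computed after adding the annotator
        # "_task_hash" is assumed to be the key behind TASK_HASH_ATTR
        if eg["_task_hash"] in seen_hashes:
            continue  # this annotator already answered this task
        yield eg
```

Because the annotator is part of the task hash, the same text produces a different hash for each person, so it is skipped only for the annotator who already answered it.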

The hashing works like this:

  • If one or more of the keys are present in the task, their values are concatenated and hashed using mmh3.
  • If no keys are found, the full task is dumped as JSON and hashed instead.
  • For the task hash, the input hash is added as a prefix to the concatenated values (or JSON dump) before hashing. This ensures that the task hash is always generated with respect to the original input.
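
The three rules above can be sketched roughly as follows. This is an illustration of the scheme only, not Prodigy's actual implementation: it uses hashlib from the standard library as a stand-in for mmh3, and sketch_hashes and the helper names are made up for this example, so the resulting numbers will not match real Prodigy hashes.

```python
import hashlib
import json


def _hash(text):
    # Stand-in for mmh3: any stable string hash is enough for illustration.
    digest = hashlib.sha1(text.encode("utf8")).hexdigest()
    return int(digest[:8], 16)


def _values_or_dump(task, keys):
    # Concatenate the values of the keys that are present in the task;
    # if none are found, fall back to a JSON dump of the full task.
    values = [str(task[key]) for key in keys if key in task]
    if values:
        return "".join(values)
    return json.dumps(task, sort_keys=True)


def sketch_hashes(task, input_keys=("text", "image", "html", "input"),
                  task_keys=("spans", "label")):
    input_hash = _hash(_values_or_dump(task, input_keys))
    # Prefix the input hash so the task hash always depends on the input.
    task_hash = _hash(str(input_hash) + _values_or_dump(task, task_keys))
    return input_hash, task_hash
```

With 'annotator' added to task_keys, two tasks with the same text and label but different annotators share an input hash and get different task hashes, which is the behaviour the question asks for.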

Another solution would be to do the annotation management upstream of Prodigy: you would have another service which owned a central data feed, which would split out work and send it to each annotator. Inside Prodigy, you would just be pulling tasks from a local service and using that as the stream.
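
As a very rough sketch of what the splitting step in such an upstream service could look like (all names here are hypothetical, and a real service would also need persistence and an API for Prodigy to pull from):

```python
from itertools import cycle


def assign_tasks(tasks, annotators, per_task=2):
    """Give each task to `per_task` distinct annotators, round-robin."""
    if per_task > len(annotators):
        raise ValueError("per_task cannot exceed the number of annotators")
    queues = {name: [] for name in annotators}
    rotation = cycle(annotators)
    for task in tasks:
        # Consecutive picks from the rotation are distinct as long as
        # per_task <= len(annotators) and annotator names are unique.
        for _ in range(per_task):
            queues[next(rotation)].append(task)
    return queues
```

Each annotator's queue would then be served to their own Prodigy instance as the stream, so no one sees the same task twice.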

This is probably how we'll end up doing things, because we think the annotation management system should really be a separate tool. If it were inside Prodigy it would be an entirely separate subcommand that ran as a service. It seems much clearer to break it out into its own executable.
