Non-random batches across Annotators

vsocrates · October 3, 2022, 1:15pm

Hi!

Using "feed_overlap", we want to ensure that every example is seen by two annotators. We can't guarantee this because it seems like Prodigy randomly selects batches (of, say, size 10) to present to annotators across sessions. Therefore, unless we annotate the entire dataset, we can't be sure that the annotators are annotating the same portion (e.g. 50%) of the dataset.

Is there a way to overcome this using the Prodigy config files, or would I have to write a custom recipe with sorters? If so, what would be the simplest way of modifying the textcat.manual recipe? Thanks!

ryanwesslen · October 3, 2022, 8:16pm

hi @vsocrates!

Why do you believe that Prodigy sends batches randomly? Was it just because of this example? That isn't Prodigy's default behavior; the example was a demo for cases where you want to modify the order of your records (e.g., for active learning).

By default, for non-active-learning recipes (e.g., manual, correct, or review recipes), Prodigy's loaders send out examples in the order they are loaded (e.g., the order in the .jsonl or .txt file). This post explains:

If you want to see this for yourself, writing a custom recipe can help. For example, check out our recipe repo, where we have additional recipes (and general examples of the default ones):

github.com

explosion/prodigy-recipes/blob/master/textcat/textcat_manual.py

from typing import List, Optional
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string


# Helper functions for adding user provided labels to annotation tasks.
def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        yield task

def add_labels_to_stream(stream, labels):
    for task in stream:
        task["label"] = labels[0]
        yield task

# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to

(This file has been truncated.)

You can use this to print certain items to the console (e.g., the row/index ID from the original data file) or add the row to the "meta" key so it is shown in the UI.

Also, keep an eye on the logging. It can help you see which examples are being served.

Are multiple annotators hitting the same instance simultaneously? Are you using sessions for multiple users?
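As a rough illustration of the "meta" idea, here is a minimal sketch of a stream wrapper that could be dropped into a custom recipe. The function name and the "row" key are hypothetical, and the sample stream is made up; it only assumes that tasks are plain dicts, as in the recipe above.

```python
from typing import Dict, Iterable, Iterator


def add_row_index_to_meta(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Tag each task with its position in the source, preserving order."""
    for index, task in enumerate(stream):
        task = dict(task)  # copy so the caller's dicts are not mutated
        meta = dict(task.get("meta", {}))
        meta["row"] = index  # "row" is a hypothetical key name
        task["meta"] = meta
        # Printing here shows the order in which tasks leave the stream.
        print(f"serving row {index}: {task.get('text', '')[:40]}")
        yield task


# Made-up examples standing in for the loaded .jsonl rows.
examples = [{"text": "first"}, {"text": "second"}, {"text": "third"}]
tagged = list(add_row_index_to_meta(examples))
```

In a recipe, the wrapper would be applied to the stream right after loading, so that the row number reflects the order in the source file.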

There are a few posts that explain feed_overlap and mention issues like you're having:

If your goal is to "ensure all annotations are seen by two annotators", one option is to run two separate processes, each on a different port, and set "force_stream_order": True in both. This would work great if you have two annotators and can assign each of them their own URL/port.
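A minimal sketch of what the per-annotator settings could look like is below. The annotator names and the base port are hypothetical; "port" and "force_stream_order" are the only settings shown. How the settings reach each process (for example, a separate prodigy.json per process) depends on your setup and Prodigy version, so treat that part as an assumption.

```python
import json
from typing import Dict, List

BASE_PORT = 8080  # hypothetical starting port
ANNOTATORS = ["annotator_a", "annotator_b"]  # hypothetical names


def build_configs(names: List[str], base_port: int = BASE_PORT) -> Dict[str, dict]:
    """Give every annotator their own port, all with a fixed stream order."""
    return {
        name: {"port": base_port + offset, "force_stream_order": True}
        for offset, name in enumerate(names)
    }


configs = build_configs(ANNOTATORS)
for name, config in configs.items():
    # json.dumps renders Python's True as JSON's true.
    print(name, json.dumps(config))
```

Each process would likely also need its own dataset name, so that one annotator's saved answers do not cause examples to be skipped for the other.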

Also, I remember this post where this community member had an interesting workflow:

The key for this is how hashing and exclusion (e.g., exclude_by) can be used to exclude duplicates.
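To make the idea concrete, here is a conceptual sketch of hash-based exclusion. It is not Prodigy's implementation: the hash functions, the `exclude_seen` helper, and the sample data are all hypothetical, and they only illustrate the difference between excluding by input (the same text) and excluding by task (the same text with the same question).

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, Set


def input_hash(task: Dict) -> str:
    """Hash only the raw input (here, the text)."""
    return hashlib.sha256(task["text"].encode("utf-8")).hexdigest()


def task_hash(task: Dict) -> str:
    """Hash the input plus the question asked about it (here, the label)."""
    payload = json.dumps(
        {"text": task["text"], "label": task.get("label")}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def exclude_seen(stream: Iterable[Dict], seen: Set[str], by: str = "task") -> Iterator[Dict]:
    """Skip tasks whose hash is already in `seen`."""
    hash_fn = task_hash if by == "task" else input_hash
    for task in stream:
        if hash_fn(task) not in seen:
            yield task


annotated = [{"text": "a", "label": "POS"}]
stream = [{"text": "a", "label": "NEG"}, {"text": "b", "label": "POS"}]

by_task = list(exclude_seen(stream, {task_hash(t) for t in annotated}, by="task"))
by_input = list(exclude_seen(stream, {input_hash(t) for t in annotated}, by="input"))
```

With these made-up examples, excluding by task keeps both stream items (the text "a" is asked about with a different label), while excluding by input keeps only "b".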

I suspect you're running into common challenges with multiple annotators. Many issues can occur when handling simultaneous annotators, e.g., if someone doesn't close their browser or save their work, or problems like work stealing:

That thread is detailed, but it's important because it raises several related issues and approaches (e.g., reducing your batch_size to 1, which then prevents users from going back; by default the number is 10).

Last, as an FYI, that post mentions an experimental branch that modifies how examples are served in Prodigy (e.g., moving to a feed/database instead of generators, changing the ORM). While you can continue using the current approach (streams/generators), sometime in the future we're going to implement changes aligned with the experimental branch for v2. I don't think moving to the experimental branch will help, but I simply want you to be aware of the work.

Hope this helps and let us know if you have further questions!
