Non-random batches across Annotators

Hi!

Using "feed_overlap", we want to try and ensure all our annotations are seen by two annotators. We can't guarantee this because it seems like Prodigy randomly selects batches (of say, size 10) to present to annotators across sessions. Therefore, unless we annotate the entire dataset, we don't know that the annotators are annotating the same portion (e.g. 50%) of the dataset.

Is there a way to overcome this using the Prodigy config files, or would I have to write a custom recipe with sorters? If so, what would be the simplest way of modifying the textcat.manual recipe? Thanks!

hi @vsocrates!

Why do you believe that Prodigy randomly sends batches? Was it just because of this example? That isn't Prodigy's default behavior; the example was a demo of how you can modify the order of your records (e.g., for active learning).

By default, for non-active-learning recipes (e.g., manual, correct, or review recipes), Prodigy's loaders send out examples in the order they are loaded (i.e., the order of the records in the .jsonl or .txt file). This post explains the details:

If you want to verify this for yourself, writing a custom recipe can help. For example, check out our recipes repo, which has additional recipes (and general versions of the built-in ones):

You can start from one of these and print to the console as examples are served (e.g., the row/index ID from the original data file), or add the row number to the "meta" key so it's shown in the UI.
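For instance, here's a minimal sketch of that idea. The recipe name `textcat.debug` and the label `RELEVANT` are made up for illustration, and it assumes Prodigy v1.x's recipe API:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "textcat.debug",  # hypothetical recipe name, just for this sketch
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the .jsonl source file", "positional", None, str),
)
def textcat_debug(dataset, source):
    def add_row_ids(stream):
        for i, eg in enumerate(stream):
            print(f"Serving row {i}")  # console check of the serving order
            meta = eg.get("meta", {})
            meta["row"] = i  # shown in the bottom-right corner of the UI
            eg["meta"] = meta
            eg["label"] = "RELEVANT"  # the classification UI expects a label
            yield eg

    stream = JSONL(source)  # yields examples in file order
    return {
        "dataset": dataset,
        "view_id": "classification",
        "stream": add_row_ids(stream),
    }
```

You'd run it with something like `prodigy textcat.debug my_dataset ./data.jsonl -F recipe.py` and watch the row numbers in both the console and the UI.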

Also, keep an eye on logging (e.g., set the environment variable PRODIGY_LOGGING=basic, or PRODIGY_LOGGING=verbose for more detail). It can help you see which examples are being served.

Do you have multiple annotators simultaneously hitting the same instance? Are you using named multi-user sessions?

There are a few posts that explain feed_overlap and mention issues like the one you're having:

If your goal is to "ensure all examples are seen by two annotators", one option could be to run two separate processes, each on a different port, with "force_stream_order": true in the config. This would work well if you have two annotators and can assign each of them their own unique URL/port.
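Here's a rough sketch of that setup. The dataset names, file path, and label are placeholders, and it assumes a Prodigy version (v1.10+) where `prodigy.serve()` takes a command string plus config overrides:

```python
import prodigy

# Process for annotator 1 -- run this in its own terminal/process
# and point annotator 1 at http://localhost:8080
prodigy.serve(
    "textcat.manual annotator1_dataset ./data.jsonl --label MY_LABEL",
    port=8080,
    force_stream_order=True,  # send batches in order and re-send until answered
)
```

Then start a second, identical process for annotator 2 with `annotator2_dataset` and `port=8081`. Because each process reads the same source file in order and forces the stream order, both annotators should move through the same examples in the same sequence.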

Also, I remember this post where a community member had an interesting workflow:

The key there is how hashing and exclusion (e.g., the "exclude_by" setting) can be used to filter out duplicates.
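As a sketch of the general pattern (not that member's exact workflow):

```python
from prodigy import set_hashes

def filter_duplicates(stream):
    """Skip examples whose input has already been seen in this stream."""
    seen = set()
    for eg in stream:
        # set_hashes adds "_input_hash" (based on the input, e.g., the text)
        # and "_task_hash" (input plus annotation-relevant keys like the label)
        eg = set_hashes(eg)
        if eg["_input_hash"] not in seen:
            seen.add(eg["_input_hash"])
            yield eg
```

In a recipe, you can also set "exclude_by": "input" in the returned "config" dict so that examples already annotated in the dataset are excluded by their input hash rather than the full task hash.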

I suspect what may be happening is that you're running into common challenges with multiple annotators. There are many issues that can occur when handling simultaneous annotators, e.g., "work stealing" when someone doesn't close their browser or save their work:

That thread is detailed, but it's important because it raises several related issues and approaches (e.g., reducing your batch_size to 1 can help, but it prevents users from going back to previous answers; the default batch size is 10).

Last, as an FYI: in that post there is an experimental branch that modifies how examples are served in Prodigy (e.g., moving from generators to a feed/database, changing the ORM). While you can continue using the current approach (streams/generators), sometime in the future we're going to implement changes aligned with the experimental branch for v2. I don't think moving to the experimental branch will help here, but I simply want you to be aware of that work.

Hope this helps and let us know if you have further questions!