Training with multiple annotators

I’ve recently started using Prodigy and am trying to grasp the workflow around having multiple annotators. I understand that you can have multiple sessions using ‘?session=NAME’.

After I get all my annotations, which of them will ner.batch-train use? It's possible that two different annotators labeled the same example differently… which of them would the model use?

Also, if I don’t specify the “?session=NAME” piece, will it treat multiple annotators as one session?

It looks like it does - but both "default" sessions still seem to see the same examples.

Hi! The ?session marker lets you explicitly name the user sessions if you want to do everything within one Python process. However, you can also just start multiple processes on different ports and have your annotations add to separate datasets. This is often cleaner and makes it easier to compare the annotations later on.
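For example, a minimal sketch of the multi-process approach (the ports, dataset names, input file and label are all illustrative):

```shell
# Two independent Prodigy processes, each with its own port and dataset
PRODIGY_PORT=8080 python -m prodigy ner.manual annotations_alice blank:en news.jsonl --label ORG
PRODIGY_PORT=8081 python -m prodigy ner.manual annotations_bob blank:en news.jsonl --label ORG
```

Each annotator then opens the port they were assigned, and their work lands in a separate dataset that you can compare later.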

Yes, if you don't name the session, all annotations will be added to one default session.

This is something that Prodigy can't decide for you – that's something you have to decide :slightly_smiling_face: If you train a model on conflicting annotations, it will typically ignore them, because there's no valid gold-standard annotation that the model can learn from.

If you need to reconcile annotations from different annotators that may be conflicting, check out the new review recipe. It lets you load in one or more datasets and will group all annotations on the same input text together. You can then see who annotated what and where the conflicts are – and create one correct "master annotation". See here for a little video that shows the process in action:

https://twitter.com/_inesmontani/status/1130585864030052354
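The command for the review workflow looks roughly like this (the dataset names here are illustrative):

```shell
# Group annotations from two annotator datasets by input text and
# create one reviewed "master" dataset from them
python -m prodigy review ner_reviewed annotations_alice,annotations_bob
```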

Thanks! That is a useful recipe.

I am still confused about whether different annotators see the same question (all assuming I have one Prodigy process). I am thinking of the following scenarios:

  1. Multiple people open the default session. Will see all the same tasks UNLESS ‘feed_overlap’ is set to false.

  2. Two named sessions are created: Will see all the same tasks UNLESS ‘feed_overlap’ is set to false.

  3. Are these behaviors consistent across teach and mark/manual recipes?

My practical goal right now is to have a set-up where I can have unlimited annotators share the work (I am asking people on my team to help out when they have time). I am guessing it makes sense to direct everyone to the default session - but I want to make sure that people are not doing duplicate work.

(I would like to understand the other scenarios for future reference).

Thanks

If multiple people access the same default session, they’ll all get different examples – the next batch in the stream. That’s because Prodigy doesn’t know who they are and treats them all as “the same person”. So whenever a request for new questions comes in, it’ll send the next batch that’s available.

Some things to consider here:

  • Whenever someone accesses the app (or reloads the page), they'll get a new batch. Prodigy can't know that a batch it sent out for annotation isn't "coming back". Maybe someone is working on it and taking a long time, maybe their internet connection died, and so on. This is typically difficult to work around. So you might want to implement an "infinite stream" that periodically checks the database and sends examples out again if they're not in the dataset yet. This also gives you much more fine-grained control over what's sent out when. I've explained an approach for this step-by-step in my comment here.
  • If you’re planning on using active learning-powered recipes like ner.teach that update a model in the loop, the process may not be as effective if multiple people are annotating and updating the model. In the best case scenario, they’ll all make similar decisions and move the model in the same direction. In the worst case, they try to move the model in different directions and as a result, make it suggest worse annotations.
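A minimal sketch of such an "infinite stream" generator. Here `get_annotated_hashes` is a stand-in for a database lookup (e.g. fetching the task hashes already saved to the dataset); the function and field names are illustrative, not a fixed API:

```python
def infinite_stream(source_examples, get_annotated_hashes):
    """Keep looping over the source and re-send any example whose
    task hash isn't in the dataset yet. Stops once everything is
    annotated. `get_annotated_hashes` is a callable standing in for
    a database query (an assumption for this sketch)."""
    while True:
        annotated = set(get_annotated_hashes())
        remaining = [eg for eg in source_examples
                     if eg["_task_hash"] not in annotated]
        if not remaining:
            break  # everything is in the dataset; stop serving
        for eg in remaining:
            yield eg
```

Because the annotated hashes are re-checked on every pass, a batch that was sent out but never saved simply gets served again on the next loop.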

(Btw, quick heads-up if you’re working with the feed_overlap setting: There’s currently a known issue that tends to occur with short streams and causes subsequent sessions to not see examples if the previous session already completed the stream. If you’re hitting that, see here for details and a workaround. We’ll be fixing that in the next version.)
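For reference, feed_overlap is set in your prodigy.json. A minimal example that makes every named session see every example (set it to false to split the stream between sessions instead):

```json
{
  "feed_overlap": true
}
```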

Thanks for answering my questions!
This should be enough to get me started :slight_smile:


Hi!
I have a setup where I have two groups of annotators, and I'd like each group to work on a separate set of examples. Is there any way I can assign each batch (1000 sentences) to a group of annotators but collect answers on the same database?

For example:
annotator (X,y) -> Batch 1 -> Db1
annotator (X,y) -> Batch 2 -> Db1

Thank you.

hi @miladrogha!

If you know which examples you want each annotator to work on, is there a reason why you couldn't (before running Prodigy) create two separate files: batch1.jsonl and batch2.jsonl?
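One way to create those two files is a small script. A sketch, assuming the source is a JSONL file with one example per line (the file names and batch size are illustrative):

```python
import json

def split_jsonl(source_path, out1, out2, batch_size=1000):
    """Write the first batch_size examples of a JSONL source to out1
    and the next batch_size examples to out2."""
    with open(source_path, encoding="utf8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    batches = (examples[:batch_size], examples[batch_size:batch_size * 2])
    for path, batch in zip((out1, out2), batches):
        with open(path, "w", encoding="utf8") as out:
            for eg in batch:
                out.write(json.dumps(eg) + "\n")
```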

Also - do you absolutely need to run them simultaneously? Let's say you want to run ner.manual. The simplest approach would be to run:

python -m prodigy ner.manual ner_dataset blank:en batch1.jsonl --label label1

Annotate and close the server. Then run:

python -m prodigy ner.manual ner_dataset blank:en batch2.jsonl --label label1

If you do need to run simultaneously, you could assign each a different port:

PRODIGY_PORT=8081 python -m prodigy ner.manual ner_dataset blank:en batch1.jsonl --label label1
PRODIGY_PORT=8082 python -m prodigy ner.manual ner_dataset blank:en batch2.jsonl --label label1

I put these on different ports but you could use one of these on the default port.

One thing to note - be careful that your dataset doesn't have duplicates across the two example sets. Also, if each of the two batches has multiple annotators, be sure to use unique session names, ideally created at the start.

Alternatively - to be even safer - I would recommend saving the annotations (at first) to two separate datasets. The reasoning is to avoid any edge case where both processes write to the database at the same time. Then, if you want everything in the same dataset, you can run db-merge batch1,batch2 to combine the two datasets into one.
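With the dataset names above (batch1 and batch2; the output name is illustrative), the merge step would look like:

```shell
# Combine the two annotation datasets into one
python -m prodigy db-merge batch1,batch2 combined_dataset
```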

Also, FYI: the post above is from 2019. While at first glance a lot of its content still holds, there have been some big changes to Prodigy since then, most notably our recent release of Task Routing. I'm not saying you need custom task routers here; I mention this for other readers to make them aware of the more advanced ways we've developed to handle task routing.


Thank you very much for your response @ryanwesslen. Absolutely helpful!

I wanted to see if there is an option to have two sessions simultaneously to save time on annotation tasks.

Also, I have Prodigy on Heroku, so having two separate ports becomes challenging (I have two different workers, I guess... )

Considering all the options, I think I will go with two separate sessions, one for each group of annotators (with specific session names).

Cheers!