Enabling more than one person to access Prodigy for content and image classification

usage
textcat
(Byron James) #1

Hi,

I’d like to allow five people on the team to access the training interface from a central server on our network. I’d prefer to add the interface to the content management workflow for the material we’re working with on a daily basis, so that content can be categorised as it’s added to the system. There may be a better way to do this, and I’m open to suggestions, so do let me know.

Many thanks.

Byron


(Ines Montani) #2

Could you share some more details on what you imagine this workflow looking like? And roughly how much content is added per day?

In general, there shouldn’t be a problem with having multiple people access the Prodigy app. The new named multi-user sessions also let you assign the annotations to specific users, and choose whether each example should be annotated once in total or once by each person.
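Roughly, the multi-user setup could look like this (a minimal sketch, assuming the `feed_overlap` config setting and the `?session=` URL parameter from the named sessions; the port, user names and file path are just placeholders):

```python
import json

# Minimal prodigy.json sketch: "feed_overlap" controls whether every
# annotator sees every example (True) or each example is sent out only
# once in total across all sessions (False).
config = {
    "port": 8080,
    "feed_overlap": True,
}

with open("prodigy.json", "w", encoding="utf8") as f:
    json.dump(config, f, indent=2)

# Each annotator then opens the app with their own session name in the
# URL, and their annotations are stored under that session, e.g.
# http://your-server:8080/?session=byron
# http://your-server:8080/?session=annotator2
```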


(Byron James) #3

We’re re-developing a system that has about 10,000 text records and images, typically one image per record, so about 10,000 images. We could assign someone to tag/categorise them manually, but there’s an opportunity to train a system to do it, albeit with ‘some’ level of human oversight.

There are two phases:

1. Use the existing data to train both the text and the image classification. 10,000 records may be overkill, but it’s always better to have more than less.

2. Use the trained system to handle the classification of images (and text) on the revised system. I’m honestly not sure what the volume has been in the past; I haven’t looked at the database to check timestamps on the data (we just got it), but I’d guess 5 to 10 a day, based on similarities with other systems.

The concern is accuracy and consistency, which is where Prodigy and a trained model come in. It ‘might’ look like overreach, but people are unreliable - and mushy.

B


(Ines Montani) #4

Thanks for sharing more details!

It sounds like for your use case, having a solid experiment workflow will be very valuable: you want to monitor how your model is improving, intervene if it isn’t, and have incremental, reproducible steps that build up your dataset.

One idea for a workflow could be: every day or week, you automatically export the new additions from your CMS and save them in an easy-to-read format, e.g. Prodigy’s JSONL. For each annotator, you then (automatically or manually) start up an instance of Prodigy on a separate port with the data and set it up to save to a separate dataset, like week32_annotator1. In the beginning, you may want to ask the annotators to label the data by hand and then use that to train the model. Later on, you could slowly transition to a workflow where the annotators review the model’s predictions. In that case, you’d run your model over the data, add the label predicted by the model and then use a binary recipe that lets your annotators accept or reject it (which should be super fast, too).
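To make that a bit more concrete, here’s a rough sketch of the export-and-launch step in Python. The file names, dataset names, recipe, labels and the `PRODIGY_PORT` override are placeholders and assumptions, so swap in whatever fits your setup:

```python
import json
import os
import subprocess

# 1) Export this week's additions from the CMS as JSONL, one record per
#    line. The records and field names here are made up for illustration.
records = [
    {"text": "Example article body ...", "meta": {"cms_id": 123}},
    {"text": "Another article body ...", "meta": {"cms_id": 124}},
]
with open("week32.jsonl", "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# 2) Start one Prodigy instance per annotator, each on its own port and
#    saving to its own dataset (week32_annotator1 etc.). Use whichever
#    recipe and labels fit your task.
for i, name in enumerate(["annotator1", "annotator2", "annotator3"]):
    env = dict(os.environ, PRODIGY_PORT=str(8080 + i))
    subprocess.Popen(
        ["prodigy", "textcat.manual", f"week32_{name}", "week32.jsonl",
         "--label", "LABEL_A,LABEL_B"],
        env=env,
    )
```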

Once the annotation is done (e.g. once every incoming example is in each annotator’s dataset), you can get the data and run some metrics over it. Prodigy assigns hashes to each example, which makes it easy to find identical examples across datasets. For example, you might want to check whether the annotators agree and which examples are “controversial”.
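For instance, here’s a rough sketch of an agreement check, assuming the per-annotator datasets from above and the `_task_hash` and `answer` fields that Prodigy saves with each annotation:

```python
from collections import defaultdict

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json

# Group each annotator's decision by the task hash Prodigy assigns, so
# identical questions line up across the per-annotator datasets.
decisions = defaultdict(dict)
for dataset in ["week32_annotator1", "week32_annotator2"]:
    for eg in db.get_dataset(dataset):
        decisions[eg["_task_hash"]][dataset] = eg["answer"]

# Count agreement on examples that both annotators have answered.
shared = {h: a for h, a in decisions.items() if len(a) > 1}
agree = sum(1 for answers in shared.values() if len(set(answers.values())) == 1)
print(f"Agreement on {agree} of {len(shared)} shared examples")
```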

This is especially important in the beginning, while you’re still figuring out your process and label scheme. Maybe it turns out that one label is particularly difficult to assign, so you might decide to revise the label scheme or provide better annotation guidelines. That kind of stuff always sounds simple and trivial, but it’s actually one of the biggest bottlenecks we’ve seen, along with making reasonable choices about what to train.
