✨ Prodigy Annotation Manager Update // Prodigy Scale // Prodigy Teams

Many of you have been asking about how to scale your Prodigy projects – including how to manage more annotators, how to keep multiple feeds running, and how to make sure your annotators stay in agreement. These are challenging problems that are outside of the scope of the Prodigy software itself, so we've always planned an extension product that helps you use Prodigy in larger products. We've been quietly working away on this for some time, so we're excited to finally announce our progress! The new product, Prodigy Teams, is almost ready for the first beta testers. Here's what you can expect.

We started really focusing on Prodigy Teams around April this year. To ship it to you as quickly as possible without neglecting spaCy, Prodigy and our other projects, we've teamed up with @JustinDuJardin, who some of you might already know from the Prodigy forum and community. Justin was actually one of the very first people to order Prodigy when it went on sale in December. He brings a well-rounded mix of experience developing reliable systems from game engines to web applications.

Justin has spent the past two weeks with us in Berlin, working on Prodigy Teams.

Our first plan for Prodigy Teams was a small add-on you could install alongside the main app, maybe with a service layer to create shareable URLs. But as we've continued to welcome more users to Prodigy, we saw that a simple add-on wouldn't be able to solve enough of the problem. Each Prodigy task is a web service, so scaling Prodigy to a large team of annotators means keeping lots of little services running, managing authentication to restrict access to sensitive data, running periodic quality control, and reconciling conflicting annotations to create a more reliable gold standard. A small add-on could solve one or two of these problems, but solving them all together required a different approach.
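To make the reconciliation step a bit more concrete, here is a minimal sketch of majority voting across annotators. It is only an illustration, not how Prodigy Teams does it: the record keys (`task_hash`, `annotator`, `answer`), the function name `reconcile` and the `min_agreement` threshold are all assumptions made up for this example.

```python
from collections import Counter, defaultdict


def reconcile(annotations, min_agreement=0.5):
    """Group answers by task and keep the majority answer.

    `annotations` is an iterable of dicts with the hypothetical keys
    "task_hash", "annotator" and "answer". Returns (gold, conflicts):
    `gold` maps a task hash to its winning answer, and `conflicts`
    lists the task hashes without a clear majority.
    """
    by_task = defaultdict(list)
    for ann in annotations:
        by_task[ann["task_hash"]].append(ann["answer"])

    gold, conflicts = {}, []
    for task_hash, answers in by_task.items():
        answer, count = Counter(answers).most_common(1)[0]
        # Require strictly more than `min_agreement` of the votes.
        if count / len(answers) > min_agreement:
            gold[task_hash] = answer
        else:
            conflicts.append(task_hash)
    return gold, conflicts


annotations = [
    {"task_hash": 1, "annotator": "a", "answer": "accept"},
    {"task_hash": 1, "annotator": "b", "answer": "accept"},
    {"task_hash": 1, "annotator": "c", "answer": "reject"},
    {"task_hash": 2, "annotator": "a", "answer": "accept"},
    {"task_hash": 2, "annotator": "b", "answer": "reject"},
]
gold, conflicts = reconcile(annotations)
```

In this toy data, task 1 has a two-to-one majority and ends up in `gold`, while task 2 is a tie and lands in `conflicts` for someone to review. A real setup would probably also want to weight annotators or track agreement over time.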

One thing we knew we couldn't compromise on was data privacy. Data worth labelling is usually data that can't be shared, which is why privacy has always been a key feature of Prodigy – your data never needs to leave your server. But how could we build the same guarantee into Prodigy Teams, without making the app difficult to install and difficult to use? To solve this problem, we've split the application into two parts. The part that needs to read your data runs entirely on your infrastructure, so once again your text, images or other source data never needs to leave your servers.


Disclaimer: This flowchart is an early draft I created, based on Frederique's illustrations.

To pull this off, we've made use of the excellent open-source tools provided by HashiCorp, specifically Nomad and Consul. Prodigy Teams gives you a simple web interface to configure new annotation tasks, administer annotation projects and trigger training runs. Everything that touches your texts, images and other data runs on your own infrastructure, so privacy-sensitive information never leaves your control. The web app is loaded from our servers, so it's always available and always up to date – but your annotators send and receive tasks directly to and from Prodigy, running on your cluster.


FAQ

Do I have to upload my data to your servers?

No. Your data stays on your machines and your annotators will communicate directly with your cluster. Our servers only provide the web application and don't store any text, images or other privacy-sensitive information.

What data do you store on your servers?

We'll only store the minimum data needed to provide the service. This includes user accounts (with authentication provided by Auth0), metadata of projects you create in the app (ID, description, settings), as well as general stats related to projects, tasks and annotators. For statistics related to annotations, we'll only store the task hash and the answer, not the original data.
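As a rough illustration of what "only the task hash and the answer" can look like, here is a small sketch using the Python standard library. The hashing scheme (SHA-256 over the task serialized as sorted JSON) and the names `task_hash` and `to_stat_record` are assumptions for this example, not the hashing Prodigy or Prodigy Teams actually uses.

```python
import hashlib
import json


def task_hash(task):
    """Hash a task dict in a way that doesn't depend on key order.

    Assumption: SHA-256 over the sorted JSON form, chosen for illustration.
    """
    payload = json.dumps(task, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def to_stat_record(task, answer):
    """Build the record that would leave the cluster: hash and answer only."""
    return {"task_hash": task_hash(task), "answer": answer}


task = {"text": "Patient reports mild headache.", "label": "SYMPTOM"}
record = to_stat_record(task, "accept")
```

The resulting `record` has just two keys, `task_hash` and `answer`, so the original text never appears in it. The same task always produces the same hash, which is enough to count answers per task and compare annotators.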

What do I need to run the cluster?

A fully automated installer will be provided, making the cluster easy to configure. For the alpha and beta, we'll first be focusing on the installer for Google Compute Engine. All resources will be run on your own account. We're also working on a standalone Docker image that you can run on any server.

I already use Prodigy – what does this mean for me?

Prodigy Teams will make it easy to use your current workflows, and build even more complex pipelines interactively. You'll be able to import your existing datasets and models, and export data from Prodigy Teams to use with Prodigy or other tools.

Prodigy will still be available as a downloadable developer tool alongside Prodigy Teams. In fact, the Prodigy Teams client running on your cluster will also use the prodigy library under the hood. This means that new interfaces and bug fixes will always be shipped to all users, no matter if you're using Prodigy Solo or Prodigy Teams.

How much will it cost?

Naturally, we want the pricing to scale with the size of your team. We're targeting pricing at around 2-5% of data science salary spending.


Apply for the alpha and beta

If you're interested in testing Prodigy Teams before its official release, you can fill in this form and help us plan our private alpha and beta testing. We are looking for a small number of testers who we can interact with closely, so please understand that we can't accept everyone (even though we'd love to).

Thanks for your support!
Ines, Matt & Justin



This is exciting 👍!

Our license renewal period is coming up soon, and we are trying to decide whether to have both Prodigy and Prodigy Scale. There haven’t been any updates to Prodigy since October, and most of Prodigy’s limitations (namely session/dataset management, etc.) seem to be solved by Prodigy Scale.

Couple Qs:

Is there a discount for current Prodigy users when Prodigy Scale gets released?

What would be the benefits of having both Prodigy and Prodigy Scale? Let’s say we have custom recipes, can they be run on Prodigy Scale? Or maybe it can be run on Prodigy, but we can still use Prodigy Scale for dataset management?

Thanks


It has been 7 months since the last announcement, so I’m wondering what the current status of the Annotation Manager and Prodigy Scale is.

Thanks for your patience on this, and sorry for the silence. I’m sure you can all sympathise that giving estimates is hard. It took a little longer than we expected to get spaCy v2.1 finished, which meant Ines and I had a little less time to devote to Prodigy Scale. In addition to us, we now have two other developers working on it full time.

The other thing that’s made this look like it’s taking longer is that we decided to make the initial beta very small. In the first beta, all users are working on our cluster and sending their annotations to the same database. This is a very different configuration from how we’ll have things at launch, where everyone will be running their own cluster, launched on their own infrastructure. We decided that testing a setup where lots of people share the same cluster would require some specific engineering, and since that setup isn’t actually important for real use cases, we decided to avoid getting side-tracked.

We also wanted to avoid launching a beta that was too different from how the app will really look and feel. Supporting beta users is time-consuming, and makes it harder to change things quickly. We’re also conscious of the fact that we only get one first impression, so we want to make sure we get the most useful feedback we can. As long as there are a lot of things we still want to do, we want to keep the beta very small.

@plusepsilon Sorry I’m so late to this! I missed your questions earlier somehow.

On a discount for current Prodigy users: possibly, but possibly not. We haven’t finalised the pricing details yet. Sorry I can’t be more specific on this!

You’ll be able to run custom recipes on Prodigy Scale, yes.

Personally, I’ll still be using Prodigy for a lot of development tasks, even with Prodigy Scale. If I’m going to do a few hours of labelling to answer some question to my own satisfaction, it’s very useful to have the server and the scripts all local. For instance, I might be doing some error analysis, or one-off data cleaning, or checking whether some idea is viable. This week I even made a little spaced-repetition recipe to help me study German: https://gist.github.com/honnibal/5c54ad3ddbb0504c0b25319b53d4c681

You might also find the stand-alone version of Prodigy useful when developing recipes for Prodigy Scale.


Hi,

I was wondering when we can expect updates on the Annotation Manager, and on Prodigy Scale in general?

It is very exciting indeed.

In my team, we have created our own proprietary tool for human labeling of entities and dependencies. It runs in a GCP environment and, honestly, there is currently no equivalent on the market. But it cost us a lot of money to achieve this result.

I am also running my own projects in parallel. For copyright reasons, and because I prefer to compartmentalize my work, I have just bought a Prodigy license for that purpose. I am very pleased to see that Prodigy Scale is coming soon.

It will be very interesting to use it and to compare it with our powerful but expensive proprietary tool.

Good luck with the last QC 🙂

Hi Ines and Honnibal,

I have worked with Prodigy on an NLP project before, and since we only had one annotator, we were fine using a single instance of Prodigy. However, I'm currently working with a computer vision startup, and I think we can really take advantage of Prodigy's features for classifying our images. But the number of images is in the hundreds of thousands, and without an annotation manager solution, it would be very difficult to handle this job. As you can imagine, this could be make-or-break for whether we work with Prodigy or another annotation tool.
1. Could you please share your timeline for the Prodigy Scale launch?
2. Are you still accepting beta customers?
Thanks!