Many of you have been asking about how to scale your Prodigy projects – including how to manage more annotators, how to keep multiple feeds running, and how to make sure your annotators stay in agreement. These are challenging problems that sit outside the scope of the Prodigy software itself, so we’ve always planned an extension product that helps you use Prodigy in larger projects. We’ve been quietly working away on this for some time, and we’re excited to finally announce our progress! The new product, Prodigy Scale, is almost ready for its first beta testers. Here’s what you can expect.
We started really focusing on Prodigy Scale around April this year. To ship it to you as quickly as possible without neglecting spaCy, Prodigy and our other projects, we’ve teamed up with @JustinDuJardin, who some of you might already know from the Prodigy forum and community. Justin was actually one of the very first people to order Prodigy when it went on sale in December. He brings a well-rounded mix of experience developing reliable systems from game engines to web applications.
Justin has spent the past two weeks with us in Berlin – here’s us after a productive day of work on Prodigy Scale:
Our first plan for Prodigy Scale was a small add-on you could install alongside the main app, maybe with a service layer to create shareable URLs. But as we welcomed more users to Prodigy, we saw that a simple add-on wouldn’t solve enough of the problem. Each Prodigy task is a web service, so scaling Prodigy to a large team of annotators means keeping lots of little services running, managing authentication to restrict access to sensitive data, running periodic quality control, and reconciling conflicting annotations to create a more reliable gold standard. A small add-on could solve one or two of these problems, but solving them all together required a different approach.
One thing we knew we couldn’t compromise on was data privacy. Data worth labelling is usually data that can’t be shared, which is why privacy has always been a key feature of Prodigy – your data never needs to leave your server. But how could we build the same guarantee into Prodigy Scale, without making the app difficult to install and difficult to use? To solve this problem, we’ve split the application into two parts. The part that needs to read your data runs entirely on your infrastructure, so once again your text, images or other source data never needs to leave your servers.
Disclaimer: This flowchart is an early draft I created, based on Frederique’s illustrations.
To pull this off, we’ve made use of the excellent open-source tools provided by HashiCorp, specifically Nomad and Consul. Prodigy Scale gives you a simple web interface to configure new annotation tasks, administer annotation projects and trigger training runs. Everything that touches your texts, images and other data runs on your own infrastructure, so privacy-sensitive information never leaves your control. The web app is loaded from our servers, so it’s always available and always up-to-date – but your annotators send and receive tasks directly to and from Prodigy, running on your cluster.
Do I have to upload my data to your servers?
No. Your data stays on your machines and your annotators will communicate directly with your cluster. Our servers only provide the web application and don’t store any text, images or other privacy-sensitive information.
What data do you store on your servers?
We’ll only store the minimum data needed to provide the service. This includes user accounts (with authentication provided by Auth0), metadata of projects you create in the app (ID, description, settings), as well as general stats related to projects, tasks and annotators. For statistics related to annotations, we’ll only store the task hash and the answer, not the original data.
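To make that split concrete, here’s a minimal sketch in plain Python of the difference between a full annotation task and the statistics record described above. The `_input_hash`, `_task_hash` and `answer` keys follow Prodigy’s task format; the example text and hash values are made up for illustration:

```python
# A full Prodigy annotation task lives on *your* cluster.
# The "text" field is the privacy-sensitive part that never leaves it.
full_task = {
    "text": "Apple updated its privacy policy yesterday.",
    "label": "ORG",
    "_input_hash": 1839188237,   # hash of the input data (example value)
    "_task_hash": -1971285426,   # hash of input + annotation question (example value)
    "answer": "accept",
}

def stats_record(task):
    """Reduce a task to the minimum needed for annotation statistics:
    the task hash and the answer, without the original data."""
    return {"_task_hash": task["_task_hash"], "answer": task["answer"]}

record = stats_record(full_task)
print(record)
# → {'_task_hash': -1971285426, 'answer': 'accept'}
assert "text" not in record  # the original data is never part of the record
```

Because the hash is derived from the task rather than stored alongside it, it’s enough to count, deduplicate and compare answers across annotators without ever seeing the underlying text or image.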
What do I need to run the cluster?
A fully automated installer will be provided, making the cluster easy to configure. For the alpha and beta, we’ll first be focusing on the installer for Google Compute Engine. All resources will be run on your own account. We’re also working on a standalone Docker image that you can run on any server.
I already use Prodigy – what does this mean for me?
Prodigy Scale will make it easy to use your current workflows, and build even more complex pipelines interactively. You’ll be able to import your existing datasets and models, and export data from Prodigy Scale to use with Prodigy or other tools.
Prodigy will still be available as a downloadable developer tool alongside Prodigy Scale. In fact, the Prodigy Scale client running on your cluster will also use the prodigy library under the hood. This means that new interfaces and bug fixes will always be shipped to all users, whether you’re using Prodigy Solo or Prodigy Scale.
How much will it cost?
Naturally, we want the pricing to scale with the size of your team. We’re targeting pricing at around 2–5% of your data science salary spending.
Apply for the alpha and beta
If you’re interested in testing Prodigy Scale before its official release, you can fill in this form and help us plan our private alpha and beta testing. We are looking for a small number of testers who we can interact with closely, so please understand that we can’t accept everyone (even though we’d love to).
Thanks for your support!
Ines, Matt & Justin