Firstly, awesome job. I am really impressed. This is the first labelling tool that actually empowers its users, who are most likely skilled software / AI engineers themselves. I am fed up with the lack of flexibility in the SaaS solutions out there.
I am wondering what would be involved in setting up a pipeline that takes a list of images for classification (in this case two classes, or maybe a third "bad image" class) and has multiple labellers go through the same set, so I can collect metrics on consensus. I am facing a difficult classification task, and I need to rapidly build consensus on my large dataset so I can see whether there is sufficient signal in the data, or whether certain tough decision-boundary images should be left out of the model until I can train it accurately on simpler data.
Ideally I would love to deploy to Google Cloud, back this with a database, and point the task at a bucket of images, then invite a bunch of labellers with a link. Afterwards I can download the data and run some queries to pick out certain groups of the data for further processing.
I understand some parts might need to be coded myself, but I am trying to figure out which parts of the above you anticipate are mostly out of the box or available via configuration, and which I would need to build myself. I was pretty close to just building this myself, but after seeing the awesome work you guys have done, I get the feeling this might help me bootstrap things really quickly.
Thanks, looking forward to hearing back from you!
Thanks for the kind words, and I'm glad the concept of the tool resonates with you! We hate having to "program" in YAML or by clicking through a web interface, so we wanted to make a tool where scripting was front and centre.
Focusing on the deployment aspects first:
Prodigy will be exactly as easy to deploy this way as a "hello world" Flask app. Prodigy itself just hosts the REST API, which also serves the single-page app. So whatever steps you'd normally take (e.g. adding a reverse proxy, adding a domain name with HTTPS, etc.) apply here as well.
For the database, you can configure Prodigy to store annotations in any SQL database by providing config either via environment variables or in the prodigy.json file; you can also configure it from the recipe instead. We use the Peewee ORM, so it should be exactly the same sort of process as setting up DB connectivity for a normal app. The default is SQLite, so if you mount a persistent disk for the SQLite file, I think that would work fine too.
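As a rough sketch of what that config might look like for PostgreSQL (the database name, user, and host below are made up for illustration; check the Prodigy database docs for the exact keys your version expects):

```json
{
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "dbname": "prodigy",
            "user": "annotation_admin",
            "password": "change-me",
            "host": "10.0.0.5"
        }
    }
}
```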
You'll need to do the inter-annotator agreement calculations yourself, but otherwise I think everything should be basically built in. You'll probably also end up writing a couple of small custom recipe functions in Python for convenience. The data format is just JSONL. Your use case is very much the sort of thing we were thinking of, so I don't think you'll have much trouble.
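For the agreement part, here's a minimal sketch of what that could look like over the exported JSONL. It assumes each row carries Prodigy's `_input_hash`, `_session_id` and `answer` fields; double-check the field names against your own `db-out` output:

```python
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(rows):
    """Average pairwise agreement: fraction of annotator pairs
    that gave the same answer, pooled over all examples."""
    # Group answers by example, keyed by annotator session.
    by_example = defaultdict(dict)
    for row in rows:
        by_example[row["_input_hash"]][row["_session_id"]] = row["answer"]
    agree, total = 0, 0
    for answers in by_example.values():
        # Compare every pair of annotators who saw this example.
        for a, b in combinations(answers.values(), 2):
            total += 1
            agree += a == b
    return agree / total if total else 0.0

# Toy rows standing in for `prodigy db-out my_dataset` output:
rows = [
    {"_input_hash": 1, "_session_id": "imgs-alex", "answer": "accept"},
    {"_input_hash": 1, "_session_id": "imgs-jo", "answer": "accept"},
    {"_input_hash": 2, "_session_id": "imgs-alex", "answer": "accept"},
    {"_input_hash": 2, "_session_id": "imgs-jo", "answer": "reject"},
]
print(pairwise_agreement(rows))  # 0.5
```

From there it's easy to swap in something like Cohen's kappa, or to filter out the low-agreement examples you want to exclude from training.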
One feature you'll need is being able to identify which annotations were done by which users. This is referred to as "multi-user sessions" in Prodigy. You can give each annotator a version of the URL differentiated by query parameter. Here's the section in the docs on this:
This update was shipped in preparation for the upcoming Prodigy Scale, a full-featured, standalone application for large-scale multi-user annotation projects powered by Prodigy.
As of v1.7.0, Prodigy supports multiple named sessions within the same instance. This makes it easier to implement custom multi-user workflows and to control the data that's sent out to individual annotators.
To create a custom named session, add ?session=xxx to the annotation app URL. For example, annotator Alex may access a running Prodigy project via http://localhost:8080/?session=alex. Internally, this will request and send back annotations with a session identifier consisting of the current dataset name and the session ID – for example, ner_person-alex. Every time annotator Alex labels examples for this dataset, their annotations will be associated with this session identifier.
The "feed_overlap" setting in your prodigy.json or recipe config lets you configure how examples are sent out across multiple sessions. By default (true), each example in the dataset is sent out once per session, so you'll end up with overlapping annotations (e.g. one per example per annotator). Setting "feed_overlap" to false sends each example out once to whoever is available, so each example ends up labelled only once in total.
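Since you want overlapping annotations for your consensus metrics, the default already does what you need, but you can pin it explicitly in prodigy.json (a sketch; check the docs for your version):

```json
{
    "feed_overlap": true
}
```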
As of v1.8.0, the PRODIGY_ALLOWED_SESSIONS environment variable lets you define a comma-separated list of session names that are allowed to be set via the app. For instance, PRODIGY_ALLOWED_SESSIONS=alex,jo would only allow ?session=alex and ?session=jo, and any other value would raise an error.
Wow that was a comprehensive reply. Thank you for taking the time to answer my questions.
It looks like I could have something up and running in a day. I have already started writing some logic for calculating consensus statistics with Pandas so I could build that into the code somewhere.
Unfortunately, I am being told we need to keep using the overpriced LabelBox license we have for the moment, but hopefully in a few weeks I can look at getting this purchased and start developing a properly tailored pipeline for our needs.
Is there much in the way of people sharing modular open-source code for different tasks? It feels like this is the perfect project for a plugin architecture that lets people share labelling-task components.
Thanks again for the solid response, and I'm now actively thinking about how I can get this into our data pipeline ASAP!
No problem! Glad what I wrote was helpful. I'm sure other solutions have their benefits as well, especially ones which are more focussed on image tasks. At the moment some of the image functionality is a bit under-developed in Prodigy relative to text.
We've been lucky enough to have some users contribute recipes, yes. There's a repo for them here:
This is one of the reasons we've kept the prices lower than most professional software tools, and why we do support on the forum: we definitely learned from our open-source work on spaCy how helpful it is to have a larger community of users. It also means Prodigy is more stable, because lots of people are running it, so bugs get surfaced more quickly.