Deploying Prodigy behind AWS Lambda


I'm used to deploying my microservices behind AWS Lambda because of the scalability and cost-control it gives. I haven't been able to run Prodigy behind Lambda, and I'm going to explain why.

Why this matters

If you can't deploy Prodigy behind a FaaS, it greatly limits its scaling potential, at least for the web/inference parts of the app.
The training part could (arguably?) be better suited to some sort of cron-based PaaS (e.g. AWS Batch), although with the new 10 GB RAM / 6 vCPU Lambdas most "small" datasets should fit easily in Lambda as well.

There are "half-baked" solutions (such as AWS Fargate or equivalent PaaS offerings), but it seems the only trivial way to deploy Prodigy right now is in a serverful fashion (e.g. on AWS EC2 or ECS), which raises the question: what server size should I choose, and how will that scale with my usage and data growth?

A better solution would be the ability to deploy Prodigy in a serverless architecture.

What I tried

From what I understand, Prodigy pretty much has to run via its CLI: prodigy

I dug a bit into the files inside the wheel, and I see that on top of the FastAPI app (which could easily be served behind Lambda), you've defined some mandatory controller and config logic.

That means that grabbing the app (e.g. by importing app) and somehow running it behind something other than uvicorn will fail because the DB won't get initialized properly... shame, that seemed like a good try :slight_smile:

Is there a proven way to run the whole Prodigy app (or even just the REST API, which is what I mostly care about) behind a different Python web server than uvicorn?

Hi! That's not necessarily true – you can also start the web server from Python. However, it's true that Prodigy always needs to start a web server.

The reason this doesn't work is that a Prodigy process is inherently stateful and defined by the recipe script, which orchestrates the workflow and provides the stream, settings and lifecycle callbacks, plus whatever else you choose to run (machine learning libraries and models, etc.). So the recipe script needs to be in charge of putting together the state that you serve and of keeping that state, so you can update models in the loop.
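To make the statefulness concrete, here's a minimal sketch in plain Python (this is deliberately *not* the actual Prodigy recipe API — the function and variable names are mine): the stream generator and the update callback close over the same in-process state, which is why the same long-lived process has to serve both.

```python
# Sketch: why a recipe process is stateful. The stream (a generator) and the
# update callback share mutable state that lives only in this process.

def make_workflow(examples):
    state = {"seen": 0, "accepted": 0}  # in-memory state, lost if the process dies

    def stream():
        # Lazily produce examples; each one consumed advances shared state
        for eg in examples:
            state["seen"] += 1
            yield eg

    def update(answers):
        # Called with annotated batches; in a model-in-the-loop workflow,
        # this is where the model would be updated
        state["accepted"] += sum(1 for a in answers if a.get("answer") == "accept")

    return stream, update, state

stream, update, state = make_workflow([{"text": "a"}, {"text": "b"}])
batch = list(stream())                                  # annotator fetches a batch
update([{"answer": "accept"}, {"answer": "reject"}])    # annotator submits answers
```

A stateless Lambda invocation would lose `state` between the fetch and the submit, which is the core of the problem described above.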

The web app itself is pretty lightweight (a couple of static files) and served together with the back-end, so you're able to completely isolate every instance of Prodigy that you run. Pretty much everything important that's going on is happening on the server side.

Prodigy itself is a very lightweight Python app so the server and scalability questions pretty much all come down to the machine learning libraries and models you use. spaCy is pretty efficient here, but if you're using heavier models that require more memory, you'd have to provision your machine accordingly. All data you annotate can be read in as a stream, so you don't have to keep it all in memory and you can even provide it via an external storage or API if needed.

I wouldn't worry too much about data growth – you can plug in a MySQL or Postgres database, and even if you end up with millions of annotations (which is a lot), this is nothing out of the ordinary for a modern database. Even if you used a flat-file SQLite DB, you'd maybe be looking at 10 GB on disk.
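For reference, pointing Prodigy at PostgreSQL is a config change in prodigy.json along these lines (keys as I recall them from the Prodigy docs — check the documentation for your version, and the connection values here are placeholders):

```json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "host": "localhost",
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}
```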


I see your point for processes such as ner.teach: there's no way that holds in a serverless environment unless you could somehow persist the model in a bucket or Redis (as opposed to keeping it in memory).

For something like ner.manual, though, there doesn't seem to be much statefulness other than querying the DB like any other REST API: couldn't that run in a Lambda?

The reason I do worry is that my initial intention was to expose some of the labeling recipes to a B2C app, not just as an internal, staff-only task. I was planning on using the Prodigy REST API for that and taking advantage of ner.manual or even ner.teach to process the labeling tasks effectively.

Given that I can't reliably scale that server, if hundreds of customers show up on my website and start labeling their stuff, maintaining high availability is going to become a nightmare :sob:

Perhaps I should keep the Prodigy stuff "staff-only", and just expose a simple <select> to my users to label their stuff directly into my API, which is indeed serverless... but then I lose some of the cool features that Prodigy offers.

That's pretty much where I'm at :sweat_smile:

What do you think ?

No, that's not really true – Prodigy recipes are Python functions that orchestrate the annotation workflow. For instance, ner.manual uses a spaCy model to preprocess the text, auto-suggest labels and match patterns in the incoming stream. Streams are Python generators and processed in batches, so the Python process keeps working to queue up new examples. This is where pretty much everything happens – the web app is just a lightweight static layer on top.
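The batched-generator pattern mentioned above can be sketched in a few lines of plain Python (the `batched` helper is mine, for illustration — not a Prodigy function): because the stream is lazy, the source can be a file, a DB cursor or an external API, and only one batch ever needs to be materialized at a time.

```python
from itertools import islice

def batched(stream, size):
    """Yield lists of up to `size` items from a (possibly very large) generator."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# The source is a generator, so the full dataset never sits in memory at once
texts = ({"text": f"example {i}"} for i in range(10))
for batch in batched(texts, 4):
    pass  # each batch would be queued up for annotators here
```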

Prodigy also needs to be able to consider the recipe script a black box because anything could be happening in there. Machine learning models, preprocessing, conditional logic that decides how to configure the task, label schemes and UI, and so on. Under the hood, custom recipes work just like the built-in workflows.

This type of use case is something that wouldn't be covered by the Prodigy license terms. You can make Prodigy available to your team or annotators you work with, but you couldn't host it for everyone on the internet, or as part of a commercial product.