Is it possible for me to control the entire active learning loop?

I am experimenting with writing my own active learning loop and integrating text annotation with user-initiated search from an interface like Kibana. Basically I want to customize almost all of your functionality, except I still want to use the existing Prodigy web UI because that’s really slick.

I know Prodigy has been designed to be very pluggable, so I think this is possible, but I’m not sure how to do it. In particular, the recipes appear to assume that all the data to be annotated exists in a single corpus whose contents are known in advance and through which Prodigy makes a single pass, whereas I’m imagining something more random-access and user-driven.

Is something like this possible? Maybe by using the API interface to corpora instead of JSONL files?

This is not necessarily true. Most use cases definitely involve loading data from a file or a single source, because that's the most common way people go about annotating their data. But in the end, the stream is just a Python generator that Prodigy keeps requesting batches of tasks from. How that batch is composed is up to you – so you could easily implement your own logic that takes previous user decisions into account, randomly adds data from different sources or uses other factors to determine what to send out for annotation. (Maybe you only want to annotate fun and light texts on Mondays and keep the difficult stuff for Wednesdays :wink: Prodigy itself is completely indifferent to that and will just ask you to annotate whatever your stream produces.)
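
For example, since the stream is just a generator of task dicts, composing it is plain Python. Here's a quick sketch – the two sources and the 70/30 split are placeholders for whatever logic you need:

```python
import random

def mixed_stream(fun_texts, difficult_texts):
    # both arguments are iterators of plain strings; pick the next
    # source based on any logic you like: previous answers, the day
    # of the week, search results, etc.
    while True:
        source = fun_texts if random.random() < 0.7 else difficult_texts
        yield {"text": next(source)}
```

Prodigy will simply keep requesting batches from whatever this generator produces.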

In theory, you could go as far as customising app.py to plug in your own logic and provide it via the endpoints that the web app interacts with. But I'm not sure this is really necessary here – if you start with a blank recipe and don't use any of the built-in models or active learning components, you can use all of Prodigy's scaffolding like the web app, web server, database and CLI, but still fully control what data goes in and what's done with the annotations you receive back.

The web app mostly interacts with Prodigy via two REST endpoints (see the sketch after the list):

  • /get_questions – Get a batch of batch_size examples from the stream. Called on load and whenever the queue is running low.
  • /give_answers – Send a batch of annotated examples back. Called periodically when enough annotations are collected, or when the user hits "Save" manually.
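
To make this concrete, here's a rough sketch of a client talking to those endpoints directly. It assumes Prodigy is serving on localhost:8080, and the payload keys are simplified assumptions, not a spec:

```python
import requests

BASE = "http://localhost:8080"

# fetch a batch of tasks, as the web app does on load and whenever
# its queue is running low
batch = requests.get(f"{BASE}/get_questions").json()

# annotate each task and send the batch back, as the web app does
# periodically or when the user hits "Save"
for task in batch["tasks"]:
    task["answer"] = "accept"  # or "reject" / "ignore"
requests.post(f"{BASE}/give_answers", json={"answers": batch["tasks"]})
```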

On the recipe side, those are implemented via the following two components (put together in a sketch after the list):

  • stream – A generator that yields examples based on any logic you need.
  • update – A function that receives a list of annotated examples and does something – e.g. updates a model, modifies the stream of examples based on the annotations, outputs stuff somewhere etc.
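
Put together, a minimal "blank" recipe wiring up both components could look like this – the recipe name, the placeholder texts and the print are all stand-ins for your own logic:

```python
import prodigy

@prodigy.recipe("custom-annotate")
def custom_annotate(dataset):
    def stream():
        # yield task dicts based on any logic you need; a static
        # list is just the simplest possible placeholder
        for text in ["First example", "Second example"]:
            yield {"text": text}

    def update(answers):
        # receives each batch sent back via /give_answers
        for eg in answers:
            print(eg["text"], eg["answer"])

    return {
        "dataset": dataset,   # where annotations are stored
        "stream": stream(),
        "update": update,
        "view_id": "text",    # plain text interface
    }
```

You'd start it like any other custom recipe, e.g. prodigy custom-annotate my_dataset -F recipe.py.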

You can also start the recipe and Prodigy server programmatically from within a Python script. So your custom app could have the user make a search, click around (whatever you need), fetch some data for annotation, start Prodigy and have the user annotate it. You could even make your generator return a "fake" annotation task that tells the user to readjust their search after X examples (or once a certain distribution of annotation tasks is received back). If you view Prodigy as more of an abstract framework that streams data through a web application, I think there are a lot of creative solutions and use cases you can come up with :blush:

Yes. That's exactly what I want to do.

How exactly do I start Prodigy and the recipe programmatically? By calling app.py:server?

Oh, it's prodigy.serve, right?

Yes, that's the exposed helper function. Usage is currently very simple (because we weren't sure if people really needed it and how complex it'd have to be), so you'll need to pass in all recipe arguments in order as positional arguments, or None if they're not set. We might change this for a future release and make it a little nicer. You can also check out the implementation in prodigy/__init__.py.
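
For example, a call for ner.make-gold might look like this – treat the argument order as an assumption and check the recipe's signature in your version:

```python
import prodigy

# recipe name first, then all recipe arguments in order; pass None
# for optional arguments you don't want to set (the number of trailing
# None values depends on the recipe)
prodigy.serve("ner.make-gold", "my_dataset", "en_core_web_sm",
              "/path/to/data.jsonl", None, None, None, None)
```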

You could also use a library like Fabric3 and write your own fabfiles with Prodigy commands and other stuff you want to trigger.
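
For instance, a minimal fabfile (assuming Fabric3's fabric.api, with placeholder recipe arguments) might be:

```python
# fabfile.py – run with "fab annotate"
from fabric.api import local, task

@task
def annotate():
    # dataset, model and source are placeholders for your own setup
    local("prodigy ner.make-gold my_dataset en_core_web_sm data.jsonl")
```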

I’ve been thinking about it, and here’s what I want.

I want to write a web application (in Flask, say). It performs various text-processing tasks on documents. A region of its UI is Prodigy’s ner.make-gold interface. At any time my application can load a single spaCy document into this UI, the user can change annotations as they see fit, hit save, and the changed spaCy document is returned to my application.

Is this possible? I’m not sure if the Prodigy web application is more of a push or a poll model.

As you know, I always love creative use cases of Prodigy, so let’s see! Your application could probably work like this: user triggers action → your app starts Prodigy with the custom stream → Prodigy app loads in the Prodigy panel → user annotates, saves and closes Prodigy → repeat.
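
As a very rough sketch of that flow, your app could kick off Prodigy in a background thread. The route, recipe name and response here are all hypothetical, and prodigy.serve blocks until the server stops, hence the thread:

```python
import threading
from flask import Flask, jsonify
import prodigy

app = Flask(__name__)

@app.route("/start-annotation")
def start_annotation():
    # "custom-annotate" is the hypothetical recipe sketched earlier,
    # taking only a dataset name
    threading.Thread(
        target=prodigy.serve,
        args=("custom-annotate", "my_dataset"),
        daemon=True,
    ).start()
    return jsonify({"status": "Prodigy started"})
```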

On each load, the Prodigy web app will make a GET request to the /project endpoint (fetching metadata like the config), followed by a GET request to /get_questions, which delegates to the recipe’s stream.

Instead of re-starting the server, you could also just add more examples to the stream (i.e. whichever source your generator consumes the tasks from) and have the user reload the Prodigy app. (Tip: it sometimes helps to wrap your stream in a while True loop to make it infinite and filter out already annotated tasks. This way, the user will always see what’s left in the stream until it’s all annotated – even if they reload their browser in between.)
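
That pattern could look something like this – a sketch that assumes the database's get_task_hashes method and the exposed set_hashes helper:

```python
from prodigy import set_hashes
from prodigy.components.db import connect

def infinite_stream(get_examples, dataset):
    db = connect()
    while True:
        # re-check on every pass so tasks annotated in the meantime
        # (or before a browser reload) drop out of the stream
        seen = set(db.get_task_hashes(dataset))
        for eg in get_examples():
            eg = set_hashes(eg)
            if eg["_task_hash"] not in seen:
                yield eg
```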

It’ll probably still be smoother if you use batches of at least a handful of examples at a time. You can experiment with a batch_size lower than the default of 10, but if there’s a way to batch up some tasks for the user to annotate before they open the web app, this would definitely make things a lot easier. If you actually want to use an active learning approach, or use the annotations to perform any meaningful updates, you’ll likely want to batch them up afterwards anyway.
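
If you do want to experiment with a smaller batch size, it can be overridden straight from the recipe's return dict – a fragment, continuing the sketch above:

```python
return {
    "dataset": dataset,
    "stream": stream(),
    "update": update,
    "view_id": "text",
    "config": {"batch_size": 5},  # overrides the default of 10
}
```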