Thanks for your questions and for testing Prodigy!
Yes, we currently don't have an image model built in – but we're working on an implementation of that (probably using PyTorch). So for now, you'll need to plug in your own image model. You can probably use an existing open-source implementation – the most important thing is that a model exists, and that it predicts something (even if it's bad) and that it can be updated with examples.
In order to annotate with a model in the loop, you need the following:
- an `update` function that updates the model. It receives a list of annotated examples – dictionaries of the annotation tasks with an `answer` key that's either `"accept"`, `"reject"` or `"ignore"`. The update function usually returns the loss.
- a `progress` function that receives the loss and returns a float indicating the annotation progress. (In theory, this is not necessary – but there's currently a small bug in Prodigy which means that a custom update function also needs a custom progress function. This will be fixed in the next release.)
- a function that yields annotation examples in Prodigy's format, usually referred to as the `stream`. Your stream of examples can be wrapped by the model, so you'll be able to resort and rescore the stream based on the annotated tasks you receive. As the model learns things, you can show the annotator different examples.

Putting these pieces together, a recipe could look like this:
```python
import prodigy

@prodigy.recipe('image_recipe')
def image_recipe(dataset):
    model = load_my_model()    # your own model wrapper (see sketch below)
    stream = load_my_images()  # e.g. [{'image': 'a.jpg'}, {'image': 'b.jpg'}]
    return {
        'dataset': dataset,          # ID of dataset to store annotations
        'stream': model(stream),     # stream of examples, scored by the model
        'update': model.update,      # update the model
        'progress': model.progress,  # annotation progress
        'view_id': 'image'           # annotation interface
    }
```
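The `load_my_model()` call above is a placeholder – here's a rough sketch of what such a wrapper could look like, with the scoring and training logic stubbed out (the class name and its methods are made up for illustration):

```python
import random

class ImageModelWrapper:
    """Hypothetical wrapper exposing the hooks the recipe above expects."""

    def __init__(self, model):
        self.model = model
        self.loss = 0.0

    def __call__(self, stream):
        # Score incoming tasks so the stream can be resorted and rescored.
        for task in stream:
            task['score'] = self.predict(task)
            yield task

    def predict(self, task):
        # Placeholder – replace with your model's actual scoring logic.
        return random.random()

    def update(self, examples):
        # 'examples' is a list of annotated tasks, each with an 'answer' key
        # set to 'accept', 'reject' or 'ignore'. Update the model here and
        # return the loss.
        accepted = [eg for eg in examples if eg['answer'] == 'accept']
        rejected = [eg for eg in examples if eg['answer'] == 'reject']
        self.loss = self.train_on(accepted, rejected)  # placeholder training call
        return self.loss

    def train_on(self, accepted, rejected):
        # Placeholder for the actual parameter update.
        return 0.0

    def progress(self, loss):
        # Map the loss to a float between 0 and 1 for the progress bar.
        return min(1.0, max(0.0, 1.0 - loss))

def load_my_model():
    return ImageModelWrapper(model=None)  # plug in your real model here
```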
You can then load your recipe like this:
```bash
prodigy image_recipe my_dataset -F recipe.py
```
You can find more information on custom recipes in the PRODIGY_README.html you can download with Prodigy. Alternatively, you can also just stream in examples "statically", annotate them in order and then use the collected data to train your model from scratch. In this case, your recipe would only need a `dataset`, a `stream` and the `view_id` – see the sketch below.
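For example, a purely static recipe (no model in the loop) could be as minimal as this sketch, reusing the hypothetical `load_my_images()` loader from above:

```python
import prodigy

@prodigy.recipe('image_static')
def image_static(dataset):
    stream = load_my_images()  # hypothetical loader yielding {'image': ...} dicts
    return {
        'dataset': dataset,   # ID of dataset to store annotations
        'stream': stream,     # examples are shown in order, no resorting
        'view_id': 'image'    # annotation interface
    }
```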
This is something we'll be working on as part of the Prodigy annotation manager, which will be available as an add-on and will include functionality to manage multiple annotators.
One thing that's important to keep in mind: If you're collecting annotations with a model in the loop, your annotation session will always be somewhat stateful. The Prodigy server holds an instance of the model you're training in memory, and when you annotate a batch of examples, that model is updated and the stream of examples is resorted and rescored. Prodigy then returns a new batch of examples based on the updated predictions. If you have multiple users accessing the same session and annotating at different speeds etc., there's no easy way to reconcile the annotations, prevent the same task from getting annotated twice (and overwritten in the same dataset), and make sure the model learns what it's supposed to.
In the best case scenario, all annotators will be making similar decisions, and the model will be updated consistently. (We've tested this with NER and a bunch of people accessing the same instance of the Prodigy app – this worked okay.) In the worst case scenario, annotator A will move the model in one direction, and annotator B in a different one – making the results pretty useless. So for now, I'd recommend using one session and dataset per annotator, and combining them later for training.
This is a good idea!
The nice thing about Prodigy is that it's configurable with code and simple Python functions. So if you can load your images from AWS in Python, you can use them in Prodigy. You can find an example of the expected annotation task format in the README. Essentially, your loader needs to return an iterable of dictionaries that looks like this:
```python
stream = [
    {'image': 'file.jpg', 'label': 'SOME LABEL'},
    {'image': 'file2.jpg', 'label': 'SOME LABEL'}
]
```
The `image` can be a local file path, a URL (if your bucket is not public, you could also append your AWS access key) or a base64-encoded data URI (nice for testing, but not always a good idea for large projects, as you're essentially storing the entire image data in your database – this can get really big).
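As a sketch of the URL option (assuming boto3 and a private bucket – the bucket name, prefix and function name below are made up), a loader could yield pre-signed URLs like this:

```python
import boto3

def load_images_from_s3(bucket='my-bucket', prefix='images/'):
    """Yield Prodigy tasks with temporary pre-signed URLs for a private S3 bucket."""
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get('Contents', []):
        url = s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': obj['Key']},
            ExpiresIn=3600  # link stays valid for one hour
        )
        yield {'image': url}
```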
An idea to explore might be the `choice` interface. If the tasks and/or the options contain an image, they'll be rendered in the app. An example task could look like this:
```json
{
    "image": "reference.jpg",
    "label": "RELATED",
    "options": [
        {"id": 1, "image": "a.jpg"},
        {"id": 2, "image": "b.jpg"},
        {"id": 3, "image": "c.jpg"}
    ]
}
```
This would give you a reference image, plus three images to choose from. Just tried it locally with dummy data, and it rendered as expected in the app.
To use multiple choice instead of single choice, you can set `"choice_style": "multiple"` in the `'config'` returned by your recipe, or in your `prodigy.json`. When you select one or more options, their `id`s will be added to the annotated task as `"selected": [1, 3]`.
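For example, a choice-based recipe could look like this sketch (the `load_my_choice_tasks()` loader is hypothetical and would yield tasks like the one above):

```python
import prodigy

@prodigy.recipe('image_compare')
def image_compare(dataset):
    stream = load_my_choice_tasks()  # hypothetical loader yielding tasks with 'options'
    return {
        'dataset': dataset,
        'stream': stream,
        'view_id': 'choice',                    # render the options as choices
        'config': {'choice_style': 'multiple'}  # allow selecting more than one option
    }
```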
Sorry for the long info dump – I hope this is useful to you. Let us know if you have any questions!