Struggling to create a multiple choice image classification

Hi! The it and tID indices make your code a little difficult to follow, but it looks like you’ve already solved the image choice part? Each example should have one or more "options", and each option should have an "id" and a "text". That should be all you need to make it render as an image with multiple choice options.
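
To make the expected task format concrete, here is a minimal sketch of adding options to a stream in plain Python. The add_options helper, the label names and the file names are made up for illustration; only the "options", "id" and "text" keys come from the description above.

```python
def add_options(stream, labels):
    """Attach the same multiple-choice options to every task in the stream."""
    options = [{"id": label, "text": label.title()} for label in labels]
    for eg in stream:
        # Copy the task so the incoming dicts are not modified in place
        task = dict(eg)
        task["options"] = [dict(opt) for opt in options]
        yield task


# Hypothetical tasks: with an image loader, "image" would come from your data
examples = [{"image": "cat.jpg"}, {"image": "dog.jpg"}]
tasks = list(add_options(examples, ["cat", "dog", "other"]))
print(tasks[0])
```

Every task then carries its own "options" list, so you could also vary the options per example if you ever need to.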

The other thing you’re trying to do is loop over the examples over and over again until every example is in the database. Maybe it helps to break this down into steps. Fundamentally, you want to do three things:

  1. Load your data and add the options. You can do this in the recipe, or once upfront and then save the data to JSONL (assuming you don’t want the options to change at runtime). You should also assign hashes to each example so it’s easier to identify later on (and so you don’t have to compare examples by things like the value of "image", which can easily get expensive).
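import prodigy
from prodigy.components.loaders import Images

def get_stream():
    # source is the path to your images (e.g. a recipe argument)
    stream = Images(source)
    for eg in stream:
        # add the options here if needed..
        # set_hashes adds "_input_hash" and "_task_hash" to the example
        eg = prodigy.set_hashes(eg)
        yield eg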
  2. Create an infinite loop (usually done with while True) and in each iteration, get a new stream and also get the hashes of the examples in the database. Once the stream is exhausted, it’ll loop over the stream again and only show you examples that aren’t yet in the database, and so on. If there’s any custom logic you want to use to decide whether an example should be sent out or not, you could also add that here.
from prodigy.components.db import connect

def get_stream_loop():
    db = connect()
    while True:
        stream = get_stream()
        # dataset is the name of the dataset you're annotating into
        hashes_in_dataset = set(db.get_task_hashes(dataset))
        for eg in stream:
            # Only send out task if its hash isn't in the dataset yet
            if eg["_task_hash"] not in hashes_in_dataset:
                yield eg
  3. Make sure you don’t get stuck in an infinite loop if there’s nothing to annotate anymore. One straightforward way to do this is to keep track of whether the previous loop sent something out. If there wasn’t anything sent out, you’ll know that all examples are in the database and can break the loop.
from prodigy.components.db import connect

def get_stream_loop():
    db = connect()
    while True:
        stream = get_stream()
        # dataset is the name of the dataset you're annotating into
        hashes_in_dataset = set(db.get_task_hashes(dataset))
        yielded = False
        for eg in stream:
            # Only send out task if its hash isn't in the dataset yet
            if eg["_task_hash"] not in hashes_in_dataset:
                yield eg
                yielded = True
        if not yielded:
            break
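
If you want to convince yourself that the loop terminates, here is a rough simulation of the same logic without Prodigy. Everything in it is a stand-in: the saved set plays the role of the database, fake_stream replaces the image loader, and the "annotator" only manages to save every second task it receives, so some tasks have to be sent out again.

```python
def loop_until_annotated(get_stream, get_hashes):
    """Keep re-sending tasks until every task hash shows up in get_hashes()."""
    while True:
        # Take a snapshot of the saved hashes once per pass
        seen = set(get_hashes())
        yielded = False
        for eg in get_stream():
            if eg["_task_hash"] not in seen:
                yield eg
                yielded = True
        # Nothing was sent out in this pass, so everything is saved
        if not yielded:
            break


saved = set()  # stand-in for the hashes stored in the database


def fake_stream():
    # Stand-in for the image loader: five tasks with made-up hashes
    for i in range(5):
        yield {"image": f"img_{i}.jpg", "_task_hash": i}


sent = []
for n, eg in enumerate(loop_until_annotated(fake_stream, lambda: saved)):
    sent.append(eg["_task_hash"])
    # Simulate an annotator whose answer only gets saved every second time
    if n % 2 == 0:
        saved.add(eg["_task_hash"])

print(sent)
print(sorted(saved))
```

The first pass sends out all five tasks, later passes only send the ones that still aren’t saved, and the generator stops once a full pass sends nothing.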

If your stream isn’t super huge, you could also consider converting it to a list (e.g. by calling list() on it). This makes it easier to work with, because you can use len and you’ll know when all examples are annotated. But of course, that approach loads everything into memory upfront, so it would make things very slow if the stream is very large.
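
A list-based version could look roughly like this. It’s only a sketch: get_stream_from_list and get_hashes are hypothetical names, and get_hashes stands in for whatever returns the task hashes currently in your dataset.

```python
def get_stream_from_list(examples, get_hashes):
    """Re-send unannotated tasks from an in-memory list until none remain."""
    while True:
        seen = set(get_hashes())
        remaining = [eg for eg in examples if eg["_task_hash"] not in seen]
        # With a list, len() tells us directly whether anything is left
        if len(remaining) == 0:
            break
        yield from remaining


# Made-up examples and an in-memory stand-in for the database
examples = [{"image": f"img_{i}.jpg", "_task_hash": i} for i in range(3)]
saved = set()

sent = []
for eg in get_stream_from_list(examples, lambda: saved):
    sent.append(eg["_task_hash"])
    saved.add(eg["_task_hash"])  # pretend every task is annotated and saved

print(sent)
```

Because remaining is a list, you could also log len(remaining) on each pass to see how many examples are still open.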