Struggling to create a multiple choice image classification

koAlech · March 25, 2019, 11:05pm

Basically, I want to create a multiple choice image classification almost identical to this demo: Prodigy Demo

I realized that this requires writing a custom recipe and I wrote something simple enough:

https://gist.github.com/koAlech/1bfeebbef7117d0d2530ff61f2dc1a5e

multi_image_classification.py

# coding: utf8
from __future__ import unicode_literals

import prodigy
from prodigy.components.loaders import Images
from prodigy.util import split_string
from prodigy.util import set_hashes
from prodigy.components.db import connect
import itertools

This file has been truncated.

I have one super frustrating issue where tasks are not being shown on page refresh in the same session. I have literally searched throughout the support site and found a dozen posts on this specific issue but just couldn't figure out how to solve it. Just to list a few:

I tried implementing an infinite stream with hash comparison, but I just can't seem to make it work.
Could someone share their custom recipe with a solution to this page refresh issue?
Or possibly share the custom recipe used for the multiple choice image demo?

Thanks,
Alik

ines · March 26, 2019, 11:13am

Hi! The it and tID indices make your code a little difficult to follow – but it looks like you’ve already solved the image choice part? Each example should have one or more "options" and each option should have an ID and a text. That should be all you need to make it render as an image with multiple choice options.
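A minimal sketch of that task shape, in plain Python so it runs without Prodigy. The helper name `add_options`, the labels, and the image paths are all made up for illustration; only the "options" list with an ID and a text per option comes from the description above.

```python
def add_options(stream, labels):
    """Attach the same multiple-choice options to every task in the stream."""
    for eg in stream:
        # Copy the task so the incoming dict is not mutated.
        task = dict(eg)
        task["options"] = [{"id": label, "text": label} for label in labels]
        yield task


# Placeholder tasks; in a recipe these would come from the image loader.
examples = [{"image": "cat.jpg"}, {"image": "dog.jpg"}]
tasks = list(add_options(examples, ["CAT", "DOG", "OTHER"]))
```

In a recipe, the same loop body could sit where the stream is created, before the hashes are assigned.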

The other thing you’re trying to do is loop over the examples over and over again until every example is in the database. Maybe it helps to break this down into steps. Fundamentally, you want to do three things:

Load your data and add the options. You can do this in the recipe, or once upfront and then save the data to JSONL (assuming you don’t want the options to change at runtime). You also want to assign hashes to each example so it’s easier to identify it later on (and so you don’t have to compare examples by things like the value of "image", which can easily get expensive).

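import prodigy
from prodigy.components.loaders import Images

def get_stream():
    # `source` is the recipe's source argument (where the images are loaded from)
    stream = Images(source)
    for eg in stream:
        # add the options here if needed..
        eg = prodigy.set_hashes(eg)
        yield eg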

Create an infinite loop (usually done with while True) and in each loop, get a new stream and also get the hashes of the examples in the database. Once every example has been sent out, it’ll loop over the stream again and only show you examples that aren’t in the database yet, and so on. If there’s any custom logic you want to use to decide whether an example should be sent out or not, you could also add that here.

from prodigy.components.db import connect

def get_stream_loop():
    db = connect()
    while True:
        stream = get_stream()
        hashes_in_dataset = db.get_task_hashes(dataset)
        for eg in stream:
            # Only send out task if its hash isn't in the dataset yet
            if eg["_task_hash"] not in hashes_in_dataset:
                yield eg

Make sure you don’t get stuck in an infinite loop if there’s nothing to annotate anymore. One straightforward way to do this is to keep track of whether the previous loop sent something out. If there wasn’t anything sent out, you’ll know that all examples are in the database and can break the loop.

from prodigy.components.db import connect

def get_stream_loop():
    db = connect()
    while True:
        stream = get_stream()
        hashes_in_dataset = db.get_task_hashes(dataset)
        yielded = False
        for eg in stream:
            # Only send out task if its hash isn't in the dataset yet
            if eg["_task_hash"] not in hashes_in_dataset:
                yield eg
                yielded = True
        if not yielded:
            break

If your stream isn’t super huge, you could also consider converting it to a list (e.g. by calling list around it). This makes it easier to work with, because you can use len and you’ll know when all examples are annotated. But of course, that approach would make things very slow if the stream is very large.
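A rough sketch of that list-based variant, written so it runs without Prodigy. `get_annotated_hashes` is a hypothetical callable standing in for `db.get_task_hashes(dataset)`, and the `saved` set only simulates what an annotator would save. Because the examples are a list, `len(remaining)` tells you how many are still open on each pass.

```python
def get_stream_loop(examples, get_annotated_hashes):
    """Re-send unannotated examples until every one is in the dataset.

    examples: task dicts that already carry a "_task_hash".
    get_annotated_hashes: hypothetical stand-in for db.get_task_hashes(dataset).
    """
    examples = list(examples)
    while True:
        done = set(get_annotated_hashes())
        remaining = [eg for eg in examples if eg["_task_hash"] not in done]
        if len(remaining) == 0:
            # Everything is in the dataset, so stop instead of looping forever.
            break
        for eg in remaining:
            yield eg


# Simulation: the answer for task 2 is "lost" the first time it is sent out.
saved = set()
tasks = [{"_task_hash": 1}, {"_task_hash": 2}, {"_task_hash": 3}]
sent = []
for eg in get_stream_loop(tasks, lambda: saved):
    sent.append(eg["_task_hash"])
    if eg["_task_hash"] != 2 or sent.count(2) > 1:
        saved.add(eg["_task_hash"])
```

In the simulation, task 2 is sent out a second time on the next pass, and the generator stops once all three hashes are saved.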

koAlech · March 29, 2019, 11:05am

Finally managed to make the infinite stream work.
Thanks so much @ines!

The last issue I’m trying to resolve now is supporting several annotators on the same stream with different sessions. Comparing against the database means that once one annotator saves their annotations, those examples become unavailable to the others. I’m trying to figure out how to extend this if so that it also checks which session the example was annotated in:

if eg["_task_hash"] not in hashes_in_dataset:

I thought maybe I could pass the session ID to prodigy.set_hashes() as one of the input_keys or task_keys (I’m not sure I understand the difference between the two), but I don’t know how to do it.

A last resort possibility would be to use separate datasets and have each annotator run on a different dataset.

ines · March 30, 2019, 12:03pm

If you're using the new named multi-user sessions and set "overlap": False, the stream will be sent out to the annotators so that no example gets annotated twice. I'm not sure if there are any unintended side-effects when combining it with custom logic, though.

But you might actually prefer working with separate datasets here, since it also makes it much easier to debug things (at least if you have a small-ish number of annotators). When a recipe starts, you'll then immediately know which annotator it is, and none of that will change throughout your custom logic.
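As a sketch of why separate datasets keep the logic simple: the hash filter from earlier stays unchanged, it just runs against that annotator's own dataset. The naming scheme `dataset_for`, the annotator names, and the `fake_db` dict are assumptions for illustration; in a real recipe the hashes would come from `db.get_task_hashes(dataset)`.

```python
def dataset_for(base_name, annotator):
    """Hypothetical naming scheme: one dataset per annotator."""
    return f"{base_name}-{annotator}"


def filter_stream(stream, hashes_in_dataset):
    """Skip tasks whose hash is already in this annotator's dataset."""
    seen = set(hashes_in_dataset)
    for eg in stream:
        if eg["_task_hash"] not in seen:
            yield eg


# Stand-in for the database: dataset name -> saved task hashes.
fake_db = {
    dataset_for("images", "alice"): {1, 2},
    dataset_for("images", "bob"): set(),
}
tasks = [{"_task_hash": h} for h in (1, 2, 3)]
for_alice = list(filter_stream(tasks, fake_db["images-alice"]))
for_bob = list(filter_stream(tasks, fake_db["images-bob"]))
```

Here one annotator's saved work never hides examples from the other, because each filter only sees its own dataset.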

On input_keys vs. task_keys: those are the keys already present on the task that determine how the hashes are generated. The input hash describes the original input (text, image) and the task hash describes the exact question (the input plus things like pre-highlighted spans). This is less important for fully manual tasks, but if you're running a recipe like ner.teach, you need a distinction between the initial incoming example (input) and the particular question with an entity suggested by the model (task). You also want the user to be able to see different questions about the same text, but not the same question on the same text.

By default, the input and task hashes are generated using Prodigy's default properties like "text" or "image". If you're using a custom recipe and custom data that specifies the text as "raw_text", you could add that to the task keys to make sure that it's taken into account when the hash is created.
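To make the input/task distinction concrete, here is a toy re-implementation. It is not Prodigy's hashing algorithm: `toy_hash`, `toy_set_hashes`, and the default key tuples are assumptions chosen for illustration. It only mirrors the idea that the input hash covers the raw input and the task hash covers the input plus the question-specific keys.

```python
import hashlib
import json

INPUT_KEYS = ("text", "image")
TASK_KEYS = ("spans", "label", "options")


def toy_hash(task, keys):
    """Hash only the given keys of a task (illustration, not Prodigy's algorithm)."""
    subset = {key: task[key] for key in keys if key in task}
    payload = json.dumps(subset, sort_keys=True).encode("utf8")
    return hashlib.sha256(payload).hexdigest()


def toy_set_hashes(task, input_keys=INPUT_KEYS, task_keys=TASK_KEYS):
    hashed = dict(task)
    hashed["_input_hash"] = toy_hash(task, input_keys)
    # The task hash covers the input keys plus the question-specific keys.
    hashed["_task_hash"] = toy_hash(task, tuple(input_keys) + tuple(task_keys))
    return hashed


# Same text, different question: same input hash, different task hash.
a = toy_set_hashes({"text": "Apple buys startup", "label": "ORG"})
b = toy_set_hashes({"text": "Apple buys startup", "label": "PRODUCT"})

# A custom "raw_text" key is ignored until it is added to the keys.
c = toy_set_hashes({"raw_text": "first"})
d = toy_set_hashes({"raw_text": "second"})
e = toy_set_hashes({"raw_text": "first"}, task_keys=TASK_KEYS + ("raw_text",))
f = toy_set_hashes({"raw_text": "second"}, task_keys=TASK_KEYS + ("raw_text",))
```

With the default keys, `c` and `d` collide on both hashes because "raw_text" is never looked at; once it is added to the task keys, `e` and `f` get different task hashes.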
