I want to use Prodigy for AI-assisted labeling. Each image is pre-labeled with bounding boxes before being sent to the user.
This all works well, but if I stop the session and then restart it with feed_overlap set to false, my model has to label all the images that are already manually labeled, even though they are never shown to the user because of the dataflow.
This results in a several-minute loading page the first time a labeler opens the new session, and wastes compute resources.
Is there a way to avoid this? A way to update the image stream only AFTER it is decided that the image is not in the database and should be shown to the user?
Hi! The exclude logic that decides whether an example needs to be labelled works by comparing the hashes of a task in the incoming stream to the hashes of the examples that are already in the database. So the examples need to be created, and if your recipe is creating them on the fly, it has to recreate them, because there's no way to know whether something changed in the code, the model, the source data, etc.
If producing your examples is expensive, you could consider doing it as a preprocessing step that outputs JSONL and pre-hashes the examples (e.g. using set_hashes). Your recipe then only needs to load the JSONL records and check their existing hashes against the current state of the database, which should be super fast.
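As a rough sketch of that preprocessing step (here, detect_objects stands in for your detection model and the paths are placeholders; srsly is the serialization library Prodigy uses under the hood):

import srsly
from prodigy import set_hashes
from prodigy.components.loaders import Images

def preprocess(source, output_path):
    examples = []
    for eg in Images(source):
        # Hypothetical: run the expensive detection model once, up front
        eg["spans"] = detect_objects(eg["image"])
        # Pre-compute _input_hash and _task_hash so the recipe doesn't have to
        examples.append(set_hashes(eg))
    srsly.write_jsonl(output_path, examples)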
So, in other words, it is not possible to do postprocessing on examples that pass the feed_overlap = False filter.
This sounds like something that should be supported, but it also makes sense that it isn't, since the developer could alter the image itself, which would change the hash.
Ah no, that's not what I meant. This doesn't have anything to do with the feed_overlap filter; it's just that Prodigy needs to be able to tell whether an example your Python generator outputs is already in the database or not. And for that, the example needs to be created.
I'm not sure what could be postprocessed here, or maybe I'm misunderstanding the question?
I create the stream using the Images loader. This stream is already enough to determine whether an example is already in the database or not.
Then, using a generator, I enrich each image JSON (each example) with "spans" produced by an object detection algorithm.
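Roughly like this (a simplified sketch; model_predict_spans is a placeholder for the actual detection model):

def add_spans(stream):
    for eg in stream:
        # Run the object detection model on each image and attach
        # its predicted bounding boxes as "spans"
        eg["spans"] = model_predict_spans(eg["image"])
        yield eg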
If the first N images are already labeled, this results in the algorithm being run N times for nothing.
What I mean by postprocessing is being able to edit examples AFTER the feed_overlap filter has been applied, so that the algorithm is only run on images that are not already in the database.
In that case, you could just set the hashes right after you load your images, get all input hashes of the current dataset from the database, and filter out the images whose input hashes are already in the dataset. For example, something like this:
from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.loaders import Images

db = connect()

def get_stream():
    stream = Images(source)
    # Add _input_hash and _task_hash to each incoming example
    hashed_stream = (set_hashes(eg) for eg in stream)
    input_hashes = db.get_input_hashes(dataset)
    # Keep only images whose input hash isn't in the dataset yet
    filtered_stream = (eg for eg in hashed_stream if eg["_input_hash"] not in input_hashes)
    # etc.
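You could then plug your enrichment step in right at that point, so the model only ever runs on images that still need to be labelled. Something like this, with add_spans being a hypothetical generator that runs your detection model and attaches the predicted spans:

    # The model only ever sees images that passed the filter
    enriched_stream = add_spans(filtered_stream)
    return enriched_stream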