I want to use Prodigy for AI-assisted labeling. Each image is pre-labeled with bounding boxes before being sent to the user.
This all works well, but if I stop the session and then restart it with feed_overlap set to false, my model has to label all the images that are already manually labeled, even though they are never shown to the user because of the dataflow.
This results in a several-minute loading page the first time a labeler opens the new session, and wastes compute resources.
Is there a way to avoid this? A way to update the image stream only AFTER it is decided that the image is not in the database and should be shown to the user?
Hi! The exclude logic that decides whether an example needs to be labelled works by comparing the hashes of a task in the incoming stream to the hashes of the examples that are already in the database. So the examples need to be created, and if your recipe is creating them on the fly, it has to recreate them, because there's no way to know whether something changed in the code, the model, the source data, etc.
If producing your examples is expensive, you could consider doing it as a preprocessing step that outputs JSONL and pre-hashes the examples (e.g. using set_hashes). Your recipe then only needs to load the JSONL records and check their existing hashes against the current state of the database, which should be super fast.
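As a rough sketch of that preprocessing step (here, detect_objects stands in for your detection model and the paths are placeholders; srsly is the serialization library Prodigy uses under the hood):

import srsly
from prodigy import set_hashes
from prodigy.components.loaders import Images

def preprocess(source, output_path):
    examples = []
    for eg in Images(source):
        # Hypothetical: run the expensive detection model once, up front
        eg["spans"] = detect_objects(eg["image"])
        # Pre-compute _input_hash and _task_hash so the recipe doesn't have to
        examples.append(set_hashes(eg))
    srsly.write_jsonl(output_path, examples)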
So, in other words, it is not possible to do postprocessing on examples that pass the feed_overlap = False filter.
This sounds like something that should be supported, but it also makes sense that it isn't, since the developer could alter the image itself, which would change the hash.
Ah no, that's not what I meant. This doesn't have anything to do with the feed_overlap filter; it's just that Prodigy needs to be able to tell whether an example your Python generator outputs is already in the database or not. And for that, the example needs to be created.
I'm not sure what could be postprocessed here, or maybe I'm misunderstanding the question?
I create the stream using the Images loader. This stream is already enough to determine whether an example is already in the database or not.
Then, using a generator, I enrich each image JSON (each example) with "spans" produced by an object detection algorithm.
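Roughly like this (a simplified sketch; model_predict_spans is a placeholder for the actual detection model):

def add_spans(stream):
    for eg in stream:
        # Run the object detection model on each image and attach
        # its predicted bounding boxes as "spans"
        eg["spans"] = model_predict_spans(eg["image"])
        yield eg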
If the first N images are already labeled, this results in the algorithm being run N times for nothing.
What I mean by postprocessing is being able to edit examples AFTER the feed_overlap filter has been applied, so that the algorithm is only run on images that are not already in the database.
In that case, you could just set the hashes right after you load your images, get all input hashes of the current dataset from the database, and filter out the images whose input hashes are already in the dataset. For example, something like this:
from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.loaders import Images

db = connect()

def get_stream():
    stream = Images(source)
    # Add _input_hash and _task_hash to each incoming example
    hashed_stream = (set_hashes(eg) for eg in stream)
    input_hashes = db.get_input_hashes(dataset)
    # Keep only images whose input hash isn't in the dataset yet
    filtered_stream = (eg for eg in hashed_stream if eg["_input_hash"] not in input_hashes)
    # etc.
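You could then plug your enrichment step in right at that point, so the model only ever runs on images that still need to be labelled. Something like this, with add_spans being a hypothetical generator that runs your detection model and attaches the predicted spans:

    # The model only ever sees images that passed the filter
    enriched_stream = add_spans(filtered_stream)
    return enriched_stream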