This is likely because the example only includes 3 annotation tasks, i.e. only one batch. When you load Prodigy for the first time, it’ll fetch the first batch from the server. On second load, there’s no second batch available anymore, so you see the “No tasks available” message.
To avoid this and make Prodigy keep yielding tasks until they're all annotated, you can wrap the streaming logic in a `while True` loop:

```python
def get_stream(stream):
    while True:
        for task in stream:
            yield task
```
Maybe this should be mentioned somewhere in the docs – I left it out of that particular example to keep it simple and not distract from the other, more important aspects of the workflow.
I have the exact same issue with `image.manual`, but the provided snippet does not solve the problem for me (unless I'm using it wrong). When the page is refreshed a few times and it says "No tasks available", the input stream to this function (which comes from `preprocess.fetch_images` or `loader.get_stream` in the case of `image.manual`) has actually reached its end. This means that `get_stream` in the snippet just keeps spinning in the `while True` loop without yielding anything.
Now, if I change the snippet to re-create the stream once it's fully traversed, as follows, it does not stop even after all of the images have been traversed:

```python
def wrap_stream(source, api, loader):
    while True:
        stream = get_stream(source, api=api, loader=loader, input_key='image')
        stream = fetch_images(stream)
        for task in stream:
            yield task
```
I'm still trying to fix this, but would also appreciate any pointers.
Ah yes – sorry, I forgot to actually add logic to refresh the stream generator.
At any time in your loop, you can break – but you have to decide when the stream is actually "done". When you're sending out the new questions, you don't always know what answers were "lost" and what answers are still being answered. Maybe the annotator is just taking a while and still has the questions in the queue – in that case, you might not want to send them out again immediately.
A very basic solution would be to check the hashes of the incoming examples against the hashes already present in the current dataset and break if all hashes are covered. You might find this thread useful, which explains a lot of this in more detail and has some examples:
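A minimal sketch of that break condition, with plain-Python stand-ins (`make_stream`, `get_done_hashes` and the `answered` set are hypothetical; in a real recipe the hashes would come from `db.get_task_hashes(dataset)` and the task's `_task_hash` field) so only the loop/break logic is shown:

```python
def stream_until_done(make_stream, get_done_hashes):
    # Keep serving tasks until every task hash is present in the dataset.
    # make_stream: callable returning a fresh iterator of task dicts
    # get_done_hashes: callable returning the hashes already annotated
    while True:
        done = set(get_done_hashes())            # e.g. db.get_task_hashes(dataset)
        pending = [t for t in make_stream() if t["_task_hash"] not in done]
        if not pending:
            break                                # all hashes covered: stream is done
        for task in pending:
            yield task

# Stand-in for the database: answered hashes accumulate over time.
answered = set()
tasks = [{"_task_hash": h} for h in (1, 2, 3)]

gen = stream_until_done(lambda: iter(tasks), lambda: answered)
first = next(gen)            # task 1 goes out
answered.update({1, 2, 3})   # pretend all answers have come back
leftover = list(gen)         # finishes the current pass, then breaks
```

Note that the current pass still sends out the remaining queued tasks before the check runs again, which is exactly the overlap between "in flight" and "lost" questions described above.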
Thank you for the quick response and the reference to the thread, very informative!
The approach of checking the hashes in the incoming stream against the hashes in the dataset makes sense to me. However, one thing still confuses me:
I'm using `prodigy.components.loaders.get_stream` to create my stream. `get_stream` starts from the unannotated examples when the Prodigy process is first started (meaning it does not show the already annotated images again when I restart the process). That's why I was thinking it might already include the hash-checking logic. Is that actually the case?
Because of that behaviour, I was hoping that by simply re-creating the stream with another call to `get_stream`, I'd be able to fetch only the unannotated examples. What am I missing here?
Btw, one small thing I forgot to mention: It’s probably not viable to call db.get_task_hashes (and make a request to the database) within the for loop, e.g. for each task. So you might want to add logic that only calls it every X seconds or every X tasks.
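One throttled variant, again with stand-ins (`fetch_done_hashes` plays the role of a `db.get_task_hashes(dataset)` call, and `every` is a made-up parameter), just to show the "only every X tasks" idea:

```python
def stream_with_periodic_check(make_stream, fetch_done_hashes, every=10):
    # Refresh the set of annotated hashes only every `every` tasks,
    # instead of hitting the database once per task.
    done = set(fetch_done_hashes())
    while True:
        sent_any = False
        for i, task in enumerate(make_stream()):
            if i % every == 0:
                done = set(fetch_done_hashes())  # throttled lookup
            if task["_task_hash"] in done:
                continue
            sent_any = True
            yield task
        if not sent_any:
            break  # a full pass produced nothing new: stop

# Stand-in "database" of annotated hashes
answered = {2}
tasks = [{"_task_hash": h} for h in (1, 2, 3)]

gen = stream_with_periodic_check(lambda: iter(tasks), lambda: answered, every=2)
out = next(gen)              # hash 1 goes out (2 is already annotated)
answered.update({1, 3})      # answers arrive while annotating
rest = list(gen)             # the next refresh sees everything is done
```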
You could also move the hash checking logic into the update callback of your recipe that’s executed every time new answers are received (before they are stored in the database). The stream generator can respond to external state – for example, a global IS_DONE that is set to True by your update callback as soon as the new incoming hashes + the existing hashes = all hashes. In that case, the stream loop would break and stop sending out questions. With a low batch_size, you should get very little to no overlap here.
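A sketch of that pattern with the recipe wiring stubbed out (`ALL_HASHES`, the task dicts and the direct `update()` call are all illustrative; in a real recipe, `update` would be returned as the recipe's `"update"` component and called by Prodigy with each batch of answers):

```python
ALL_HASHES = {1, 2, 3}   # every task hash in the source data (illustrative)
seen_hashes = set()      # hashes already answered
IS_DONE = False          # external state the stream generator reacts to

def update(answers):
    # Executed every time new answers come back, before they're stored.
    global IS_DONE
    seen_hashes.update(a["_task_hash"] for a in answers)
    if seen_hashes >= ALL_HASHES:
        IS_DONE = True   # all hashes covered: tell the stream to stop

def stream(tasks):
    while not IS_DONE:
        for task in tasks:
            if IS_DONE:
                return
            if task["_task_hash"] not in seen_hashes:
                yield task

tasks = [{"_task_hash": h} for h in (1, 2, 3)]
gen = stream(tasks)
first = next(gen)        # task 1 goes out
update(tasks)            # pretend all three answers came back at once
rest = list(gen)         # generator sees IS_DONE and stops
```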
By default, it doesn’t – but you should be able to set rehash=True when calling get_stream!