Refresh browser fix with force_stream_order

Hi @ines Regarding the Example 1 problem, I think I have ruled out my hypothesis that the small dataset is the cause. Even if I create a new dataset and then add a couple of the annotations from the previous dataset, the problem still occurs. It definitely has something to do with the annotations from the old dataset. If I start off with a new dataset and perform some annotations, it works correctly, but as soon as I add some of the annotations from the old dataset to the new dataset using db.add_examples(old_dataset[:2], datasets=("new_dataset",)), it breaks again.

However, not all of the old annotations cause problems. A couple of times when I added old annotations, nothing broke. So I have no idea what the problem is, but I hope this narrows it down for you.

FYI, I ported over the posts from the other thread to keep the discussion in one place (otherwise it's harder to keep track of the reports, updates etc.)

Here's @justindujardin's latest update again so it doesn't get buried:

Next, we'll investigate the issue with existing datasets and why some examples in the pre-existing dataset cause examples to be re-sent, while others don't :thinking:

Thanks for your continued efforts on this bug! I wanted to let you know I'm encountering duplicates even on a fresh dataset, so it doesn't seem to be isolated to just pre-existing datasets. I'm setting force_stream_order=True and feed_overlap=True. I do have annotators working at the same time though if that's related.

@justindujardin when's the next release?

The fix is under review, and I don't know when the next release will be available. Also, I'd like to not get in the habit of committing to dates on the forums. :bowing_man:

I expect it will be ready in about as much time as it takes for other bug fixes. We'll be sure to update this thread when it's released. :slight_smile:

@snd507 From another thread, can I ask why you thought that? Is it the usage of feed_overlap=False that gives you that expectation?

One behavior change we are looking at for the combination of force_stream_order=True and feed_overlap=False is to have the server filter out the duplicates, only accepting the first annotation for an input. This would lead to there being only one of each example in the database, but would mean that only the first annotation is accepted, even if the second one was better. Would this behavior meet your expectations?
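To make the proposed "first annotation wins" behavior concrete, here is a minimal sketch in plain Python. It is not Prodigy's server code: the function name `keep_first_answers` is hypothetical, and it assumes each answer is a dict carrying an `_input_hash` key, as in the task format used elsewhere in this thread.

```python
from typing import Dict, Iterable, List


def keep_first_answers(answers: Iterable[dict]) -> List[dict]:
    """Keep only the first answer received for each _input_hash.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    first_by_input: Dict[int, dict] = {}
    for answer in answers:
        key = answer["_input_hash"]
        # A later answer for the same input is dropped,
        # even if it would have been the better annotation.
        if key not in first_by_input:
            first_by_input[key] = answer
    return list(first_by_input.values())


answers = [
    {"_input_hash": 1, "answer": "reject"},
    {"_input_hash": 2, "answer": "accept"},
    {"_input_hash": 1, "answer": "accept"},  # replayed task, answered again
]
saved = keep_first_answers(answers)
```

In this sketch the second answer for input 1 never reaches `saved`, which is the trade-off described above: one entry per input, at the cost of discarding any later correction.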

Yes, ideally we wouldn't want replayed tasks, but that won't be an issue if only the first annotation is saved and subsequent answers are ignored if they already exist.

I got lost - is there a config that works as expected on 1.9.10?

I just did the following:

  1. Started annotating a new dataset using a custom recipe in a named session. force_stream_order: True
  2. Annotated several hundred examples without any problems.
  3. Realised I need to make a small change to my recipe that does not affect the input hash specified in exclude_by
  4. Saved the annotations in the Prodigy UI
  5. Stopped server, modified the recipe, re-started the server
  6. Examples seem to continue from where I left off, as expected, so I continue annotating
  7. The 24th example, as counted under "This Session" in the Prodigy UI, is the same as the first one in this session, i.e. the one where I continued at step 6.

I tried to keep at least 1s between pressing/clicking Accepts - I don't know what value here makes a difference.

@snd507 @cgreco @geniki @dshefman @Kairine Thanks for all of your reports! Before the next release, I'd like to get some verification that the latest fixes resolve your problems. Since we don't have clear reproduction cases for each of your problems, I'd like to ask you to test out a preview build of v1.10. If you're willing to help out, send me a private message or let me know here and I'll send you a link. Thanks! :bowing_man:

@justindujardin Sure thing.

@justindujardin I'd be happy to help.

@justindujardin Sure, let me know if I can still help!

No problem at all!

Re :smile: So I tested the latest release v1.9.10 and the beta v1.10. Here's some feedback:

1/ With both versions, Prodigy stops sending tasks once a batch is finished. I had to refresh the browser in order for it to fetch new ones. Is this expected behavior? It's not an important matter; I just want to be sure, because in previous versions this transition was automatic.

2/ With v1.9.10, I ran the following experiment: the built-in pos.correct recipe, 60 examples, 2 sessions, batch size set to 10, feed_overlap=false and force_stream_order=true. The expected scenario was for session1 to start with example1 and session2 with example11. It still didn't work: both sessions started with example1. I clicked through both sessions, switching between them every once in a while, and ended up with 63 annotations in the database. Duplicates still exist but are indeed rarer than before.
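The expected scenario in this experiment can be modeled in a few lines of plain Python. This is only a toy model of the expectation (one shared stream, each session pulling the next unseen batch), not Prodigy's actual feed implementation; `next_batch` and the session names are made up for illustration.

```python
from itertools import islice
from typing import Dict, Iterator, List


def next_batch(stream: Iterator[str], batch_size: int) -> List[str]:
    """Pull the next batch from a stream shared by all sessions."""
    return list(islice(stream, batch_size))


# 60 examples and a batch size of 10, as in the experiment above.
examples = [f"example{i}" for i in range(1, 61)]
shared_stream = iter(examples)

# Without overlap, each session consumes from the same iterator,
# so no example is handed out twice.
first_batches: Dict[str, List[str]] = {}
for session in ("session1", "session2"):
    first_batches[session] = next_batch(shared_stream, 10)
```

Under this model, session1 receives example1 through example10 and session2 starts at example11, which is the behavior that was expected but not observed on v1.9.10.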

3/ With v1.10beta I set up the exact same workflow. There were still some duplicates in the tasks, but none of them were saved into the database. I had exactly 60 annotations. For me this did solve the problem. We might have spent a little extra time when annotating, but the results are clean. Great workaround!

Great to hear about the 1.10 release. Is this bug fixed there? Based on @Kairine's report, it seems not?

This is not the expected behavior and hasn't been reported by other users that tested 1.10. If you have to refresh the page to get each batch of questions, that's a bug, and you should create a new thread with reproduction info so we can help you resolve it.

You should not need to put any specific delay between answering questions with the latest version.

Please try the build for yourself and see if your specific problems are solved. The original issue in this thread has been solved and confirmed by multiple users. If you have another issue, please open a new thread and include reproduction steps. :pray:

@snd507, @dshefman, @cgreco, @Kairine thanks for helping test the preview build and for confirming that the latest version mitigates the duplicates that were ending up in your db when using force_stream_order.

The duplication error with named sessions (?session=user) still happens!
Using config "force_stream_order": True, "feed_overlap": False

custom recipe: (very basic simple recipe) stream is a generator yielding one example at a time.

I don't want to have to write workarounds for something that should be basic functionality
[feed_overlap bug?]
Version 1.10.1

Hi @snu-ceyda, welcome! Sorry to hear you're having trouble, please try the latest version 1.10.2, which was released today.

Unfortunately I still get duplicates with v1.10.2, even after adding hashes to my stream, which yields objects like the following:

{'text': 'jasdnja  njkadf.',
'image': 'smt.jpg',
'_input_hash': -2106113696,
'_task_hash': -427767885,
'meta': {'pattern': '2197'}}

The hashes are correctly generated.

@snu-ceyda thanks for the follow-up. Since you're using a custom recipe it's a bit hard to tell where things are going wrong. I'm sure it's a simple recipe as you say, but can you provide me more information to help understand it?

What kind of tasks are you annotating? Are they text-based, images, or both? Are you using the prefer_uncertain, prefer_high_scores or prefer_low_scores functions in your recipe?

Also, when you see duplicate tasks shown in the frontend, do multiple entries end up in the database after answering them? You can test this by annotating a few duplicates, then saving and exporting the database table to see if the answers are there twice.
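One way to run that check is to count how often each hash appears in the exported annotations. The sketch below assumes the dataset was exported as JSONL (for example with `prodigy db-out`) and that each line carries a `_task_hash`; the helper name `duplicated_hashes` and the sample lines are made up for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def duplicated_hashes(
    jsonl_lines: Iterable[str], key: str = "_task_hash"
) -> Dict[int, int]:
    """Return {hash: count} for every hash that appears more than once."""
    counts = Counter(
        json.loads(line)[key] for line in jsonl_lines if line.strip()
    )
    return {h: n for h, n in counts.items() if n > 1}


# Made-up export lines standing in for a real JSONL file.
sample_export = [
    '{"text": "a", "_input_hash": 1, "_task_hash": 10, "answer": "accept"}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 20, "answer": "accept"}',
    '{"text": "a", "_input_hash": 1, "_task_hash": 10, "answer": "reject"}',
]
dupes = duplicated_hashes(sample_export)
```

With a real export, an open file handle can be passed in place of `sample_export`, and `key="_input_hash"` checks for repeated inputs instead of repeated tasks. An empty result means no answer was stored twice.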

EDIT: I found an issue where duplicates were not filtered if you use exclude_by=input with 1.10.2. If you're willing to try out a beta build that fixes this issue to see if it solves your problem, send me an email at justin@explosion.ai