I have a custom recipe that does some data transformation (e.g. creating multiple examples from one input). When I have it print everything just before the yield statement, it seems to generate the examples I want, each with unique input and task hashes.
However, when I start annotating I see duplication. If I number the examples I see from the print just before yield task, starting from 1, the order in which I see them while annotating is:
With the default batch_size of 10, the pattern is 1, 2, 3, ..., 22, 23, 1, 2, ... (and I stopped tracking).
With batch_size 1, the pattern is 1, 2, 3, 1, 2, 3, 4, 5, 6, 4, 5, 6, ... (i.e. it seems every 3 examples are shown twice and then we move on to the next 3).
When I db-out a dataset generated this way, the resulting JSONL contains the duplicate entries: the whole lines are exactly the same, including the hashes (assuming the same annotation on both passes).
exclude_by is left at (or set to) task, which I thought should not allow the above to occur.
I'm using Prodigy 1.9.9, default DB.
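For anyone who wants to confirm the same symptom in their own export, here is a rough sketch of how the duplicates could be counted. It assumes the export is a JSONL file where each line carries a _task_hash key; the function name find_duplicate_tasks and the sample rows are made up for illustration and are not part of Prodigy.

```python
import json
from collections import Counter


def find_duplicate_tasks(jsonl_lines, key="_task_hash"):
    """Return {hash: count} for every hash that occurs more than once.

    `jsonl_lines` is any iterable of JSON strings, e.g. an open file
    produced by db-out. `find_duplicate_tasks` is a hypothetical helper.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)[key]] += 1
    return {h: n for h, n in counts.items() if n > 1}


# Made-up rows standing in for an exported dataset.
sample = [
    '{"text": "a", "_input_hash": 1, "_task_hash": 11, "answer": "accept"}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 22, "answer": "accept"}',
    '{"text": "a", "_input_hash": 1, "_task_hash": 11, "answer": "accept"}',
]

duplicates = find_duplicate_tasks(sample)
print(duplicates)
```

To run it on a real export, pass an open file handle instead of the sample list, e.g. `with open("export.jsonl") as f: find_duplicate_tasks(f)` (the file name is a placeholder).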
@geniki Could you share the code that generates the examples and your recipe config settings? It's otherwise a bit hard to help debug this, because there could be a lot of explanations.
@cgreco thanks, I missed your thread and the related one here: "Refresh browser fix with force_stream_order".
As suggested in that thread, I tried setting force_stream_order to False. That removes the duplication, but it turns off an equally important feature.
Unlike the others, I was not using a named session. I tried with a named session and got the same results.
@ines it's not always quick to create a new recipe with the same logic that is also shareable on a public forum. I thought confirming that I'm happy with the output of the recipe at its yield statement was a good checkpoint. I'll try to create a self-contained case, but at the moment it seems that the problem is more general, given the other threads.
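As a stricter version of the print checkpoint described above, the stream could be wrapped in a small generator that fails as soon as a hash repeats. This is only a sketch: check_unique_hashes is a made-up helper, and it assumes each task is a dict that already has a _task_hash key by the time it is yielded.

```python
def check_unique_hashes(stream, key="_task_hash"):
    """Yield tasks unchanged, raising if the same hash is seen twice.

    Hypothetical helper for debugging: it only checks what the recipe
    yields, not what is later sent to or received from the web app.
    """
    seen = set()
    for task in stream:
        task_hash = task[key]
        if task_hash in seen:
            raise ValueError(f"Duplicate {key} in stream: {task_hash}")
        seen.add(task_hash)
        yield task


# Made-up tasks standing in for the recipe's real output.
tasks = [
    {"text": "a", "_task_hash": 11},
    {"text": "b", "_task_hash": 22},
    {"text": "c", "_task_hash": 33},
]

checked = list(check_unique_hashes(tasks))
print(len(checked))
```

If this passes on the full stream and duplicates still show up while annotating, the repetition is being introduced after the recipe's yield, which is consistent with what I'm seeing.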
I was able to reproduce duplicate examples seen in the client using force_stream_order=True, and I'm debugging it.
Thanks for reporting, I'll update this thread once I've found the root cause and a fix.
Just released v1.9.10, which should fix the underlying problem with force_stream_order (explained in detail by @justindujardin in this post). The only case where a glitch may still be possible with the current implementation is if you hold down a hotkey and rapid-fire, but that should also be a pretty unusual scenario.